[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082790#comment-16082790 ]

Andrian Jardan commented on SPARK-6962:
----------------------------------------

We're also facing this issue on 1.6. Are there any plans to solve it? Can we help?

> Netty BlockTransferService hangs in the middle of SQL query
> ------------------------------------------------------------
>
> Key: SPARK-6962
> URL: https://issues.apache.org/jira/browse/SPARK-6962
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Reporter: Jon Chase
> Attachments: jstacks.txt
>
> Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances.
> This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started. Setting the service to nio allows queries to complete normally.
> I do not see this problem when running queries over smaller (~20 5MB files) datasets. When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely.
> Here's the email chain regarding this issue, including stack traces:
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com
> For context, here's the announcement regarding the block transfer service change:
> http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com
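For reference, a minimal sketch of how the workaround described above can be applied. The property name and value ({{spark.shuffle.blockTransferService=nio}}) come from the report itself; the surrounding driver setup and app name are only illustrative.

{code:scala}
// Illustrative driver setup: switch the shuffle block transfer service
// from the default "netty" to "nio" (the workaround reported in this issue).
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-nio-workaround") // hypothetical app name
  .set("spark.shuffle.blockTransferService", "nio")

val sc = new SparkContext(conf)
{code}

The same property can equally be passed as {{--conf spark.shuffle.blockTransferService=nio}} on spark-submit or set in spark-defaults.conf.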
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005851#comment-15005851 ]

Romi Kuntsman commented on SPARK-6962:
----------------------------------------

What's the status of this? Something similar happens to me in 1.4.0 and also in 1.5.1: the job hangs forever on the largest shuffle. When I increase the number of partitions (as a function of the data size), the issue goes away.
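As an aside, a minimal sketch of the kind of partition scaling Romi describes. The idea of sizing shuffle partitions by input volume is his; the helper names, the 128 MB target, and the 200-partition floor below are assumptions for illustration only.

{code:scala}
// Hypothetical helper: pick spark.sql.shuffle.partitions from the estimated input size,
// rather than leaving it at the default, so the largest shuffle gets more partitions.
import org.apache.spark.sql.SQLContext

def scaledShufflePartitions(inputBytes: Long,
                            targetPartitionBytes: Long = 128L * 1024 * 1024): Int =
  math.max(200, (inputBytes / targetPartitionBytes).toInt)

def applyPartitioning(sqlContext: SQLContext, inputBytes: Long): Unit =
  sqlContext.setConf("spark.sql.shuffle.partitions",
    scaledShufflePartitions(inputBytes).toString)
{code}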
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633005#comment-14633005 ]

Reynold Xin commented on SPARK-6962:
----------------------------------------

[~jonchase] Do you still see the problem on 1.4 or in the master branch?
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633016#comment-14633016 ]

Jon Chase commented on SPARK-6962:
----------------------------------------

I'll check tomorrow on 1.4.0.
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502270#comment-14502270 ]

Aaron Davidson commented on SPARK-6962:
----------------------------------------

I created SPARK-7003 to track a fix for the potential problem I noted, with a PR to follow: https://github.com/apache/spark/pull/5584

If you are able to pull in that patch, it may either fix the issue you're seeing (by retrying in the event of network faults) or at least fail after a few minutes rather than hanging indefinitely -- either result would be interesting.
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499308#comment-14499308 ]

Aaron Davidson commented on SPARK-6962:
----------------------------------------

Executor logs in particular. Are all remaining tasks hanging, and on all different machines? Similar to what Patrick said, if there's an asymmetry between the machines it could suggest that one has stopped responding and everyone else is waiting on it. It's possible that only one executor is behaving erratically, though it's also abnormal that the connection didn't simply time out after a while and the task get retried.
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500224#comment-14500224 ]

Michael Allman commented on SPARK-6962:
----------------------------------------

[~ilikerps] Okay. We're still in the process of our Spark 1.3 migration. Once that's complete I will run some test queries and check the executor logs. Should I set the log level to debug, or is that too noisy?

Also, I forgot to mention here that we seem to have found an effective workaround: setting spark.shuffle.blockTransferService to nio rather than the default netty. Two other members of the mailing list have confirmed that this works for them.
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500406#comment-14500406 ]

Jon Chase commented on SPARK-6962:
----------------------------------------

I'm tailing the executor logs before/as this is happening and I don't see anything out of the ordinary (errors, etc.). Here's what the logs look like when the lockup occurs (again, nothing out of the ordinary). I tailed all executors, and all of the logs look similar to this:

== /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-XX-XX-XX-XXX.eu-west-1.compute.internal.log ==
2015-04-17 18:27:58,206 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:01,214 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:04,221 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
2015-04-17 18:28:07,229 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 11216 for container-id container_1429189930421_0012_01_02: 6.7 GB of 10 GB physical memory used; 11.3 GB of 50 GB virtual memory used
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500419#comment-14500419 ]

Jon Chase commented on SPARK-6962:
----------------------------------------

Looking at the UI when the lock up occurs, I see that every executor has 4 active tasks. It's not the case that, say, only a single executor has a task running - they all appear to be busy while locked up.
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500429#comment-14500429 ]

Jon Chase commented on SPARK-6962:
----------------------------------------

Here's the stderr from the executors at the time of the lock up (there are 3 executors). 18:26:00 is when the lockup happened, and after 20+ minutes, these are still the most recent logs in executor 1:

15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1132
15/04/17 18:26:00 INFO executor.Executor: Running task 110.0 in stage 15.0 (TID 1132)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 3 ms
15/04/17 18:26:00 INFO executor.Executor: Finished task 107.0 in stage 15.0 (TID 1129). 8325 bytes result sent to driver
15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1133
15/04/17 18:26:00 INFO executor.Executor: Running task 111.0 in stage 15.0 (TID 1133)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 2 ms

Here's executor 2, it doesn't have any activity for about 20 minutes (again, the lockup happened at ~18:26:00):

15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 11 ms
15/04/17 18:25:49 INFO executor.Executor: Finished task 13.0 in stage 15.0 (TID 1035). 12013 bytes result sent to driver
15/04/17 18:25:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1068
15/04/17 18:25:49 INFO executor.Executor: Running task 46.0 in stage 15.0 (TID 1068)
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 16 ms
15/04/17 18:41:19 WARN server.TransportChannelHandler: Exception in connection from /10.106.144.109:49697
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at java.lang.Thread.run(Thread.java:745)
15/04/17 18:41:27 WARN server.TransportChannelHandler: Exception in connection from /10.106.145.10:38473
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at java.lang.Thread.run(Thread.java:745)

Same with executor 3:

15/04/17 18:25:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1092
15/04/17 18:25:52 INFO
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501054#comment-14501054 ]

Aaron Davidson commented on SPARK-6962:
----------------------------------------

Thanks for those log excerpts. It is likely significant that each IP appeared exactly once in a connection exception among the executors. Given this warning, but no corresponding {{Still have X requests outstanding when connection from 10.106.143.39 is closed}} error, I would also be inclined to deduce that only the TransportServer side of the socket is timing out, and that for some reason the connection exception is not reaching the client side of the socket (which would have caused the outstanding fetch requests to fail promptly). If this situation can arise, then each client could be waiting indefinitely for some other server to respond, which it never will.

Is your cluster in any sort of unusual network configuration? Even so, this could only explain why the hang is indefinite, not why all communication is paused for 20 minutes leading up to it.

To further diagnose this, it would be very useful if you could turn on TRACE-level debugging for org.apache.spark.storage.ShuffleBlockFetcherIterator and org.apache.spark.network (this should look like {{log4j.logger.org.apache.spark.network=TRACE}} in the log4j.properties).
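For completeness, a sketch of what that log4j.properties change might look like. The two logger names are the ones Aaron lists; the file location (typically conf/log4j.properties) and the idea of leaving the existing root logger and appender lines untouched are assumptions.

{code}
# Assumed fragment for conf/log4j.properties (keep the existing rootCategory/appender settings):
log4j.logger.org.apache.spark.storage.ShuffleBlockFetcherIterator=TRACE
log4j.logger.org.apache.spark.network=TRACE
{code}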
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501063#comment-14501063 ]

Aaron Davidson commented on SPARK-6962:
----------------------------------------

I have a hypothesis that the above is caused by assumptions we make about the eventuality/symmetry of socket timeouts that are not guaranteed in arbitrary network topologies. If this is the case, though, then I would also expect nio to have intermittent failures, though it could at least recover from them.

A potential fix would be a timer thread in [TransportResponseHandler|https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java] which would require that there is some message received every N seconds (the default network timeout in Spark is 120 seconds) as long as there is some outstanding request. This should be fairly robust due to our use of retries on IOExceptions in [RetryingBlockFetcher|https://github.com/apache/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java].
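Below is a rough, hypothetical sketch (in Scala, outside of Spark's actual classes) of the watchdog idea Aaron describes. It is not the real patch: the class, its method names, and the wiring are invented for illustration; the real change would live inside TransportResponseHandler and close the Netty channel so that RetryingBlockFetcher's IOException retry path takes over.

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

// Hypothetical watchdog: while requests are outstanding, require that *some* response
// arrives every `timeoutMs`; otherwise invoke `onTimeout` (e.g. close the connection
// and fail outstanding fetches so they can be retried).
class ResponseActivityWatchdog(onTimeout: () => Unit, timeoutMs: Long = 120000L) {
  private val outstanding  = new AtomicInteger(0)
  private val lastActivity = new AtomicLong(System.currentTimeMillis())
  private val scheduler    = Executors.newSingleThreadScheduledExecutor()

  def requestSent(): Unit = {
    lastActivity.set(System.currentTimeMillis())
    outstanding.incrementAndGet()
  }

  def responseReceived(): Unit = {
    lastActivity.set(System.currentTimeMillis())
    outstanding.decrementAndGet()
  }

  // Periodic check: outstanding requests plus no activity for longer than the timeout
  // means the peer is presumed dead, rather than waiting on it indefinitely.
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      val idleMs = System.currentTimeMillis() - lastActivity.get()
      if (outstanding.get() > 0 && idleMs > timeoutMs) onTimeout()
    }
  }, timeoutMs, timeoutMs, TimeUnit.MILLISECONDS)
}
{code}

A caller would invoke requestSent()/responseReceived() around each fetch; the point of the sketch is only that indefinite waits become bounded failures, which the existing retry logic can then handle.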
[jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498991#comment-14498991 ]

Michael Allman commented on SPARK-6962:
----------------------------------------

[~adav] Which logs would be helpful?

[~pwend...@gmail.com] I've seen this problem occur where a stage is hung waiting for multiple tasks from more than one executor to complete. Also, the GC time reported for the blocked tasks is insignificant, or at least nothing odd compared to the other tasks. Additionally, I see no unusual CPU usage or load level. The tasks seem to be simply idle, waiting for some never-to-be-received input. I also see the same thread stack trace as the OP (the thread whose stack includes the line org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:278)). I think that signal can be used to distinguish this hang from others.

I've also just confirmed with [~rxin] on the mailing list that I'm still seeing this problem on branch-1.3 as of https://github.com/apache/spark/commit/6d3c4d8b04b2738a821dfcc3df55a5635b89e506.