[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265194#comment-14265194 ]

Stephen Haberman commented on SPARK-5085:
-----------------------------------------

I think I've found a good clue -- the kernel log on the slaves is full of this:

{code}
[ 2527.610744] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2555.073922] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2606.652438] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2615.427676] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2626.008450] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2742.996355] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2744.434263] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2744.623440] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2745.204023] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2745.470360] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2745.517182] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 2745.616516] xen_netfront: xennet: skb rides the rocket: 19 slots
... (same message repeated dozens more times) ...
[ 3763.997229] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 3776.472560] xen_netfront: xennet: skb rides the rocket: 19 slots
[ 3783.343165] xen_netfront: xennet: skb rides the rocket: 19 slots
{code}
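A quick way to check whether a slave is hitting this (a sketch; the grep pattern is taken from the log above, the watch interval is arbitrary):

{code}
# count how many times the xen_netfront warning has hit the kernel ring buffer
dmesg | grep -c "skb rides the rocket"

# or keep an eye on it while a shuffle-heavy stage runs
watch -n 10 'dmesg | grep "skb rides the rocket" | tail -n 5'
{code}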
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265472#comment-14265472 ]

Stephen Haberman commented on SPARK-5085:
-----------------------------------------

Looks like this is probably a known EC2/Ubuntu/Linux/Xen issue:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
http://www.spinics.net/lists/netdev/msg282340.html

I'm trying to run with those flags (tso/sg) off to see if that fixes it.


netty shuffle service causing connection timeouts
-------------------------------------------------

                 Key: SPARK-5085
                 URL: https://issues.apache.org/jira/browse/SPARK-5085
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.2.0
         Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0

Here's our spark-defaults:
{code}
spark.master spark://$MASTER_IP:7077
spark.eventLog.enabled true
spark.eventLog.dir /mnt/spark/work/history
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory ${EXECUTOR_MEM}m
spark.core.connection.ack.wait.timeout 600
spark.storage.blockManagerSlaveTimeoutMs 6
spark.shuffle.consolidateFiles true
spark.shuffle.service.enabled false
spark.shuffle.blockTransferService nio # works with nio, fails with netty
# Use snappy because LZF uses ~100-300k buffer per block
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb 10
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati...
spark.akka.logLifecycleEvents true
spark.akka.timeout 360
spark.akka.askTimeout 120
spark.akka.lookupTimeout 120
spark.akka.frameSize 100
spark.files.userClassPathFirst true
spark.shuffle.memoryFraction 0.5
spark.storage.memoryFraction 0.2
{code}

            Reporter: Stephen Haberman

In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works.

Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts):

{code}
// lots of log output, doing fine...
15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}}
15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170}
15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750
15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056
// note 15m of silence here...
15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
{code}
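The tso/sg flags mentioned in the comment above can be inspected before toggling them; a minimal sketch, assuming the primary interface is eth0 and that ethtool is installed:

{code}
# show the current offload settings that matter here
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|scatter-gather'
# typical output when both are still enabled (exact labels vary by driver/ethtool version):
#   scatter-gather: on
#   tcp-segmentation-offload: on
{code}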
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265642#comment-14265642 ]

Stephen Haberman commented on SPARK-5085:
-----------------------------------------

I've confirmed that adding:

{code}
sudo ethtool -K eth0 tso off
sudo ethtool -K eth0 sg off
{code}

to our cluster setup script fixed the issue, and the job now runs perfectly.

(I had initially tried only tso off; that did not fix it, but tso off + sg off did. I have not tried only sg off: from running ethtool, my naive/quick interpretation was that sg off implies tso off, but don't hold me to that.)

Closing this ticket.
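As a follow-up, a sketch of how the workaround could be applied cluster-wide at boot; the interface name and applying it on every node are assumptions, and the setup script itself is hypothetical:

{code}
#!/bin/bash
# Hypothetical node-setup step: disable TSO and scatter-gather before Spark starts,
# to work around the xen_netfront "skb rides the rocket" packet drops.
IFACE=eth0   # assumption: primary interface on these EC2 instances
sudo ethtool -K "$IFACE" tso off
sudo ethtool -K "$IFACE" sg off
{code}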
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264196#comment-14264196 ]

Stephen Haberman commented on SPARK-5085:
-----------------------------------------

There is no interesting output in the driver logs. Everything was fine until 5:33; then tasks just stop getting completed across the entire cluster:

{code}
// lots of logs...
15/01/03 05:33:37 INFO [sparkDriver-akka.actor.default-dispatcher-15] scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Starting task 231.0 in stage 5.0 (TID 7310, ip-10-169-175-179.ec2.internal, PROCESS_LOCAL, 2004 bytes)
15/01/03 05:33:37 INFO [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 209.0 in stage 5.0 (TID 7288) in 26069 ms on ip-10-169-175-179.ec2.internal (152/2639)
{code}

There are several stages before this stage 5.0 that involve shuffles as well, so this is not the first shuffle done by the job.

The executor thread dumps don't show anything interesting either: all of the task threads are blocked in ShuffleBlockFetcherIterator, where it's trying to take blocks from the LinkedBlockingQueue.

Here are a few netstats from before, right before, and after the timeouts:

{code}
$ cat netstat-0531.txt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 ip-10-181-127-41.ec2.:51718 ip-10-169-82-122:cslistener ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.in:ssh ip-10-169-82-122.ec2.:39880 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:49745 ip-10-71-184-33.ec2.i:53275 ESTABLISHED
tcp        0    134 ip-10-181-127-41.ec2.:36031 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:44171 ip-10-181-127-41.ec2.:39134 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-63-193-69.ec2.i:49596 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:60981 ip-10-63-193-69.ec2.i:35531 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:36594 ip-10-169-82-122.ec2.:36365 ESTABLISHED
tcp       38      0 ip-10-181-127-41.ec2.:48338 s3-1-w.amazonaws.com:https  CLOSE_WAIT
tcp        0      0 ip-10-181-127-41.ec2.:38358 ip-10-169-82-122.ec2.:swa-1 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-13-165-77.ec2.i:44291 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:36030 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:36028 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-169-175-179.ec2:39157 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:35941 ip-10-169-247-194.ec2:55602 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:44389 ip-10-31-61-144.ec2.i:59883 ESTABLISHED
tcp       38      0 ip-10-181-127-41.ec2.:48342 s3-1-w.amazonaws.com:https  CLOSE_WAIT
tcp        0      0 ip-10-181-127-41.ec2.:39794 ip-10-169-82-122.ec2.:52518 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-69-85-11.ec2.in:41533 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:37274 ip-10-13-165-77.ec2.i:56072 ESTABLISHED
tcp        0    134 ip-10-181-127-41.ec2.:36032 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:36029 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp       38      0 ip-10-181-127-41.ec2.:48339 s3-1-w.amazonaws.com:https  CLOSE_WAIT
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-71-184-33.ec2.i:56943 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:56283 ip-10-169-82-12:pcsync-http TIME_WAIT
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-33-166-218.ec2.:46591 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:39778 ip-10-69-85-11.ec2.in:34222 ESTABLISHED
tcp       38      0 ip-10-181-127-41.ec2.:48340 s3-1-w.amazonaws.com:https  CLOSE_WAIT
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-181-61-165.ec2.:41291 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:48805 ip-10-181-61-165.ec2.:57986 ESTABLISHED
tcp  2255316      0 ip-10-181-127-41.ec2.:36024 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:39134 ip-10-181-127-41.ec2.:44171 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:36023 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp   367803      0 ip-10-181-127-41.ec2.:36026 s3-1-w.amazonaws.com:https  ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:45560 ip-10-33-166-218.ec2.:42942 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-169-247-194.ec2:42871 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:35301 ip-10-31-61-144.ec2.i:41293 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:36581 ip-10-169-82-122.ec2.i:7077 ESTABLISHED
tcp        0      0 ip-10-181-127-41.ec2.:38684 ip-10-169-175-179.ec2:45582 ESTABLISHED
{code}
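Snapshots like the one above can be captured with a small loop on a slave while the job runs; a sketch, where the file naming just mirrors netstat-0531.txt and the interval is arbitrary:

{code}
#!/bin/bash
# Grab a TCP connection snapshot every minute so the state before/around the hang can be compared.
while true; do
  netstat -t > "netstat-$(date +%H%M).txt"
  sleep 60
done
{code}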