Can you look a bit deeper into the executor logs? It may be that an executor
hit GC overhead (or similar JVM memory pressure), which led to the connection
failures.
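
If it helps, this is roughly how I would check (a sketch only; the application
id below is the one from your log, adjust paths and flags for your setup):

    # Pull the aggregated container logs from YARN and scan for GC/OOM messages
    yarn logs -applicationId application_1441064487503_0001 > app.log
    grep -iE "GC overhead|OutOfMemoryError" app.log

    # Optionally re-run with GC logging enabled on the executors
    spark-submit --class com.github.ehiggs.spark.terasort.TeraSort --master yarn-cluster \
      --num-executors 10 --executor-memory 32g \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      ...

Long GC pauses on an executor can make the peer look dead and trigger exactly
the kind of "connection has been quiet for 120000 ms" errors you are seeing.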

Thanks
Best Regards

On Tue, Sep 1, 2015 at 5:43 AM, Suman Somasundar <
suman.somasun...@oracle.com> wrote:

> Hi,
>
>
>
> I am getting the following error while trying to run a 10GB terasort under
> Yarn with 8 nodes.
>
> The command is:
>
> spark-submit --class com.github.ehiggs.spark.terasort.TeraSort --master
> yarn-cluster --num-executors 10 --executor-memory 32g
> spark-terasort-master/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
> hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/input-10
> hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/output-10
>
>
>
> What might be causing this error?
>
>
>
> 15/08/31 17:09:48 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019052,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
> offset=0, length=1059423784}} to /199.199.35.5:52486; closing connection
>
> java.io.IOException: Broken pipe
>
>         at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
>
>         at
> sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:443)
>
>         at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:575)
>
>         at
> org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96)
>
>         at
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:89)
>
>         at
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:237)
>
>         at
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:233)
>
>         at
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:264)
>
>         at
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
>
>         at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:321)
>
>         at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:519)
>
>         at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>
>         at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>
>         at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>
>         at java.lang.Thread.run(Thread.java:745)
>
> 15/08/31 17:10:48 ERROR server.TransportChannelHandler: Connection to
> hadoop-solaris-c/199.199.35.4:48540 has been quiet for 120000 ms while
> there are outstanding requests. Assuming connection is dead; please adjust
> spark.network.timeout if this is wrong.
>
> 15/08/31 17:10:48 ERROR client.TransportResponseHandler: Still have 1
> requests outstanding when connection from hadoop-solaris-c/
> 199.199.35.4:48540 is closed
>
> 15/08/31 17:10:48 INFO shuffle.RetryingBlockFetcher: Retrying fetch (3/3)
> for 1 outstanding blocks after 5000 ms
>
> 15/08/31 17:10:49 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019053,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/1b/shuffle_1_6_0.data,
> offset=0, length=1052128440}} to /199.199.35.6:45201; closing connection
>
> java.nio.channels.ClosedChannelException
>
> 15/08/31 17:10:53 INFO client.TransportClientFactory: Found inactive
> connection to hadoop-solaris-c/199.199.35.4:48540, creating a new one.
>
> 15/08/31 17:11:31 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019054,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/1b/shuffle_1_6_0.data,
> offset=0, length=1052128440}} to /199.199.35.10:55082; closing connection
>
> java.nio.channels.ClosedChannelException
>
> 15/08/31 17:11:31 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019055,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
> offset=0, length=1059423784}} to /199.199.35.7:54328; closing connection
>
> java.nio.channels.ClosedChannelException
>
> 15/08/31 17:11:53 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019056,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
> offset=0, length=1059423784}} to /199.199.35.5:50573; closing connection
>
> java.nio.channels.ClosedChannelException
>
> 15/08/31 17:12:54 ERROR server.TransportChannelHandler: Connection to
> hadoop-solaris-c/199.199.35.4:48540 has been quiet for 120000 ms while
> there are outstanding requests. Assuming connection is dead; please adjust
> spark.network.timeout if this is wrong.
>
> 15/08/31 17:12:54 ERROR client.TransportResponseHandler: Still have 1
> requests outstanding when connection from hadoop-solaris-c/
> 199.199.35.4:48540 is closed
>
> 15/08/31 17:12:54 ERROR shuffle.RetryingBlockFetcher: Failed to fetch
> block shuffle_1_7_7, and will not retry (3 retries)
>
> java.io.IOException: Connection from hadoop-solaris-c/199.199.35.4:48540
> closed
>
>         at
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>
>         at
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:738)
>
>         at
> io.netty.channel.AbstractChannel$AbstractUnsafe$6.run(AbstractChannel.java:606)
>
>         at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
>
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>
>         at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>
>         at java.lang.Thread.run(Thread.java:745)
>
> 15/08/31 17:12:54 ERROR storage.ShuffleBlockFetcherIterator: Failed to get
> block(s) from hadoop-solaris-c:48540
>
> java.io.IOException: Connection from hadoop-solaris-c/199.199.35.4:48540
> closed
>
>         at
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>
>         at
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
>
>         at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
>
>         at
> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:738)
>
>         at
> io.netty.channel.AbstractChannel$AbstractUnsafe$6.run(AbstractChannel.java:606)
>
>         at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
>
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>
>         at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>
>         at java.lang.Thread.run(Thread.java:745)
>
> 15/08/31 17:12:54 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019057,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/1b/shuffle_1_6_0.data,
> offset=0, length=1052128440}} to /199.199.35.6:45044; closing connection
>
> java.nio.channels.ClosedChannelException
>
>
>
>
>
> Thanks,
> Suman.
>
