Re: Connection closed error while running Terasort

2015-09-03 Thread Akhil Das
Can you look a bit deeper into the executor logs? It may be that the executors
hit GC overhead (or a similar issue), which led to the connection failures.
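
For example, assuming YARN log aggregation is enabled, you could pull the
aggregated executor logs for the application (the application id below is taken
from the block manager paths in your log) and grep for GC/OOM messages,
something like:

yarn logs -applicationId application_1441064487503_0001 | grep -iE "GC overhead|OutOfMemoryError"

If it does turn out to be GC pressure or slow executors, one thing to try (just
a sketch; the exact settings depend on your Spark version) is enabling GC
logging on the executors and raising the network timeout from its 120s default,
e.g.:

spark-submit --class com.github.ehiggs.spark.terasort.TeraSort --master yarn-cluster \
  --num-executors 10 --executor-memory 32g \
  --conf spark.network.timeout=300 \
  --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  spark-terasort-master/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
  hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/input-10 \
  hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/output-10

The 300 (seconds) value is only an example; whether raising the timeout helps or
merely hides a GC problem depends on what the executor logs show.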

Thanks
Best Regards

On Tue, Sep 1, 2015 at 5:43 AM, Suman Somasundar <
suman.somasun...@oracle.com> wrote:

> Hi,
>
>
>
> I am getting the following error while trying to run a 10GB terasort under
> Yarn with 8 nodes.
>
> The command is:
>
> spark-submit --class com.github.ehiggs.spark.terasort.TeraSort --master
> yarn-cluster --num-executors 10 --executor-memory 32g
> spark-terasort-master/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
> hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/input-10
> hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/output-10
>
>
>
> What might be causing this error?
>
>
>
> 15/08/31 17:09:48 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019052,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
> offset=0, length=1059423784}} to /199.199.35.5:52486; closing connection
>
> java.io.IOException: Broken pipe
>
> at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
>
> at
> sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:443)
>
> at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:575)
>
> at
> org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96)
>
> at
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:89)
>
> at
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:237)
>
> at
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:233)
>
> at
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:264)
>
> at
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
>
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:321)
>
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:519)
>
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 15/08/31 17:10:48 ERROR server.TransportChannelHandler: Connection to
> hadoop-solaris-c/199.199.35.4:48540 has been quiet for 12 ms while
> there are outstanding requests. Assuming connection is dead; please adjust
> spark.network.timeout if this is wrong.
>
> 15/08/31 17:10:48 ERROR client.TransportResponseHandler: Still have 1
> requests outstanding when connection from hadoop-solaris-c/
> 199.199.35.4:48540 is closed
>
> 15/08/31 17:10:48 INFO shuffle.RetryingBlockFetcher: Retrying fetch (3/3)
> for 1 outstanding blocks after 5000 ms
>
> 15/08/31 17:10:49 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019053,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/1b/shuffle_1_6_0.data,
> offset=0, length=1052128440}} to /199.199.35.6:45201; closing connection
>
> java.nio.channels.ClosedChannelException
>
> 15/08/31 17:10:53 INFO client.TransportClientFactory: Found inactive
> connection to hadoop-solaris-c/199.199.35.4:48540, creating a new one.
>
> 15/08/31 17:11:31 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019054,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/1b/shuffle_1_6_0.data,
> offset=0, length=1052128440}} to /199.199.35.10:55082; closing connection
>
> java.nio.channels.ClosedChannelException
>
> 15/08/31 17:11:31 ERROR server.TransportRequestHandler: Error sending
> result
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019055,
> chunkIndex=0},
> buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
> offset=0, length=1059423784}} to /199.199.35.7:54328; closing connection
>
> java.nio.channels.ClosedChannelException
>
> 15/08/31 17:11:53 ERROR server.TransportRequestHandler: Error sending
> result
> 

Connection closed error while running Terasort

2015-08-31 Thread Suman Somasundar
Hi,

 

I am getting the following error while trying to run a 10GB terasort under Yarn 
with 8 nodes.

The command is:  

spark-submit --class com.github.ehiggs.spark.terasort.TeraSort --master 
yarn-cluster --num-executors 10 --executor-memory 32g 
spark-terasort-master/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
 hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/input-10 
hdfs://hadoop-solaris-a:8020/user/hadoop/terasort/output-10

 

What might be causing this error?

 

15/08/31 17:09:48 ERROR server.TransportRequestHandler: Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019052, 
chunkIndex=0}, 
buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
 offset=0, length=1059423784}} to /199.199.35.5:52486; closing connection

java.io.IOException: Broken pipe

at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)

at 
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:443)

at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:575)

at 
org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96)

at 
org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:89)

at 
io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:237)

at 
io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:233)

at 
io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:264)

at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)

at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:321)

at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:519)

at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)

at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)

at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)

at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)

at java.lang.Thread.run(Thread.java:745)

15/08/31 17:10:48 ERROR server.TransportChannelHandler: Connection to 
hadoop-solaris-c/199.199.35.4:48540 has been quiet for 12 ms while there 
are outstanding requests. Assuming connection is dead; please adjust 
spark.network.timeout if this is wrong.

15/08/31 17:10:48 ERROR client.TransportResponseHandler: Still have 1 requests 
outstanding when connection from hadoop-solaris-c/199.199.35.4:48540 is closed

15/08/31 17:10:48 INFO shuffle.RetryingBlockFetcher: Retrying fetch (3/3) for 1 
outstanding blocks after 5000 ms

15/08/31 17:10:49 ERROR server.TransportRequestHandler: Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019053, 
chunkIndex=0}, 
buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/1b/shuffle_1_6_0.data,
 offset=0, length=1052128440}} to /199.199.35.6:45201; closing connection

java.nio.channels.ClosedChannelException

15/08/31 17:10:53 INFO client.TransportClientFactory: Found inactive connection 
to hadoop-solaris-c/199.199.35.4:48540, creating a new one.

15/08/31 17:11:31 ERROR server.TransportRequestHandler: Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019054, 
chunkIndex=0}, 
buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/1b/shuffle_1_6_0.data,
 offset=0, length=1052128440}} to /199.199.35.10:55082; closing connection

java.nio.channels.ClosedChannelException

15/08/31 17:11:31 ERROR server.TransportRequestHandler: Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019055, 
chunkIndex=0}, 
buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
 offset=0, length=1059423784}} to /199.199.35.7:54328; closing connection

java.nio.channels.ClosedChannelException

15/08/31 17:11:53 ERROR server.TransportRequestHandler: Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1867783019056, 
chunkIndex=0}, 
buffer=FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1441064487503_0001/blockmgr-c3c8dbb3-9ae2-4e45-b537-fd0beeff98b5/3e/shuffle_1_9_0.data,
 offset=0, length=1059423784}} to /199.199.35.5:50573; closing connection

java.nio.channels.ClosedChannelException

15/08/31 17:12:54 ERROR