When you saw this error, did any executor die from some other error first? Did you check whether any executors restarted during your job? It is hard to help with just the stack trace; you need to give us the whole picture of how your jobs are running.

Yong
From: qhz...@apache.org
Date: Tue, 15 Sep 2015 15:02:28 +0000
Subject: Re: application failed on large dataset
To: user@spark.apache.org

Has anyone met the same problem?

周千昊 <qhz...@apache.org> wrote on Monday, September 14, 2015 at 9:07 PM:

Hi, community,

I am facing a strange problem: all executors stop responding, and then all of them fail with ExecutorLostFailure. When I look into the YARN logs, they are full of this exception:

15/09/14 04:35:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 3 retries)
java.io.IOException: Failed to connect to host/ip:port
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: host/ip:port
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more

The strange thing is that if I reduce the input size, the problem just disappears. I found a similar issue in the mail archive (http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3CCAOHP_tHRtuxDfWF0qmYDauPDhZ1=MAm5thdTfgAhXDN=7kq...@mail.gmail.com%3E), but I didn't see a solution there. I am wondering if anyone could help with this.

My env is:
    HDP 2.2.6
    Spark 1.4.1
    mode: yarn-client
spark-conf:
    spark.driver.extraJavaOptions      -Dhdp.version=2.2.6.0-2800
    spark.yarn.am.extraJavaOptions     -Dhdp.version=2.2.6.0-2800
    spark.executor.memory              6g
    spark.storage.memoryFraction       0.3
    spark.dynamicAllocation.enabled    true
    spark.shuffle.service.enabled      true
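For readers following along: those settings would normally live in conf/spark-defaults.conf. A minimal annotated sketch is below (the file path and comments are illustrative; the values are the ones from the mail). One thing worth noting is that with dynamic allocation and the external shuffle service both enabled, shuffle blocks are served by a service inside each YARN NodeManager rather than by the executors themselves, which is relevant to where a "Connection refused" on block fetch could come from.

    # conf/spark-defaults.conf -- sketch of the configuration from the mail above
    # HDP version stamp required for the driver and YARN AM JVMs on HDP 2.2
    spark.driver.extraJavaOptions      -Dhdp.version=2.2.6.0-2800
    spark.yarn.am.extraJavaOptions     -Dhdp.version=2.2.6.0-2800
    spark.executor.memory              6g
    spark.storage.memoryFraction       0.3
    # With these two on, shuffle files are served by the external shuffle
    # service (the spark_shuffle aux-service in the NodeManager), so a
    # refused connection during fetch can also mean that service is
    # unreachable on the remote host, not only that the executor died.
    spark.dynamicAllocation.enabled    true
    spark.shuffle.service.enabled      true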