Hi Zhiliang,

Yes, finding the exact reason for the failure is very difficult. We had an 
issue with similar behavior; because we had limited time to investigate, we 
reduced the amount of data processed, and the problem went away.

Some points that may help your investigation:

·         If you start spark-history-server (or monitor the running application 
on port 4040), look into the failed stages (if any). By default Spark retries 
a failed task several times (spark.task.maxFailures, default 4) before it 
fails the job

·         Some useful information may be found in the YARN logs on the Hadoop 
nodes (yarn-<user>-nodemanager-<host>.log), but these only record that a 
container was killed, not why the stage used so much memory
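
To narrow such logs down, a small shell helper can pull out the lines that 
usually accompany a killed container or a lost executor. This is only a 
sketch: the function name find_kill_hints and the search patterns are my own 
assumptions, not an exhaustive list of what YARN or Spark print, and app.log 
is a placeholder file name.

```shell
# Hypothetical helper: print (with line numbers) the lines of a saved log
# that often accompany a container or executor being killed.
# The pattern list is an assumption; extend it for your cluster.
find_kill_hints() {
  grep -n -i -E \
    'killed|beyond (physical|virtual) memory limits|exceeding memory limits|executor .* lost|lost executor' \
    "$1"
}

# Usage (the application id is a placeholder):
#   yarn logs -applicationId <application id> > app.log
#   find_kill_hints app.log
```

The same function can be pointed at the nodemanager log or at the driver's 
own output file.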

As far as I can see in your logs, the failed step relates to a shuffle 
operation. Could you change your job to avoid the massive shuffle?
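
If the shuffle itself cannot be avoided, a commonly tried alternative is to 
give each executor more headroom and to split the shuffle into more, smaller 
partitions. The sketch below reuses the submit command quoted further down in 
this thread; the three added values (--executor-memory, 
spark.yarn.executor.memoryOverhead, spark.default.parallelism) are 
illustrative assumptions, not tuned recommendations. 
spark.yarn.executor.memoryOverhead is the Spark 1.x property name; later 
releases call it spark.executor.memoryOverhead.

```shell
# Illustrative values only (assumptions); size them against the YARN
# container limits of your cluster.
#   --executor-memory                    heap per executor
#   spark.yarn.executor.memoryOverhead   extra off-heap room per container, in MB
#   spark.default.parallelism            default partition count for RDD shuffles
nohup spark-submit \
  --driver-memory 20g \
  --num-executors 20 \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.default.parallelism=400 \
  --class com.dianrong.Main \
  --master yarn-client \
  dianrong-retention_2.10-1.0.jar \
  doAnalysisExtremeLender /tmp/drretention/test/output 0.96 50 > log_file
```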

WBR, Alexander

From: Zhiliang Zhu<mailto:zchl.j...@yahoo.com.INVALID>
Sent: 17 June 2016 14:10
To: User<mailto:user@spark.apache.org>
Subject: Re: spark job automatically killed without rhyme or reason

 Hi Alexander,
Is your "yarn userlog" just the executor log?
From those logs it seems difficult to pinpoint exactly what went wrong, 
because a successful job may sometimes show some of these errors too and then 
recover by itself. Spark does not seem that stable currently ...
Thank you in advance~

    On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu <zchl.j...@yahoo.com> wrote:

 Hi Alexander,
Thanks a lot for your reply.
Yes, it was submitted via YARN. Do you mean the executor log file obtained 
with yarn logs -applicationId id?
In this file, some containers' stdout and stderr both contain:
16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection to ip-172-31-20-104/, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to ip-172-31-20-104/
<------ Might this be because Spark is not stable, and can Spark recover by 
itself from these kinds of errors? (I saw some in successful runs.)
Caused by: java.net.ConnectException: Connection refused: ip-172-31-20-104/
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at ...
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at ...

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 16777216 bytes, TID = 100323
<------ Could this be a memory leak issue? No GC exception was thrown, 
though, as there would be for the usual kinds of out-of-memory errors.
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 112.0 (TID 100323)
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
        at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at ...
Sorry, those messages come from the middle of the log file, but everything is 
okay at the end of the log.

The run log (log_file) was generated by this command:
nohup spark-submit --driver-memory 20g --num-executors 20 --class com.dianrong.Main --master yarn-client dianrong-retention_2.10-1.0.jar doAnalysisExtremeLender /tmp/drretention/test/output 0.96 50 > log_file

It contains:
executor 40 lost
<------ Could the job sometimes fail because of this?
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at ...

Thanks in advance!

    On Friday, June 17, 2016 3:52 PM, Alexander Kapustin <kp...@hotmail.com> wrote:

Hi,

Did you submit the Spark job via YARN? In some cases (probably memory 
configuration), YARN can kill the containers in which Spark tasks are 
executed. In this situation, please check the YARN userlogs for more 
information…

--
WBR, Alexander

From: Zhiliang Zhu
Sent: 17 June 2016 9:36
To: Zhiliang Zhu; User
Subject: Re: spark job automatically killed without rhyme or reason

Has anyone ever met a similar problem? It is quite strange ...

On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

Hi All,
I have a big job that takes more than one hour to run in full. However, it 
unreasonably exits midway (about 80% of the job has actually finished, but 
not all of it), without any apparent error or exception in the log.
I have submitted the same job many times, and the result is always the same. 
The last line of the run log is just the single word "killed", or sometimes 
there is no error in the log at all; everything seems okay, but the job 
should not have finished yet.
How can I solve this problem? Has anyone else met a similar issue?
Thanks in advance!
