Hey, I've come across this. There is a command, "yarn application -kill <application ID>", which kills the application and leaves just a one-line 'Killed' in the log.
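For example, a quick sketch of that flow (the application ID below is invented for illustration; 'yarn application -list' prints the real ones):

    # find the ID of the running application
    yarn application -list -appStates RUNNING
    # kill it by ID (example ID only)
    yarn application -kill application_1466412345678_0042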
If it were a memory issue, the error would show up as something like 'GC
overhead limit exceeded' or another OutOfMemoryError of that sort. So I
think someone killed your job with the command I mentioned. For the person
running the job, the log will just end with that one word, 'Killed'. Maybe
this is what you faced. Maybe!

Thanks,
Aakash.

On 23-Jun-2016 11:52 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com.invalid> wrote:

> Thanks a lot for all the comments and the useful information.
>
> Yes, I have a lot of experience writing and running Spark jobs; something
> unstable tends to appear when a job runs on more data or for more time.
> Sometimes a job is not okay when some parameter is reset on the command
> line, but is okay when it is removed and the default setting is used.
> Sometimes it is the opposite, and a proper parameter value needs to be
> set.
>
> Spark 1.5 was installed here by another person.
>
>
> On Wednesday, June 22, 2016 1:59 PM, Nirav Patel <npa...@xactlycorp.com>
> wrote:
>
> Spark is a memory hogger, and suicidal if you have a job processing a
> bigger dataset. However, Databricks claims that Spark > 1.6 has
> optimizations related to memory footprint as well as processing. They are
> only available if you use DataFrames or Datasets; if you are using RDDs,
> you have to do a lot of testing and tuning.
>
> On Mon, Jun 20, 2016 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:
>
> I'm not sure that's the conclusion. It's not trivial to tune and
> configure YARN and Spark to match your app's memory needs and profile,
> but it's also just a matter of setting them properly. I'm not clear that
> you've set the executor memory, for example, in particular
> spark.yarn.executor.memoryOverhead.
>
> Everything else you mention is a symptom of YARN shutting down your
> jobs because your memory settings don't match what your app does.
> They're not problems per se, based on what you have provided.
>
> On Mon, Jun 20, 2016 at 9:17 AM, Zhiliang Zhu
> <zchl.j...@yahoo.com.invalid> wrote:
>
> > Hi Alexander,
> >
> > Thanks a lot for your comments.
> >
> > Spark seems not that stable when it comes to running a big job, with
> > too much data or too much time; yes, the problem goes away when the
> > scale is reduced. Sometimes resetting a job parameter may help (such as
> > --driver-memory, which may help with GC issues), and sometimes the code
> > must be rewritten to apply another algorithm.
> >
> > As you commented on the shuffle operation, that does sound like part of
> > the reason ...
> >
> > Best Wishes!
> >
> >
> > On Friday, June 17, 2016 8:45 PM, Alexander Kapustin
> > <kp...@hotmail.com> wrote:
> >
> > Hi Zhiliang,
> >
> > Yes, finding the exact reason for a failure is very difficult. We had
> > an issue with similar behavior; due to limited time for investigation,
> > we reduced the amount of processed data, and the problem went away.
> >
> > Some points which may help you in your investigation:
> > - If you start the spark-history-server (or monitor the running
> > application on port 4040), look into the failed stages (if any). By
> > default Spark tries to retry stage execution 2 times; after that the
> > job fails.
> > - Some useful information may be contained in the YARN logs on the
> > Hadoop nodes (yarn-<user>-nodemanager-<host>.log), but this is only
> > information about the killed container, not about the reasons why the
> > stage took so much memory (see the sketch just after this message).
> >
> > As far as I can see in your logs, the failed step relates to a shuffle
> > operation. Could you change your job to avoid massive shuffle
> > operations?
> >
> > --
> > WBR, Alexander
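A minimal sketch of the NodeManager-log check Alexander describes (the log
directory and file-name pattern below are common defaults and vary by
installation; adjust them to your cluster):

    # on each Hadoop node, search the NodeManager log for killed containers
    grep -i "killing container" /var/log/hadoop-yarn/yarn-*-nodemanager-*.log
    # a match mentioning "is running beyond physical memory limits" means
    # YARN killed the container for exceeding its memory allocation

If the matches carry no memory message at all, that points back toward an
explicit kill rather than a memory problem.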
> >
> > From: Zhiliang Zhu
> > Sent: June 17, 2016 14:10
> > To: User; kp...@hotmail.com
> > Subject: Re: spark job automatically killed without rhyme or reason
> >
> > Hi Alexander,
> >
> > Is your YARN userlog just the executor log?
> >
> > Those logs seem a little difficult to use to pinpoint exactly what went
> > wrong, since a successful job may also show some of these kinds of
> > errors ... but repair itself. Spark seems not that stable currently ...
> >
> > Thank you in advance~
> >
> >
> > On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu <zchl.j...@yahoo.com>
> > wrote:
> >
> > Hi Alexander,
> >
> > Thanks a lot for your reply.
> >
> > Yes, it was submitted via YARN.
> > Do you just mean the executor log file obtained by way of
> > yarn logs -applicationId <id>?
> >
> > In this file, in both some containers' stdout and stderr:
> >
> > 16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive
> > connection to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
> > 16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while
> > beginning fetch of 1 outstanding blocks
> > java.io.IOException: Failed to connect to
> > ip-172-31-20-104/172.31.20.104:49991        <------ could it be that
> > Spark is not stable and repairs itself after these kinds of errors?
> > (saw some in a successful run)
> >   at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
> >   at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
> >   ............
> > Caused by: java.net.ConnectException: Connection refused:
> > ip-172-31-20-104/172.31.20.104:49991
> >   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >   at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> >   at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> >   at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> >   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> >   at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> >   at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> >   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> >   at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> >
> > 16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected;
> > size = 16777216 bytes, TID = 100323        <------ would it be a memory
> > leak issue? though no GC exception was thrown, as with other normal
> > kinds of out-of-memory errors
> > 16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in
> > stage 112.0 (TID 100323)
> > java.io.IOException: Filesystem closed
> >   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
> >   at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
> >   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
> >   at java.io.DataInputStream.readFully(DataInputStream.java:195)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
> >   ...........
> >
> > Sorry, that information is from the middle of the log file; everything
> > is okay in the end part of the log.
> >
> > In the run log file (log_file) generated by this command:
> >
> > nohup spark-submit --driver-memory 20g --num-executors 20 --class
> > com.dianrong.Main --master yarn-client dianrong-retention_2.10-1.0.jar
> > doAnalysisExtremeLender /tmp/drretention/test/output 0.96
> > /tmp/drretention/evaluation/test_karthik/lgmodel
> > /tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live
> > 50 > log_file
> >
> > executor 40 lost        <------ would it be due to this? sometimes the
> > job may fail for this reason
> > ..........
> >   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
> >   at java.io.DataInputStream.readFully(DataInputStream.java:195)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
> >   ..........
> >
> > Thanks in advance!
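Picking up Sean's point above, a minimal sketch of what an explicit memory
configuration could look like for a submit command like this one (the sizes
are placeholders to show the flags, not tuned recommendations):

    spark-submit --master yarn-client \
      --driver-memory 20g \
      --num-executors 20 \
      --executor-memory 8g \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      --class com.dianrong.Main dianrong-retention_2.10-1.0.jar <args...>

In Spark 1.x, spark.yarn.executor.memoryOverhead is given in MB and defaults
to max(384, 10% of executor memory); when it is too small, YARN kills
executors that exceed their container's physical memory, which surfaces as
'executor lost' and 'Failed to connect' errors like those quoted above.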
> > On Friday, June 17, 2016 3:52 PM, Alexander Kapustin
> > <kp...@hotmail.com> wrote:
> >
> > Hi,
> >
> > Did you submit the Spark job via YARN? In some cases (memory
> > configuration, probably), YARN can kill the containers where Spark
> > tasks are executed. In this situation, please check the YARN userlogs
> > for more information...
> >
> > --
> > WBR, Alexander
> >
> > From: Zhiliang Zhu
> > Sent: June 17, 2016 9:36
> > To: Zhiliang Zhu; User
> > Subject: Re: spark job automatically killed without rhyme or reason
> >
> > Has anyone ever met a similar problem? It is quite strange ...
> >
> >
> > On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu
> > <zchl.j...@yahoo.com.INVALID> wrote:
> >
> > Hi All,
> >
> > I have a big job which takes more than one hour to run in full;
> > however, it quite unreasonably exits and finishes midway (almost 80% of
> > the job actually finished, but not all), without any apparent error or
> > exception in the log.
> >
> > I have submitted the same job many times, and it is always the same.
> > The last line of the run log is just the one word "killed", or
> > sometimes there is no error log at all; everything seems okay, but the
> > job should not have finished.
> >
> > What is the way around this problem? Have any other friends ever met a
> > similar issue ...
> >
> > Thanks in advance!
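For anyone hitting this thread later: YARN also records why an application
ended, which answers the original question directly. A sketch (hypothetical
application ID again):

    yarn application -status application_1466412345678_0042
    # the report includes Final-State and a Diagnostics field; an explicit
    # 'yarn application -kill' typically leaves a diagnostic like
    # "Application killed by user.", while memory kills reference
    # container memory limits instead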