Hey, I've come across this. There is a command, "yarn application -kill <application ID>", which kills the application and leaves just a one-line 'Killed' in the log.
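For example, a quick sketch of that flow (the application ID below is invented for illustration; 'yarn application -list' prints the real ones):

    # find the ID of the running application
    yarn application -list -appStates RUNNING
    # kill it by ID (example ID only)
    yarn application -kill application_1466412345678_0042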
If it were a memory issue, the error would show up as something like 'GC
overhead limit exceeded' or another OutOfMemoryError of that sort. So I
think someone killed your job with the command I mentioned. For the person
running the job, the log will just end with that one word, 'Killed'. Maybe
this is what you faced. Maybe!

Thanks,
Aakash.

On 23-Jun-2016 11:52 AM, "Zhiliang Zhu" <zchl.j...@yahoo.com.invalid> wrote:

> Thanks a lot for all the comments and the useful information.
>
> Yes, I have a lot of experience writing and running Spark jobs; something
> unstable tends to appear when a job runs on more data or for more time.
> Sometimes a job is not okay when some parameter is reset on the command
> line, but is okay when it is removed and the default setting is used.
> Sometimes it is the opposite, and a proper parameter value needs to be
> set.
>
> Spark 1.5 was installed here by another person.
>
>
> On Wednesday, June 22, 2016 1:59 PM, Nirav Patel <npa...@xactlycorp.com>
> wrote:
>
> Spark is a memory hogger, and suicidal if you have a job processing a
> bigger dataset. However, Databricks claims that Spark > 1.6 has
> optimizations related to memory footprint as well as processing. They are
> only available if you use DataFrames or Datasets; if you are using RDDs,
> you have to do a lot of testing and tuning.
>
> On Mon, Jun 20, 2016 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:
>
> I'm not sure that's the conclusion. It's not trivial to tune and
> configure YARN and Spark to match your app's memory needs and profile,
> but it's also just a matter of setting them properly. I'm not clear that
> you've set the executor memory, for example, in particular
> spark.yarn.executor.memoryOverhead.
>
> Everything else you mention is a symptom of YARN shutting down your
> jobs because your memory settings don't match what your app does.
> They're not problems per se, based on what you have provided.
>
> On Mon, Jun 20, 2016 at 9:17 AM, Zhiliang Zhu
> <zchl.j...@yahoo.com.invalid> wrote:
>
> > Hi Alexander,
> >
> > Thanks a lot for your comments.
> >
> > Spark seems not that stable when it comes to running a big job, with
> > too much data or too much time; yes, the problem goes away when the
> > scale is reduced. Sometimes resetting a job parameter may help (such as
> > --driver-memory, which may help with GC issues), and sometimes the code
> > must be rewritten to apply another algorithm.
> >
> > As you commented on the shuffle operation, that does sound like part of
> > the reason ...
> >
> > Best Wishes!
> >
> >
> > On Friday, June 17, 2016 8:45 PM, Alexander Kapustin
> > <kp...@hotmail.com> wrote:
> >
> > Hi Zhiliang,
> >
> > Yes, finding the exact reason for a failure is very difficult. We had
> > an issue with similar behavior; due to limited time for investigation,
> > we reduced the amount of processed data, and the problem went away.
> >
> > Some points which may help you in your investigation:
> > - If you start the spark-history-server (or monitor the running
> > application on port 4040), look into the failed stages (if any). By
> > default Spark tries to retry stage execution 2 times; after that the
> > job fails.
> > - Some useful information may be contained in the YARN logs on the
> > Hadoop nodes (yarn-<user>-nodemanager-<host>.log), but this is only
> > information about the killed container, not about the reasons why the
> > stage took so much memory (see the sketch just after this message).
> >
> > As far as I can see in your logs, the failed step relates to a shuffle
> > operation. Could you change your job to avoid massive shuffle
> > operations?
> >
> > --
> > WBR, Alexander
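A minimal sketch of the NodeManager-log check Alexander describes (the log
directory and file-name pattern below are common defaults and vary by
installation; adjust them to your cluster):

    # on each Hadoop node, search the NodeManager log for killed containers
    grep -i "killing container" /var/log/hadoop-yarn/yarn-*-nodemanager-*.log
    # a match mentioning "is running beyond physical memory limits" means
    # YARN killed the container for exceeding its memory allocation

If the matches carry no memory message at all, that points back toward an
explicit kill rather than a memory problem.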
> >
> > From: Zhiliang Zhu
> > Sent: June 17, 2016 14:10
> > To: User; kp...@hotmail.com
> > Subject: Re: spark job automatically killed without rhyme or reason
> >
> > Hi Alexander,
> >
> > Is your YARN userlog just the executor log?
> >
> > Those logs seem a little difficult to use to pinpoint exactly what went
> > wrong, since a successful job may also show some of these kinds of
> > errors ... but repair itself. Spark seems not that stable currently ...
> >
> > Thank you in advance~
> >
> >
> > On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu <zchl.j...@yahoo.com>
> > wrote:
> >
> > Hi Alexander,
> >
> > Thanks a lot for your reply.
> >
> > Yes, it was submitted via YARN.
> > Do you just mean the executor log file obtained by way of
> > yarn logs -applicationId <id>?
> >
> > In this file, in both some containers' stdout and stderr:
> >
> > 16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive
> > connection to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
> > 16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while
> > beginning fetch of 1 outstanding blocks
> > java.io.IOException: Failed to connect to
> > ip-172-31-20-104/172.31.20.104:49991        <------ could it be that
> > Spark is not stable and repairs itself after these kinds of errors?
> > (saw some in a successful run)
> >   at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
> >   at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
> >   ............
> > Caused by: java.net.ConnectException: Connection refused:
> > ip-172-31-20-104/172.31.20.104:49991
> >   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >   at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> >   at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> >   at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> >   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> >   at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> >   at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> >   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> >   at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> >
> > 16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected;
> > size = 16777216 bytes, TID = 100323        <------ would it be a memory
> > leak issue? though no GC exception was thrown, as with other normal
> > kinds of out-of-memory errors
> > 16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in
> > stage 112.0 (TID 100323)
> > java.io.IOException: Filesystem closed
> >   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
> >   at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
> >   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
> >   at java.io.DataInputStream.readFully(DataInputStream.java:195)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
> >   ...........
> >
> > Sorry, that information is from the middle of the log file; everything
> > is okay in the end part of the log.
> >
> > In the run log file (log_file) generated by this command:
> >
> > nohup spark-submit --driver-memory 20g --num-executors 20 --class
> > com.dianrong.Main --master yarn-client dianrong-retention_2.10-1.0.jar
> > doAnalysisExtremeLender /tmp/drretention/test/output 0.96
> > /tmp/drretention/evaluation/test_karthik/lgmodel
> > /tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live
> > 50 > log_file
> >
> > executor 40 lost        <------ would it be due to this? sometimes the
> > job may fail for this reason
> > ..........
> >   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
> >   at java.io.DataInputStream.readFully(DataInputStream.java:195)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
> >   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
> >   ..........
> >
> > Thanks in advance!
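Picking up Sean's point above, a minimal sketch of what an explicit memory
configuration could look like for a submit command like this one (the sizes
are placeholders to show the flags, not tuned recommendations):

    spark-submit --master yarn-client \
      --driver-memory 20g \
      --num-executors 20 \
      --executor-memory 8g \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      --class com.dianrong.Main dianrong-retention_2.10-1.0.jar <args...>

In Spark 1.x, spark.yarn.executor.memoryOverhead is given in MB and defaults
to max(384, 10% of executor memory); when it is too small, YARN kills
executors that exceed their container's physical memory, which surfaces as
'executor lost' and 'Failed to connect' errors like those quoted above.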
> > On Friday, June 17, 2016 3:52 PM, Alexander Kapustin
> > <kp...@hotmail.com> wrote:
> >
> > Hi,
> >
> > Did you submit the Spark job via YARN? In some cases (memory
> > configuration, probably), YARN can kill the containers where Spark
> > tasks are executed. In this situation, please check the YARN userlogs
> > for more information...
> >
> > --
> > WBR, Alexander
> >
> > From: Zhiliang Zhu
> > Sent: June 17, 2016 9:36
> > To: Zhiliang Zhu; User
> > Subject: Re: spark job automatically killed without rhyme or reason
> >
> > Has anyone ever met a similar problem? It is quite strange ...
> >
> >
> > On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu
> > <zchl.j...@yahoo.com.INVALID> wrote:
> >
> > Hi All,
> >
> > I have a big job which takes more than one hour to run in full;
> > however, it quite unreasonably exits and finishes midway (almost 80% of
> > the job actually finished, but not all), without any apparent error or
> > exception in the log.
> >
> > I have submitted the same job many times, and it is always the same.
> > The last line of the run log is just the one word "killed", or
> > sometimes there is no error log at all; everything seems okay, but the
> > job should not have finished.
> >
> > What is the way around this problem? Have any other friends ever met a
> > similar issue ...
> >
> > Thanks in advance!
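For anyone hitting this thread later: YARN also records why an application
ended, which answers the original question directly. A sketch (hypothetical
application ID again):

    yarn application -status application_1466412345678_0042
    # the report includes Final-State and a Diagnostics field; an explicit
    # 'yarn application -kill' typically leaves a diagnostic like
    # "Application killed by user.", while memory kills reference
    # container memory limits instead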