>> the executor receives a SIGTERM (from whom???)

From the YARN Resource Manager.

Check whether YARN fair-scheduler preemption and/or speculative execution
are turned on; if either is, this is quite possible and not a bug. (Re
Jong Wook's file-descriptor point below, there is a quick check at the
bottom of this message.)
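If you want to confirm the Spark side of that, a minimal sketch (Scala,
Spark 1.x style to match this thread; assumes "sc" is your running
SparkContext) would be:

    // spark.speculation defaults to false
    val speculation = sc.getConf.getBoolean("spark.speculation", false)
    println(s"spark.speculation = $speculation")

Fair-scheduler preemption is a YARN-side setting
(yarn.scheduler.fair.preemption in yarn-site.xml on the ResourceManager),
so that one has to be checked on the cluster rather than in the Spark conf.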
--
Ruslan Dautkhanov

On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim <jongw...@nyu.edu> wrote:

> Based on my experience, YARN containers can get SIGTERM when
>
> - it produces too many logs and uses up the hard drive
> - it uses more off-heap memory than the spark.yarn.executor.memoryOverhead
>   configuration allows. This might be due to too many loaded classes (less
>   than MaxPermGen but more than memoryOverhead), or some other off-heap
>   memory allocated by a networking library, etc.
> - it opens too many file descriptors, which you can check under
>   /proc/<executor jvm's pid>/fd/ on the executor node
>
> Does any of these apply to your situation?
>
> Jong Wook
>
> On Jul 7, 2015, at 19:16, Kostas Kougios <kostas.koug...@googlemail.com>
> wrote:
>
> I am still receiving these weird SIGTERMs on the executors. The driver
> claims it lost the executor, and the executor receives a SIGTERM (from
> whom???)
>
> It doesn't seem to be a memory-related issue, though increasing memory
> takes the job a bit further or lets it complete. But why? There is no
> memory pressure on either the driver or the executor, and nothing in the
> logs indicates any.
>
> driver:
>
> 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in
> stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
> 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in
> stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
> 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
> or disconnected! Shutting down. cruncher05.stratified:32976
> 15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
> cruncher05.stratified: remote Rpc client disassociated
> 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
> from TaskSet 0.0
> 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
> or disconnected! Shutting down. cruncher05.stratified:32976
> 15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with
> remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in
> stage 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure
> (executor 1 lost)
>
> GC log for the driver; it doesn't look like it ran out of memory:
>
> 2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
> 1764131K->1391211K(3393024K), 0.0102839 secs]
> 2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
> 1764971K->1391867K(3405312K), 0.0099062 secs]
> 2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
> 1782011K->1392596K(3401216K), 0.0167572 secs]
>
> executor:
>
> 15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0
> (TID 14750)
> 15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
> found, computing it
> 15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
> SIGNAL 15: SIGTERM
> 15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called
>
> executor GC log (no out-of-memory there either, it seems):
>
> 2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
> 24696750K->23712939K(33523712K), 0.0416640 secs]
> 2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
> 24700520K->23722043K(33523712K), 0.0391156 secs]
> 2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
> 24709182K->23726510K(33518592K), 0.0390784 secs]
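P.S. Re Jong Wook's file-descriptor point above: a rough sketch of how to
count the open descriptors of the executor JVM from inside a task
(Linux-only, since it reads /proc; the helper name is mine, not a Spark
API):

    import java.io.File

    // /proc/self/fd lists this JVM's open file descriptors on Linux;
    // list() returns null if the path isn't readable, hence the Option.
    def openFdCount(): Int =
      Option(new File("/proc/self/fd").list()).map(_.length).getOrElse(-1)

    println(s"open file descriptors: ${openFdCount()}")

If it turns out to be the off-heap case instead, raising
spark.yarn.executor.memoryOverhead (e.g. --conf
spark.yarn.executor.memoryOverhead=1024 on spark-submit; the value is in
MB) is the usual first step.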