Yes, YARN was terminating the executor because the off-heap memory limit was exceeded.
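
For anyone checking the arithmetic, here is a rough sketch of how the container limit that YARN enforces can be estimated from the executor settings. The default overhead used here, max(384 MiB, 10% of executor memory), is an assumption that depends on the Spark version, and the function names are only illustrative. YARN's rounding up to its allocation increment is not modelled.

```python
# Rough estimate of the memory limit YARN enforces on a Spark executor container.
# ASSUMPTION: when spark.yarn.executor.memoryOverhead is not set, the overhead
# defaults to max(384 MiB, 10% of executor memory). Check your Spark version.

MIN_OVERHEAD_MIB = 384   # assumed default floor
OVERHEAD_FACTOR = 0.10   # assumed default fraction


def container_limit_mib(executor_memory_mib, memory_overhead_mib=None):
    """Return executor heap plus off-heap overhead, in MiB."""
    if memory_overhead_mib is None:
        memory_overhead_mib = max(MIN_OVERHEAD_MIB,
                                  int(executor_memory_mib * OVERHEAD_FACTOR))
    return executor_memory_mib + memory_overhead_mib


def exceeds_limit(process_rss_mib, executor_memory_mib, memory_overhead_mib=None):
    """True if the whole JVM process (heap + off-heap) is over the container limit."""
    limit = container_limit_mib(executor_memory_mib, memory_overhead_mib)
    return process_rss_mib > limit
```

The point is that the heap can look healthy in the GC log while the process as a whole is still over the limit, because the limit applies to heap plus everything off-heap.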

On 13/07/15 06:55, Ruslan Dautkhanov wrote:
>> the executor receives a SIGTERM (from whom???)

From the YARN Resource Manager.

Check whether YARN fair scheduler preemption and/or speculative execution are turned on.
If they are, this is quite possible and not a bug.



--
Ruslan Dautkhanov

On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim <jongw...@nyu.edu> wrote:

    Based on my experience, YARN containers can get a SIGTERM when:

    - the container produces too many logs and fills up the hard drive
    - it uses more off-heap memory than is allowed by the
    spark.yarn.executor.memoryOverhead configuration. This might be due
    to too many classes being loaded (less than MaxPermSize but more than
    memoryOverhead), or to other off-heap memory allocated by a
    networking library, etc.
    - it opens too many file descriptors, which you can check in
    /proc/<executor jvm's pid>/fd/ on the executor node

    Do any of these apply to your situation?
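
The file descriptor check from the last bullet could be scripted roughly as follows. This is only a sketch for Linux. The function names and the `proc_root` parameter are made up for illustration, and reading another user's /proc entries normally requires running as that user or as root.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd, one per open file descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_fd_limit(pid, proc_root="/proc"):
    """Read the soft 'Max open files' limit from <proc_root>/<pid>/limits.

    Returns None if the limit is 'unlimited' or the row is missing.
    """
    label = "Max open files"
    with open(os.path.join(proc_root, str(pid), "limits")) as fh:
        for line in fh:
            if line.startswith(label):
                # Remaining columns: soft limit, hard limit, units
                soft = line[len(label):].split()[0]
                return None if soft == "unlimited" else int(soft)
    return None
```

Calling both with the executor JVM's pid on the executor node and comparing the two numbers shows how close the process is to its limit.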

    Jong Wook

    On Jul 7, 2015, at 19:16, Kostas Kougios
    <kostas.koug...@googlemail.com> wrote:

    I am still receiving these weird SIGTERMs on the executors. The driver
    claims it lost the executor, and the executor receives a SIGTERM (from
    whom???).

    It doesn't seem to be a memory-related issue, though increasing memory
    takes the job a bit further or lets it complete. But why? There is no
    memory pressure on either the driver or the executor, and nothing in
    the logs indicates any.

    driver:

    15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
    15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
    15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. cruncher05.stratified:32976
    15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on cruncher05.stratified: remote Rpc client disassociated
    15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
    15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. cruncher05.stratified:32976
    15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
    15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1 lost)

    GC log for the driver (it doesn't look like it ran out of memory):

    2015-07-07T10:45:19.887+0100: [GC (Allocation Failure) 1764131K->1391211K(3393024K), 0.0102839 secs]
    2015-07-07T10:46:00.934+0100: [GC (Allocation Failure) 1764971K->1391867K(3405312K), 0.0099062 secs]
    2015-07-07T10:46:45.252+0100: [GC (Allocation Failure) 1782011K->1392596K(3401216K), 0.0167572 secs]

    executor:

    15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0 (TID 14750)
    15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not found, computing it
    15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
    15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

    Executor GC log (no out-of-memory condition, as far as I can tell):
    2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC) 24696750K->23712939K(33523712K), 0.0416640 secs]
    2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC) 24700520K->23722043K(33523712K), 0.0391156 secs]
    2015-07-07T10:47:02.862+0100: [GC (Allocation Failure) 24709182K->23726510K(33518592K), 0.0390784 secs]
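
To put numbers on "no out of memory", the heap occupancy after each collection can be pulled out of GC lines like the ones above. This is a small sketch that assumes the `before->after(capacity)` format shown in these logs; the names are illustrative. It only covers the Java heap, so off-heap usage, which is what the container limit also counts, will not show up here.

```python
import re

# Matches the "before->after(capacity)" triple in a GC log entry; sizes are in K.
GC_SIZES = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")


def heap_occupancy_after_gc(log_text):
    """Return one after/capacity ratio per GC event found in log_text."""
    return [int(after) / int(capacity)
            for _before, after, capacity in GC_SIZES.findall(log_text)]
```

Ratios well below 1.0 after each collection mean the heap itself is not exhausted, which fits a kill caused by memory outside the heap.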





    --
    View this message in context:
    http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
    Sent from the Apache Spark User List mailing list archive at Nabble.com.
