Dear Sparkers,

Once again, in a time of desperation, I leave what remains of my sanity to 
this wise and knowledgeable community.

I have a Spark job (on EMR 5.8.0) which had been running daily for months, 
if not the whole year, with absolutely no supervision. This changed all of a 
sudden, for reasons I do not understand.

The volume of data processed daily has been slowly increasing over the past 
year but has been stable for the last couple of months. Since I'm only 
processing the past 8 days' worth of data, I do not think increased data 
volume is to blame here. Yes, I did check the volume of data for the past 
few days.

Here is a short description of the issue.

- The Spark job starts normally and gets through the first few stages 
successfully.
- Once we reach the dreaded stage, every task completes successfully (each 
typically takes no more than a minute), except the /very/ first one 
(task 0.0), which never finishes.

Here is what the log looks like (simplified for readability):

----------------------------------------
INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412 ms on 
... (executor 12) (254/256)
INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394 ms on 
... (executor 7) (255/256)
INFO ExecutorAllocationManager: Request to remove executorIds: 14
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 14
INFO YarnAllocator: Driver requested a total number of 0 executor(s).
----------------------------------------

Why is that? There is still a task waiting to be completed, right? Isn't an 
executor needed for that?

Afterwards, all the executors get killed (dynamic allocation is turned on):

----------------------------------------
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
INFO ExecutorAllocationManager: Removing executor 14 because it has been idle 
for 60 seconds (new desired total will be 5)
    .
    .
    .
INFO ExecutorAllocationManager: Request to remove executorIds: 7
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 7
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
INFO ExecutorAllocationManager: Removing executor 7 because it has been idle 
for 60 seconds (new desired total will be 1)
INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
INFO DAGScheduler: Executor lost: 7 (epoch 4)
INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from 
BlockManagerMaster.
INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, ..., 
44289, None)
INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
INFO ExecutorAllocationManager: Existing executor 7 has been removed (new total 
is 1)
----------------------------------------

Then there is nothing more in the driver's log. Nothing. The cluster then 
runs for hours with no progress being made and no executors allocated.
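
For completeness, dynamic allocation is driven by the standard 
spark.dynamicAllocation.* keys. Here is a minimal, hypothetical sketch of 
the settings behind the idle-executor removals above; the values are 
assumptions for illustration, not necessarily what my job runs with.

----------------------------------------
// Minimal sketch (not my actual code): the configuration keys that govern
// the idle-executor removals in the log above. Values are illustrative.
import org.apache.spark.SparkConf;

public class DynamicAllocationSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("dynamic-allocation-sketch")
            .set("spark.dynamicAllocation.enabled", "true")
            // Executors idle for longer than this are released; 60s is the
            // default and matches the "idle for 60 seconds" messages.
            .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
            // Hypothetical floor to keep a few executors alive even when
            // the driver thinks no tasks are pending.
            .set("spark.dynamicAllocation.minExecutors", "2")
            // External shuffle service, required for dynamic allocation
            // on YARN.
            .set("spark.shuffle.service.enabled", "true");

        System.out.println(conf.toDebugString());
    }
}
----------------------------------------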

Here is what I tried:

    - More memory per executor: from 13 GB to 24 GB, in increments.
    - Explicit repartition() on the RDD: from 128 to 256 partitions.
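
In case the exact knobs matter, here is a rough sketch of how those two 
changes are typically expressed (assuming spark.executor.memory for the 
memory bump; the placeholder input and everything else are made up for 
illustration).

----------------------------------------
// Sketch of the two mitigations tried, not the real job. Only the 24g of
// executor memory and the 256 partitions match what I actually changed.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class MitigationsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("mitigations-sketch")
            // More memory per executor (raised in increments from 13g to 24g).
            .set("spark.executor.memory", "24g");

        // The master is supplied by spark-submit on the EMR/YARN cluster.
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder input standing in for the real data set.
        JavaRDD<String> input = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Explicit repartition: 128 -> 256 partitions.
        JavaRDD<String> repartitioned = input.repartition(256);
        System.out.println(repartitioned.getNumPartitions());

        sc.stop();
    }
}
----------------------------------------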

The offending stage used to be a rather innocent-looking keyBy(). After I 
added some repartition() calls, the offending stage became a mapToPair(). 
During my latest experiments, it turned out that the repartition(256) itself 
is now the culprit.
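
To make the operations concrete, here is a hypothetical sketch of what 
keyBy() and mapToPair() look like in the Java API on a reshuffled RDD. The 
record format and keys are invented; only the operation names and the 
repartition(256) match my job.

----------------------------------------
// Hypothetical sketch of the operations named above, not my actual job.
// The "key:value" record format and the keys are invented.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class OffendingStagesSketch {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext("local[2]", "offending-stages-sketch");

        // Placeholder for the real input (8 days' worth of data in
        // practice), reshuffled the way the job now does it.
        JavaRDD<String> records = sc
            .parallelize(Arrays.asList("a:1", "b:2", "a:3"))
            .repartition(256);

        // keyBy(): tags each record with a key, giving (key, record) pairs.
        JavaPairRDD<String, String> keyed =
            records.keyBy(r -> r.split(":")[0]);

        // mapToPair(): turns each record into an explicit (key, value) tuple.
        JavaPairRDD<String, Integer> pairs = records.mapToPair(
            r -> new Tuple2<>(r.split(":")[0],
                              Integer.parseInt(r.split(":")[1])));

        System.out.println(keyed.count() + " / " + pairs.count());
        sc.stop();
    }
}
----------------------------------------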

I like Spark, but its mysteries will end up sending me to a mental hospital 
one of these days.

Can anyone shed light on what is going on here, or maybe offer some 
suggestions or pointers to relevant sources of information?

I am completely clueless.

Season's greetings,

Jeroen

