Dear community,

I have a job that runs quite well for most stages: resources are consumed quite 
optimally (not much memory/vcores left idle). My cluster is managed and works well.
I end up with 27 executors with 2 cores each, so I can run 54 tasks in parallel. For 
many stages I indeed see a high number of tasks running in parallel. For the 
longest stage (80k tasks), however, the first 20k tasks still run with many tasks 
in parallel (so it advances quickly), but after a while only 7 tasks run 
concurrently, all on the same 3 executors residing on the same node. That means 
my other nodes (8 of them, with over 45 healthy executors) are idle for over 3 hours.
I notice in the logs that all tasks run at locality level "NODE_LOCAL".

I wonder what is causing this and whether I can do something to make the idle 
executors do work as well. I see two possibilities:
1) It is just the way it is: at some point in this stage, the remaining tasks have 
dependencies, so the task scheduler cannot submit more tasks at once? I thought it 
was the stages that depend on each other, which is why often (not always) the next 
stage has to wait for the previous one to finish before it can start. But I can't 
find anywhere whether the tasks within a stage can also depend on each other. I 
thought the number of tasks equals the number of partitions and that these, by 
definition, can be executed in parallel. I guess they probably are dependent, 
because when I look at the DAG of this stage it is very complex.
2) I am playing with spark.locality.wait to see if this can help. Maybe the 
scheduler just likes to run everything as close to the data as possible, hence 
NODE_LOCAL for all tasks. Maybe if I decrease spark.locality.wait from the default 
3s to something lower (1s), it will start shipping more data over the network and 
give tasks to the idle executors as well (see the sketch below).
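
Concretely, this is roughly what I am trying; just a minimal sketch of a Scala job, 
where the app name and the 1s value are placeholders I am experimenting with:

  import org.apache.spark.sql.SparkSession

  // Lower the locality wait so the scheduler gives up on NODE_LOCAL placement
  // sooner and hands tasks to idle executors on other nodes.
  val spark = SparkSession.builder()
    .appName("locality-wait-test")          // placeholder name
    .config("spark.locality.wait", "1s")    // default is 3s; "0s" disables the wait entirely
    // There is also a per-level override if only the node-level wait should change:
    // .config("spark.locality.wait.node", "1s")
    .getOrCreate()

The same setting can of course be passed on spark-submit instead of in code, 
e.g. --conf spark.locality.wait=1s.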

Does anyone have any idea?
Thanks!