Good morning,

I have a conceptual question. In an application I am working on, when I
write some results to HDFS (*action 1*), only ~30 of my 200 executors are
used. I would like to improve resource utilization in this case.
I am aware that repartitioning the df to 200 partitions before action 1
would produce 200 tasks and full executor utilization, but for several
reasons that is not what I want to do.
What I would like to do instead is use the other ~170 executors for the
actions (jobs) that come after action 1. Normally, *action 2* would start
only after action 1 finishes (FIFO), but here I want the two to run at the
same time, with action 2 using the idle executors.

My question is: is this achievable with the FAIR scheduler, and if so, how?

From what I have read, the FAIR scheduler needs a pool of jobs and then
schedules their tasks in a round-robin fashion. If I submit action 1 and
action 2 at the same time (from separate threads, as in the sketch at the
end of this message) to a fair pool, which of the following happens?

   1. at every moment, all (or almost all) executors are used in parallel
   (30 for action 1, the rest for action 2)
   2. for some small interval of time X, 30 executors are used for action
   1, then for another interval X the other executors are used for action
   2, then action 1 runs again for X units of time, and so on...

Of the two, only 1 would actually improve cluster utilization, while 2
would merely let both jobs advance at the same time. Can someone who knows
the FAIR scheduler help me understand how it works?
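
To make it concrete, here is a minimal Scala sketch of what I mean by
submitting the two actions from separate threads with the FAIR scheduler
enabled. The DataFrames, pool names, and output path below are
placeholders, not my actual code:

import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ConcurrentActionsSketch {
  def main(args: Array[String]): Unit = {
    // The scheduler mode must be set before the context starts.
    val spark = SparkSession.builder()
      .appName("concurrent-actions-sketch")
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()

    // Placeholder DataFrames standing in for the real ones.
    val df1 = spark.range(0L, 1000000L).toDF("id")
    val df2 = spark.range(0L, 1000000L).toDF("id")

    implicit val ec: ExecutionContext = ExecutionContext.global

    // Action 1: the HDFS write, submitted from its own thread. The pool
    // name is a thread-local property, so each thread sets its own.
    val action1 = Future {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
      df1.write.mode("overwrite").parquet("hdfs:///tmp/action1") // placeholder path
    }

    // Action 2: submitted concurrently from a second thread.
    val action2 = Future {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
      df2.count()
    }

    // Block until both concurrently submitted jobs have finished.
    Await.result(Future.sequence(Seq(action1, action2)), Duration.Inf)
    spark.stop()
  }
}

As far as I understand, pools referenced this way but not declared in an
allocation file are created with default settings (weight 1, minShare 0);
custom weights would go in a fairscheduler.xml passed via
spark.scheduler.allocation.file.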

Thanks,
*Alessandro Liparoti*
