Hello All,

At Uber, we recently did some work on improving the reliability of Spark applications in scenarios where fatter executors run out of memory and cause the application to fail. Fatter executors are those that run more than one task concurrently at a given time. This change has significantly improved the reliability of many Spark applications for us at Uber. We recently published a blog post about this work. Link: https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
At a high level, we have made the following changes:

1. When a task fails with an executor OOM, we update the task's core requirement to the maximum number of executor cores.
2. When the task is picked up for rescheduling, the new attempt lands on an executor where no other task can run concurrently, because all of that executor's cores are allocated to this one task.
3. This way we ensure that the configured executor memory is entirely at the disposal of a single task, thus eliminating memory contention between tasks.

The best part of this solution is that it is reactive: it kicks in only when executors fail with an OOM exception. (A small illustrative sketch of the idea is appended after the signature.)

We understand that the problem statement is very common, and we expect our solution to be effective in many cases. There could be more cases that can be covered. An executor failing with OOM is a hard signal; the framework (which makes the driver aware of what is happening on the executor) can be extended to handle other forms of memory pressure, such as excessive spilling to disk.

While we developed this in-house on Spark 2.4.3, we would like to collaborate and contribute this work to the latest versions of Spark. What is the best way forward here? Would an SPIP proposal detailing the changes help?

Regards,
Kalyan.
Uber India.
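P.S. For anyone who wants a more concrete picture before an SPIP, below is a minimal, self-contained Scala sketch of the reactive core-resizing logic described above. It deliberately does not use Spark's actual scheduler internals; the names Task, Executor, resizeOnOom, and canSchedule are ours, purely for illustration.

    // Standalone sketch (not Spark internals): models the reactive core-resizing idea.
    object DynamicCoreResizingSketch {

      case class Executor(id: String, totalCores: Int)

      // Each task normally needs `defaultCpus` cores; after an OOM failure its
      // requirement is bumped to the executor's full core count so the retry
      // runs alone and gets the whole executor memory to itself.
      case class Task(id: Int, var cpusRequired: Int)

      val defaultCpus = 1

      // Reactive step: invoked only when an attempt fails with an executor OOM.
      def resizeOnOom(task: Task, executor: Executor): Unit = {
        task.cpusRequired = executor.totalCores
        println(s"Task ${task.id} failed with OOM on ${executor.id}; " +
          s"retry will require ${task.cpusRequired} cores (whole executor).")
      }

      // Scheduling check: a task fits on an executor only if enough free cores
      // remain. A task whose requirement equals totalCores therefore cannot
      // share the executor with any other running task.
      def canSchedule(task: Task, executor: Executor, coresInUse: Int): Boolean =
        executor.totalCores - coresInUse >= task.cpusRequired

      def main(args: Array[String]): Unit = {
        val executor = Executor("exec-1", totalCores = 4)
        val oomTask  = Task(id = 7, cpusRequired = defaultCpus)

        // Before the OOM: the task happily shares the executor with others.
        assert(canSchedule(oomTask, executor, coresInUse = 3))

        // An attempt fails with OOM -> bump the core requirement reactively.
        resizeOnOom(oomTask, executor)

        // After resizing: the retry can only be placed on a fully idle executor,
        // so no concurrent task competes for the executor's memory.
        assert(!canSchedule(oomTask, executor, coresInUse = 1))
        assert(canSchedule(oomTask, executor, coresInUse = 0))
      }
    }

In the real change the equivalent of resizeOnOom happens on the driver when it learns of the executor OOM, before the failed task is offered for rescheduling.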