Oh, interesting solution. A co-worker was suggesting something similar using resource profiles to increase memory, but your approach avoids a lot of complexity, which I like (and we could extend it to support resource profile growth too).
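
As a rough illustration of the reactive idea in the quoted message below (a toy sketch, not Uber's implementation and not Spark's scheduler API): the `Task` class, the `on_task_failed` hook, the `"executor_oom"` reason string, and the 4-core executor are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

EXECUTOR_CORES = 4  # assumed executor size, for illustration only


@dataclass
class Task:
    task_id: int
    cpus: int = 1      # cores the task requests (analogous to spark.task.cpus)
    attempts: int = 0


def on_task_failed(task: Task, reason: str,
                   executor_cores: int = EXECUTOR_CORES) -> Task:
    """Hypothetical driver-side hook: only an executor OOM bumps the core requirement."""
    task.attempts += 1
    if reason == "executor_oom":
        # Ask for every core, so no other task can be co-scheduled on the retry.
        task.cpus = executor_cores
    return task


def max_concurrent_tasks(task: Task,
                         executor_cores: int = EXECUTOR_CORES) -> int:
    """How many tasks with this core requirement fit on one executor at once."""
    return executor_cores // task.cpus


task = Task(task_id=7)
before = max_concurrent_tasks(task)   # first attempt shares the executor
on_task_failed(task, "executor_oom")
after = max_concurrent_tasks(task)    # retry gets the executor to itself
print(before, after)
```

Failures for any other reason leave `cpus` untouched in this sketch, which mirrors the point that the mechanism only kicks in on OOM.
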
I think an SPIP sounds like a great next step.

On Tue, Jan 16, 2024 at 10:46 PM kalyan <justfors...@gmail.com> wrote:

> Hello All,
>
> At Uber, we recently did some work on improving the reliability of
> Spark applications in scenarios where fatter executors run out of memory,
> leading to application failure. Fatter executors are those that run more
> than one task concurrently. This has significantly improved the
> reliability of many Spark applications for us at Uber. We recently
> published a blog post about this. Link:
> https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
>
> At a high level, we have made the following changes:
>
> 1. When a task fails because its executor ran out of memory (OOM), we
>    update the core requirement of the task to the maximum executor cores.
> 2. When the task is picked for rescheduling, the new attempt of the
>    task runs on an executor where no other task can run concurrently.
>    All cores are allocated to this task.
> 3. This way we ensure that the configured memory is completely at the
>    disposal of a single task, thus eliminating memory contention.
>
> The best part of this solution is that it's reactive. It kicks in only
> when executors fail with an OOM exception.
>
> We understand that the problem statement is very common, and we expect our
> solution to be effective in many cases.
>
> There could be more cases that can be covered. An executor failing with OOM
> is like a hard signal. The framework (making the driver aware of
> what's happening with the executor) can be extended to handle
> other forms of memory pressure, such as excessive spilling to disk.
>
> While we developed this in-house on Spark 2.4.3, we would like to
> collaborate and contribute this work to the latest versions of Spark.
>
> What is the best way forward here? Would an SPIP detailing the
> changes help?
>
> Regards,
> Kalyan.
> Uber India.
>

--
Cell : 425-233-8271