Re: [Spark-Core] Improving Reliability of spark when Executors OOM

Mridul Muralidharan Mon, 18 Mar 2024 12:51:03 -0700

Hi Ashish,

  This is something we are still actively working on internally, but it is
unfortunately not yet in a state to share widely.


Regards,
Mridul

On Mon, Mar 11, 2024 at 6:23 PM Ashish Singh <asi...@apache.org> wrote:

> Hi Kalyan,
>
> Is this something you are still interested in pursuing? There are some
> open discussion threads on the doc you shared.
>
> @Mridul Muralidharan <mri...@gmail.com> What state are your efforts on
> this in? Is it something your team is actively pursuing/building, or are
> you mostly planning right now? Asking so that we can align efforts on
> this.
>
> On Sun, Feb 18, 2024 at 10:32 PM xiaoping.huang <1754789...@qq.com> wrote:
>
>> Hi all,
>> Any updates on this project? This will be a very useful feature.
>>
>> xiaoping.huang
>> 1754789...@qq.com
>>
>> ---- Replied Message ----
>> From: kalyan <justfors...@gmail.com>
>> Date: 02/6/2024 10:08
>> To: Jay Han <tunyu...@gmail.com>
>> Cc: Ashish Singh <asi...@apache.org>,
>>     Mridul Muralidharan <mri...@gmail.com>,
>>     dev <dev@spark.apache.org>,
>>     <tgraves...@yahoo.com.invalid>
>> Subject: Re: [Spark-Core] Improving Reliability of spark when Executors
>> OOM
>> Hey,
>> Running out of disk space is also a reliability concern, but it might
>> need a different strategy to handle it.
>> As suggested by Mridul, I am working on making things more configurable
>> in another (new) module… with that, we can plug in new rules for each type
>> of error.
>>
>> Regards
>> Kalyan.
>>
>> On Mon, 5 Feb 2024 at 1:10 PM, Jay Han <tunyu...@gmail.com> wrote:
>>
>>> Hi,
>>> What about also supporting the disk space problem of "device space
>>> isn't enough"? I think it's the same kind of problem as the OOM exception.
>>>
>>> kalyan <justfors...@gmail.com> wrote on Sat, Jan 27, 2024 at 13:00:
>>>
>>>> Hi all,
>>>>
>>>> Sorry for the delay in getting the first draft of (my first) SPIP out.
>>>>
>>>> https://docs.google.com/document/d/1hxEPUirf3eYwNfMOmUHpuI5dIt_HJErCdo7_yr9htQc/edit?pli=1
>>>>
>>>> Let me know what you think.
>>>>
>>>> Regards
>>>> kalyan.
>>>>
>>>> On Sat, Jan 20, 2024 at 8:19 AM Ashish Singh <asi...@apache.org> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> Thanks for this discussion, the timing of this couldn't be better!
>>>>>
>>>>> At Pinterest, we recently started to look into reducing OOM failures
>>>>> while also reducing memory consumption of spark applications. We
>>>>> considered the following options.
>>>>> 1. Changing core count on executor to change memory available per task
>>>>> in the executor.
>>>>> 2. Changing resource profile based on task failures and gc metrics to
>>>>> grow or shrink executor memory size. We do this at application level based
>>>>> on the app's past runs today.
>>>>> 3. K8s vertical pod autoscaler
>>>>> <https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler>
>>>>>
>>>>> Internally, we are mostly getting aligned on option 2. We would love
>>>>> to make this happen and are looking forward to the SPIP.
>>>>>
>>>>>
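Option 1 above can be made concrete with a little arithmetic. The sketch below is a rough approximation and not Spark's actual memory manager: it assumes executor heap is shared evenly among concurrently running tasks, with the number of concurrent tasks taken as spark.executor.cores divided by spark.task.cpus. The function name and the sample values are illustrative only.

```python
def heap_per_task_mib(executor_memory_mib: int,
                      executor_cores: int,
                      task_cpus: int = 1) -> float:
    """Rough heap share per concurrently running task (even-split assumption)."""
    if task_cpus < 1 or executor_cores < task_cpus:
        raise ValueError("need 1 <= task_cpus <= executor_cores")
    # Number of tasks that can run at once on one executor.
    concurrent_tasks = executor_cores // task_cpus
    return executor_memory_mib / concurrent_tasks


# The same 8 GiB executor with fewer cores leaves more heap per task.
for cores in (4, 2, 1):
    print(cores, heap_per_task_mib(8192, cores))
```

Lowering the core count (or raising the per-task CPU requirement) trades parallelism for per-task memory without changing the executor's size.
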
>>>>> On Wed, Jan 17, 2024 at 9:34 AM Mridul Muralidharan <mri...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>   We are internally exploring adding support for dynamically changing
>>>>>> the resource profile of a stage based on runtime characteristics.
>>>>>> This includes failures due to OOM and the like, slowness due to
>>>>>> excessive GC, resource wastage due to excessive overprovisioning, etc.
>>>>>> Essentially, it handles scaling resources up and down.
>>>>>> Instead of baking these into the scheduler directly (which is already
>>>>>> complex), we are modeling it as a plugin - so that the 'business logic'
>>>>>> of how to handle task events and mutate state is pluggable.
>>>>>>
>>>>>> The main limitation I find with mutating only the cores is that it
>>>>>> restricts the kinds of problems that can be solved - mutating
>>>>>> resource profiles is a much more natural way to handle this
>>>>>> (spark.task.cpus predates RP).
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
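The plugin idea described above, where the logic for reacting to task events is kept out of the scheduler, might look roughly like the following. This is a minimal sketch under stated assumptions: none of these names (Profile, TaskEvent, ProfilePolicy, SimpleScalingPolicy) exist in Spark, the thresholds are arbitrary, and the real design discussed in this thread has not been published.

```python
from dataclasses import dataclass, replace
from typing import Optional, Protocol


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for a stage's resource profile."""
    memory_mib: int
    cores: int


@dataclass(frozen=True)
class TaskEvent:
    """Hypothetical summary of one finished task attempt."""
    oom: bool = False
    gc_fraction: float = 0.0   # share of task time spent in GC
    peak_memory_mib: int = 0   # 0 means "not reported"


class ProfilePolicy(Protocol):
    """The pluggable 'business logic': the scheduler only calls this hook."""
    def on_task_event(self, current: Profile, event: TaskEvent) -> Optional[Profile]:
        ...


class SimpleScalingPolicy:
    """Illustrative policy: double memory on OOM or heavy GC, halve it on waste."""

    def __init__(self, max_memory_mib: int = 32768,
                 gc_threshold: float = 0.3,
                 waste_threshold: float = 0.25) -> None:
        self.max_memory_mib = max_memory_mib
        self.gc_threshold = gc_threshold
        self.waste_threshold = waste_threshold

    def on_task_event(self, current: Profile, event: TaskEvent) -> Optional[Profile]:
        # Scale up: hard failure (OOM) or slowness from excessive GC.
        if event.oom or event.gc_fraction > self.gc_threshold:
            grown = min(current.memory_mib * 2, self.max_memory_mib)
            return replace(current, memory_mib=grown) if grown != current.memory_mib else None
        # Scale down: observed peak is far below what was provisioned.
        if 0 < event.peak_memory_mib < current.memory_mib * self.waste_threshold:
            return replace(current, memory_mib=current.memory_mib // 2)
        return None  # keep the current profile
```

Returning None for "no change" keeps the hook cheap to call on every task event, and a different policy (for example one that handles disk pressure) could be swapped in without touching the caller.
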
>>>>>> On Wed, Jan 17, 2024 at 9:18 AM Tom Graves
>>>>>> <tgraves...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>>> It is interesting. I think there are definitely some discussion
>>>>>>> points around this. Reliability vs. performance is always a trade-off,
>>>>>>> and it's great that it doesn't fail, but if it no longer meets
>>>>>>> someone's SLA, that could be just as bad if it's hard to figure out
>>>>>>> why. I think if something like this kicks in, it needs to be very
>>>>>>> obvious to the user so they can see that it occurred. Do you have
>>>>>>> something in place on the UI, or something else, that indicates this?
>>>>>>> The nice thing is also that you aren't wasting memory by increasing it
>>>>>>> for all tasks when maybe you only need it for one or two. The downside
>>>>>>> is that you only find out after a failure.
>>>>>>>
>>>>>>> I do also worry a little bit that in your blog post, the error you
>>>>>>> pointed out isn't a Java OOM but an off-heap memory issue (overhead +
>>>>>>> heap usage). You don't really address heap memory vs. off-heap in that
>>>>>>> article. The only thing I see mentioned is spark.executor.memory, which
>>>>>>> is heap memory. Obviously, adjusting to run only one task is going to
>>>>>>> give that task more overall memory, but the reasons it's running out in
>>>>>>> the first place could be different. If it were on-heap memory, for
>>>>>>> instance, with more tasks I would expect to see more GC and not an
>>>>>>> executor OOM. If you are getting an executor OOM, you are likely using
>>>>>>> more off-heap memory, stack space, etc. than you allocated. Ultimately
>>>>>>> it would be nice to know why that is happening and see if we can
>>>>>>> address it so it doesn't fail in the first place. That could be
>>>>>>> extremely difficult though, especially if software outside Spark is
>>>>>>> using that memory.
>>>>>>>
>>>>>>> As Holden said, we need to make sure this would play nice with the
>>>>>>> resource profiles, or potentially see if we can use the resource
>>>>>>> profile functionality. Theoretically you could extend this to try to
>>>>>>> get a new executor if using dynamic allocation, for instance.
>>>>>>>
>>>>>>> I agree doing a SPIP would be a good place to start to have more
>>>>>>> discussions.
>>>>>>>
>>>>>>> Tom
>>>>>>>
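The heap vs. off-heap distinction above can be illustrated with the way the executor's container size is derived. The sketch below assumes the documented defaults, where spark.executor.memoryOverhead falls back to 10% of spark.executor.memory with a 384 MiB floor; it deliberately ignores other contributors such as spark.memory.offHeap.size or PySpark memory, and the function name is illustrative.

```python
from typing import Optional

MIN_OVERHEAD_MIB = 384  # documented minimum for the default overhead


def container_limit_mib(executor_memory_mib: int,
                        overhead_factor: float = 0.10,
                        overhead_mib: Optional[int] = None) -> int:
    """Approximate container request: JVM heap plus non-heap overhead."""
    if overhead_mib is None:
        # Default overhead: a fraction of the heap, but never below the floor.
        overhead_mib = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead_mib


print(container_limit_mib(8192))                     # heap 8192 + default overhead
print(container_limit_mib(8192, overhead_mib=2048))  # explicit, larger overhead
```

Under this model, a task that exhausts the heap surfaces as GC pressure or a java.lang.OutOfMemoryError, while a process that outgrows the overhead portion is killed by the resource manager for exceeding the container limit. Raising only spark.executor.memory does not directly address the second case.
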
>>>>>>> On Wednesday, January 17, 2024 at 12:47:51 AM CST, kalyan <
>>>>>>> justfors...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hello All,
>>>>>>>
>>>>>>> At Uber, we recently did some work on improving the reliability of
>>>>>>> Spark applications in scenarios where fatter executors run out of
>>>>>>> memory, leading to application failure. Fatter executors are those
>>>>>>> that have more than one task running concurrently at a given time.
>>>>>>> This has significantly improved the reliability of many Spark
>>>>>>> applications for us at Uber. We recently published a blog post about
>>>>>>> this. Link:
>>>>>>> https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
>>>>>>>
>>>>>>> At a high level, we have made the following changes:
>>>>>>>
>>>>>>>    1. When a task fails because its executor ran out of memory, we
>>>>>>>    update the core requirement of the task to the maximum executor
>>>>>>>    cores.
>>>>>>>    2. When the task is picked for rescheduling, the new attempt runs
>>>>>>>    on an executor where no other task can run concurrently. All cores
>>>>>>>    are allocated to this task.
>>>>>>>    3. This way we ensure that the configured memory is completely at
>>>>>>>    the disposal of a single task, eliminating memory contention.
>>>>>>>
>>>>>>> The best part of this solution is that it's reactive. It kicks in
>>>>>>> only when the executors fail with the OOM exception.
>>>>>>>
>>>>>>> We understand that the problem statement is very common and we
>>>>>>> expect our solution to be effective in many cases.
>>>>>>>
>>>>>>> There could be more cases that can be covered. An executor failing
>>>>>>> with OOM is a hard signal. The framework (making the driver aware of
>>>>>>> what's happening with the executor) can be extended to handle other
>>>>>>> forms of memory pressure, such as excessive spilling to disk.
>>>>>>>
>>>>>>> While we developed this in-house on Spark 2.4.3, we would like to
>>>>>>> collaborate and contribute this work to the latest versions of Spark.
>>>>>>>
>>>>>>> What is the best way forward here? Would an SPIP detailing the
>>>>>>> changes help?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Kalyan.
>>>>>>> Uber India.
>>>>>>>
>>>>>>
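The three steps in the original proposal can be sketched as a toy scheduler. This is not Uber's implementation or Spark's scheduler: the Task and Executor classes, the "executor_oom" reason string, and the fixed EXECUTOR_CORES value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

EXECUTOR_CORES = 4  # assumed cores per executor (cf. spark.executor.cores)


@dataclass
class Task:
    task_id: int
    cpus: int = 1  # cores the task needs (cf. spark.task.cpus)


@dataclass
class Executor:
    executor_id: str
    free_cores: int = EXECUTOR_CORES


def on_task_failed(task: Task, reason: str) -> None:
    """Step 1: after an executor OOM, make the task require a whole executor."""
    if reason == "executor_oom":
        task.cpus = EXECUTOR_CORES


def try_schedule(task: Task, executors: List[Executor]) -> Optional[Executor]:
    """Steps 2-3: place the task only where enough cores are free.

    A task that needs every core can only land on an idle executor, so it
    ends up with that executor's memory to itself.
    """
    for ex in executors:
        if ex.free_cores >= task.cpus:
            ex.free_cores -= task.cpus
            return ex
    return None  # no fit right now; the task stays pending


executors = [Executor("e1", free_cores=3), Executor("e2")]
task = Task(task_id=7)
on_task_failed(task, "executor_oom")
chosen = try_schedule(task, executors)
print(chosen.executor_id, chosen.free_cores)  # prints: e2 0
```

The retry skips e1, which is already running another task, and takes all of e2. Tasks that never hit an OOM keep their one-core requirement, which is what makes the approach reactive.
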
