Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-03-18 Thread Mridul Muralidharan
alyan >> Date: 02/6/2024 10:08 >> To: Jay Han >> Cc: Ashish Singh, Mridul Muralidharan, dev >> Subject: Re: [Spark-Core] Improving Reliability of spark when Executors OOM >> Hey, >> Disk space not enou

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-03-11 Thread Ashish Singh
kalyan > Date: 02/6/2024 10:08 > To: Jay Han > Cc: Ashish Singh, Mridul Muralidharan, dev > Subject: Re: [Spark-Core] Improving Reliability of spark when Executors OOM > Hey, > Disk space not enough is also a reliability concern, but might need

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-05 Thread kalyan
Hey, insufficient disk space is also a reliability concern, but it might need a different strategy to handle it. As suggested by Mridul, I am working on making things more configurable in another (new) module… with that, we can plug in new rules for each type of error. Regards, Kalyan. On Mon, 5 Feb 2024 at

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-04 Thread Jay Han
Hi, what about also handling the disk space problem, i.e. "device space isn't enough" errors? I think it's the same kind of failure as an OOM exception. kalyan wrote on Sat, Jan 27, 2024 at 13:00: > Hi all, > > Sorry for the delay in getting the first draft of (my first) SPIP out. > >

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-26 Thread kalyan
Hi all, Sorry for the delay in getting the first draft of (my first) SPIP out. https://docs.google.com/document/d/1hxEPUirf3eYwNfMOmUHpuI5dIt_HJErCdo7_yr9htQc/edit?pli=1 Let me know what you think. Regards kalyan. On Sat, Jan 20, 2024 at 8:19 AM Ashish Singh wrote: > Hey all, > > Thanks for

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-19 Thread Ashish Singh
Hey all, Thanks for this discussion; the timing couldn't be better! At Pinterest, we recently started looking into reducing OOM failures while also reducing the memory consumption of Spark applications. We considered the following options. 1. Changing core count on executor to change memory

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-17 Thread Mridul Muralidharan
Hi, We are internally exploring adding support for dynamically changing the resource profile of a stage based on runtime characteristics. This includes failures due to OOM and the like, slowness due to excessive GC, resource wastage due to excessive overprovisioning, etc. Essentially handles

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-17 Thread Tom Graves
It is interesting. I think there are definitely some discussion points around this. Reliability vs. performance is always a trade-off, and it's great that it doesn't fail, but if it now doesn't meet someone's SLA, that could be just as bad if it's hard to figure out why. I think if something like this

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-16 Thread Holden Karau
Oh, interesting solution. A co-worker was suggesting something similar using resource profiles to increase memory, but your approach avoids a lot of complexity. I like it (and we could extend it to support resource profile growth too). I think an SPIP sounds like a great next step. On Tue,

[Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-16 Thread kalyan
Hello All, At Uber, we recently did some work on improving the reliability of Spark applications in scenarios where fatter executors run out of memory and cause the application to fail. Fatter executors are those that run more than one task concurrently at a given time. This