Hi all,

I've written a new Spark feature and I would love to have a committer take a 
look at it. The goal is to improve Spark performance under dynamic allocation 
by preserving cached data.

The PR and Jira ticket are here:

https://github.com/apache/spark/pull/19041
https://issues.apache.org/jira/browse/SPARK-21097

Notebook Spark users are the primary target for this change. Notebook users 
generally have periods of inactivity during which their Spark executors could 
be used for other jobs, but if the user has any cached data, they must either 
hold on to those executors or lose the cached data. This change addresses the 
problem by replicating cached blocks to surviving executors before shutting 
down idle ones.
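For context, the trade-off today is visible in the existing dynamic allocation settings (these property names are from the current Spark configuration; the values shown are illustrative, and exact defaults may vary by Spark version):

```
# spark-defaults.conf (illustrative values)
spark.dynamicAllocation.enabled                    true
# Idle executors with no cached blocks are reclaimed after this timeout.
spark.dynamicAllocation.executorIdleTimeout        60s
# Executors holding cached blocks are exempt by default (infinity), so they
# are never reclaimed; setting a finite value reclaims them, but their
# cached blocks are discarded.
spark.dynamicAllocation.cachedExecutorIdleTimeout  infinity
```

With the default of `infinity`, cached executors are locked up indefinitely; with a finite timeout, the cache is lost. The proposed change aims to remove that dilemma.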

I have conducted some benchmarks showing significant performance gains under 
the right usage patterns. See the benchmark data here:

https://docs.google.com/document/d/1E6_rhAAJB8Ww0n52-LYcFTO1zhJBWgfIXzNjLi29730/edit?usp=sharing

I tried to mitigate the risk of this change by keeping the code self-contained 
and by falling back to regular dynamic allocation behavior if any issues 
arise. The feature should work with any coarse-grained backend, and I have 
tested it on YARN and standalone clusters.

I would love to discuss this change with anyone who is interested. Your 
attention is greatly appreciated.

Thanks,
Brad Kaiser

