[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049485#comment-16049485 ]

Brad edited comment on SPARK-21097 at 6/14/17 6:29 PM:
-------------------------------------------------------

Hey [~srowen], thanks for your input. I would definitely like to do some 
benchmarks and show that the recovered data leads to a performance improvement. 
I don't think it will increase complexity too much: if there are any issues 
copying the data, we can just go ahead and kill the executor and fall back to 
the current behavior. I've done some preliminary work on this, and the changes 
to existing code will be minimal and unlikely to break existing functionality.
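
For concreteness, here is a rough Scala sketch of the control flow I have in 
mind. The helper names (stopSchedulingTasksOn, replicateAllRddBlocks, 
killExecutor) are placeholders, not existing Spark APIs; they just stand in 
for the scheduler and block-manager hooks this would need.

{code}
// Hypothetical sketch: the three helpers below are placeholder names,
// not existing Spark APIs.
def onExecutorIdleTimeout(executorId: String): Unit = {
  stopSchedulingTasksOn(executorId)   // 1. quiesce: send no new tasks
  try {
    // 2. copy every cached RDD block onto surviving executors
    replicateAllRddBlocks(executorId)
  } catch {
    case _: Exception =>
      // error, timeout, or not enough space: fall back to the current
      // behavior, i.e. the blocks are dropped along with the executor
  }
  killExecutor(executorId)            // 3. kill the executor either way
}
{code}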



> Dynamic allocation will preserve cached data
> --------------------------------------------
>
>                 Key: SPARK-21097
>                 URL: https://issues.apache.org/jira/browse/SPARK-21097
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Scheduler, Spark Core
>    Affects Versions: 2.2.0, 2.3.0
>            Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our Spark clusters. One difficulty is that if a user has cached data, 
> we are either prevented from de-allocating any of their executors or forced to 
> drop their cached data, which can lead to a bad user experience.
> We propose adding a feature that preserves cached data by copying it to other 
> executors before de-allocation. The behavior would be enabled by a simple 
> Spark config like "spark.dynamicAllocation.recoverCachedData" (see the 
> configuration sketch after this description). When an executor reaches its 
> configured idle timeout, instead of killing it on the spot, we stop sending 
> it new tasks, replicate all of its RDD blocks onto other executors, and then 
> kill it. If anything goes wrong while replicating the data (an error, the 
> copy taking too long, or not enough space on the remaining executors), we 
> fall back to the original behavior: drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it is completely opt-in, it is 
> unlikely to cause problems for other use cases.
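>
> As a minimal sketch, enabling this on a notebook cluster could look like the 
> following. The recoverCachedData flag is only proposed in this ticket and 
> does not exist in Spark yet; the other settings are Spark's existing 
> dynamic-allocation configs:
> {code}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
>
> val conf = new SparkConf()
>   .set("spark.dynamicAllocation.enabled", "true")
>   // the external shuffle service is required for dynamic allocation
>   .set("spark.shuffle.service.enabled", "true")
>   .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
>   // proposed in this ticket, not yet in Spark: replicate an idle
>   // executor's cached RDD blocks before killing it
>   .set("spark.dynamicAllocation.recoverCachedData", "true")
>
> val spark = SparkSession.builder().config(conf).getOrCreate()
> {code}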


