[ 
https://issues.apache.org/jira/browse/SPARK-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanzhen Geng updated SPARK-14289:
----------------------------------
    Summary: Support multiple eviction strategies for cached RDD partitions  
(was: Add support to multiple eviction strategies for cached RDD partitions)

> Support multiple eviction strategies for cached RDD partitions
> --------------------------------------------------------------
>
>                 Key: SPARK-14289
>                 URL: https://issues.apache.org/jira/browse/SPARK-14289
>             Project: Spark
>          Issue Type: New Feature
>          Components: Block Manager, Spark Core
>         Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>            Reporter: Yuanzhen Geng
>            Priority: Minor
>
> Currently, there is only one eviction strategy for cached RDD partitions in 
> Spark. The default strategy is LRU, with the additional rule that a block is 
> never evicted to make room for another block that belongs to the same RDD as 
> the partition currently being cached.
> When memory is not sufficient for RDD caching, several partitions are 
> evicted; if these partitions are needed again later, they are recomputed 
> from their lineage and cached in memory again. This recomputation brings 
> additional cost, and LRU gives no guarantee of the lowest recomputation 
> cost.
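> As a rough illustration only (a simplified sketch, not the actual 
> MemoryStore code; the block ids, sizes and access-ordered list are assumed 
> inputs), the current behaviour can be thought of as scanning blocks from 
> least to most recently used while skipping blocks of the RDD being cached:
> {code:scala}
> // Simplified, hypothetical sketch of LRU eviction with the same-RDD rule.
> case class CachedBlock(rddId: Int, partition: Int, size: Long)
>
> def selectLruVictims(blocksInAccessOrder: Seq[CachedBlock],
>                      rddBeingCached: Int,
>                      spaceNeeded: Long): Seq[CachedBlock] = {
>   var freed = 0L
>   val victims = scala.collection.mutable.Buffer[CachedBlock]()
>   // Least recently used blocks come first in the input sequence.
>   for (block <- blocksInAccessOrder if freed < spaceNeeded) {
>     // Rule: never evict a block of the RDD whose partition is being cached.
>     if (block.rddId != rddBeingCached) {
>       victims += block
>       freed += block.size
>     }
>   }
>   victims.toSeq
> }
> {code}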
> The first RDD that needs to be cached is usually produced by reading from 
> HDFS and applying several transformations. The read from HDFS usually takes 
> longer than other Spark transformations.
> For example, suppose one stage has the following DAG structure: hdfs -> 
> [A] -> B -> [C] -> D -> [E] -> [F], where RDDs A, C, E and F need to be 
> cached in memory. F is being created during this stage, while A, C and E 
> were already created in previous stages. With the LRU eviction strategy, 
> partitions of A are evicted first. However, the time spent in [A] -> B -> 
> [C] may be much less than the time spent in hdfs -> [A], so evicting [C] 
> may be better than evicting [A].
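> A minimal sketch of such a lineage (the dataset path and the concrete 
> transformations below are illustrative assumptions, not part of this 
> proposal):
> {code:scala}
> // Illustrative lineage; the concrete transformations do not matter here.
> val a = sc.textFile("hdfs://...").persist()   // hdfs -> [A], expensive to recreate
> val b = a.map(line => line.split(","))        // [A] -> B, not cached
> val c = b.filter(_.length > 1).persist()      // B -> [C], cheap to recreate from A
> val d = c.map(fields => (fields(0), 1L))      // [C] -> D, not cached
> val e = d.reduceByKey(_ + _).persist()        // D -> [E], cached
> val f = e.mapValues(_ * 2).persist()          // [E] -> [F], being created in this stage
> {code}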
> An eviction strategy based on creation cost may be better than LRU: record 
> the time spent in each transformation while a cached RDD partition is being 
> created (e.g. for [E], only the time spent in [C] -> D and D -> [E] needs 
> to be recorded), plus the time spent in any required shuffle reads. When 
> memory for RDD storage is not sufficient, the partition with the least 
> creation cost is evicted first, so this strategy can be called LCS. My 
> current demo shows a better performance gain than the default LRU.
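> A rough sketch of LCS victim selection, assuming each cached block carries 
> its measured creation cost (all names below are hypothetical):
> {code:scala}
> // Hypothetical sketch of LCS selection. creationCost is the measured time
> // (transformation time plus any shuffle-read time) to recreate the block.
> case class CostedBlock(rddId: Int, partition: Int, size: Long, creationCost: Long)
>
> def selectLcsVictims(blocks: Seq[CostedBlock],
>                      rddBeingCached: Int,
>                      spaceNeeded: Long): Seq[CostedBlock] = {
>   var freed = 0L
>   val victims = scala.collection.mutable.Buffer[CostedBlock]()
>   // Keep the same-RDD rule, then evict the cheapest-to-recreate blocks first.
>   for (block <- blocks.filter(_.rddId != rddBeingCached).sortBy(_.creationCost)
>        if freed < spaceNeeded) {
>     victims += block
>     freed += block.size
>   }
>   victims.toSeq
> }
> {code}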
> This strategy needs to consider the following situations:
> 1. Unified Memory Management (UMM) was introduced in Spark 1.6. The 
> execution memory available while recomputing a partition may be quite 
> different from when the partition was first created, so until this is 
> better thought through, LCS may not be allowed in UMM mode (though my demo 
> also shows an improvement of LCS over LRU in UMM mode).
> 2. MEMORY_AND_DISK_SER and similar storage levels serialize RDD partitions. 
> By estimating the serialization/deserialization cost and comparing it to 
> the creation cost, a partition whose ser/deser cost is larger than its 
> recreation cost should not be serialized but dropped from memory directly. 
> Since existing storage levels can only be set for a whole RDD, a new 
> storage level may be needed so that each partition can decide whether to 
> serialize or simply drop from memory (see the sketch after this list).
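> A rough sketch of that per-partition decision, assuming both costs can be 
> estimated (all names below are hypothetical):
> {code:scala}
> // Hypothetical sketch: decide what to do with a partition that must leave
> // the deserialized in-memory store. All cost estimates are assumed inputs.
> sealed trait EvictionAction
> case object SerializeBlock extends EvictionAction   // keep it, in serialized form
> case object DropFromMemory extends EvictionAction   // rely on recomputation instead
>
> def chooseAction(serializeCost: Long,   // estimated time to serialize now
>                  deserializeCost: Long, // estimated time to read it back later
>                  recreationCost: Long   // measured creation cost (LCS statistic)
>                 ): EvictionAction = {
>   // If paying ser + deser costs more than recomputing from lineage,
>   // drop the partition and let its lineage recreate it on demand.
>   if (serializeCost + deserializeCost > recreationCost) DropFromMemory
>   else SerializeBlock
> }
> {code}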
> Besides LCS, FIFO and LFU would also be easy to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
