[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046544#comment-15046544 ]
Apache Spark commented on SPARK-12196:
--------------------------------------

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10192

> Store blocks in storage devices with hierarchy way
> --------------------------------------------------
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: yucai
>
> Problem:
> Nowadays, users have both SSDs and HDDs.
> SSDs have great performance, but their capacity is low. HDDs have good
> capacity, but their performance is 2x-3x lower than that of SSDs.
> How can we get the best of both?
>
> Solution:
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as
> backup storage.
> When Spark core allocates blocks for an RDD (either shuffle or RDD cache),
> it gets blocks from SSDs first; when the SSDs' usable space falls below
> some threshold, it gets blocks from HDDs instead.
> In our implementation, we actually go further. We support building a
> hierarchy store with any number of levels across all storage media
> (NVM, SSD, HDD, etc.).
>
> Performance:
> 1. In the best case, our solution performs the same as an all-SSD setup.
> In the worst case, such as when all data is spilled to HDDs, there is no
> performance regression.
> 2. Compared with an all-HDD setup, the hierarchy store improves performance
> by more than 1.86x (it could be higher; the CPU became the bottleneck in
> our test environment).
> 3. Compared with Tachyon, our hierarchy store is still 1.3x faster, because
> we support both RDD cache and shuffle and need no extra inter-process
> communication.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
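
The allocation policy described under "Solution" in the quoted issue (take blocks from the fastest level first, fall through to the next level once usable space drops below a threshold) could be sketched roughly as below. This is only an illustration in Python, not the code from the pull request: the names `Tier`, `threshold`, and `allocate` are hypothetical, sizes are plain byte counts, and the sketch assumes the last level has a threshold of 0 so that it acts as the backing store.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier:
    """One storage level such as NVM, SSD, or HDD (hypothetical model)."""
    name: str
    capacity: int       # total bytes on this level
    threshold: int = 0  # skip this level once usable space drops below this
    used: int = 0       # bytes already allocated

    def usable(self) -> int:
        return self.capacity - self.used


def allocate(tiers: List[Tier], size: int) -> Optional[str]:
    """Place a block on the first level that is above its threshold
    and can hold the block. Levels are ordered fastest first.
    Returns the chosen level's name, or None if no level has room."""
    for tier in tiers:
        if tier.usable() >= max(tier.threshold, size):
            tier.used += size
            return tier.name
    return None


# Two-level example: a small SSD in front of a large HDD.
tiers = [
    Tier("ssd", capacity=100, threshold=20),
    Tier("hdd", capacity=1000),
]
for block_size in (50, 40, 5):
    print(block_size, "->", allocate(tiers, block_size))
```

Because the levels are just an ordered list, adding a third level (for example NVM ahead of the SSD) only requires another `Tier` entry, which mirrors the "any number of levels" idea in the issue description.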