[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--------------------------
    Description: 
*Problem*
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but their performance is 2x-3x lower than that of SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it 
gets blocks from SSDs first, and when the SSDs' usable space falls below some 
threshold, it gets blocks from HDDs instead.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
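
The fallback rule described above can be sketched as follows. This is only an illustration, not the code from the patch: the names Layer, usable_bytes and pick_dir are hypothetical, and checking the threshold per directory (rather than per layer as a whole) is an assumption, since the description does not say which is used.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Layer:
    name: str                  # e.g. "ssd"
    dirs: List[str]            # directories backed by this medium
    threshold: Optional[int]   # minimum usable bytes; None marks the last layer


def usable_bytes(path: str) -> int:
    """Free bytes on the volume that holds `path`."""
    return shutil.disk_usage(path).free


def pick_dir(layers: List[Layer],
             free: Callable[[str], int] = usable_bytes) -> str:
    """Return the first directory, in layer order, that still has headroom.

    Assumption: a directory is usable while its free space is at least the
    layer's threshold; the last layer (threshold None) always accepts blocks.
    """
    for layer in layers:
        for d in layer.dirs:
            if layer.threshold is None or free(d) >= layer.threshold:
                return d
    raise RuntimeError("no layer has a usable directory")
```

Passing `free` as a parameter keeps the policy testable without real disks; a faster layer is skipped only once every one of its directories has dropped below the threshold.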

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, such as when all data is spilled to HDDs, there is no 
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more 
than *_x1.86_* (it could be higher; the CPU became the bottleneck in our test 
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because 
we support both RDD cache and shuffle and need no extra inter-process 
communication.

*Usage*
1. Set the priority and threshold for each layer in 
spark.storage.hierarchyStore.
{code}
spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and 
all the rest form the last layer.

2. Configure each layer's location: the user just needs to put the keywords 
specified in step 1, like "nvm" and "ssd", into the local dirs, such as 
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}
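
As a rough illustration of how the two settings above could be turned into layers, the sketch below parses the spark.storage.hierarchyStore value and groups the local dirs by keyword. The function names, the "rest" label for the last layer, binary size units, and substring matching of keywords against paths are all assumptions made for this example; the actual patch may do this differently.

```python
import re
from typing import Dict, List, Tuple

# Assumption: sizes use binary multiples; the proposal does not say.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_hierarchy_store(value: str) -> List[Tuple[str, int]]:
    """Parse 'nvm 50GB,ssd 80GB' into [(keyword, threshold_bytes), ...]."""
    layers = []
    for entry in value.strip("'\"").split(","):
        keyword, size = entry.split()
        match = re.fullmatch(r"(\d+)(KB|MB|GB|TB)", size.upper())
        if match is None:
            raise ValueError(f"unrecognized threshold: {size!r}")
        layers.append((keyword, int(match.group(1)) * _UNITS[match.group(2)]))
    return layers


def group_local_dirs(local_dirs: str, keywords: List[str]) -> Dict[str, List[str]]:
    """Assign each directory to the first keyword found in its path.

    Directories matching no keyword go to the last layer, stored here under
    the hypothetical label "rest".
    """
    groups: Dict[str, List[str]] = {k: [] for k in keywords}
    groups["rest"] = []
    for path in local_dirs.split(","):
        layer = next((k for k in keywords if k in path), "rest")
        groups[layer].append(path)
    return groups


layers = parse_hierarchy_store("'nvm 50GB,ssd 80GB'")
dirs = group_local_dirs(
    "/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,"
    "/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others",
    [name for name, _ in layers],
)
```

With the example values, /mnt/nvm1 lands in the "nvm" layer, the three /mnt/ssd* dirs in the "ssd" layer, and the remaining five dirs in the last layer.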

After that, restart your Spark application; it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.

  was:
*Problem*
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but their performance is 2x-3x lower than that of SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it 
gets blocks from SSDs first, and when the SSDs' usable space falls below some 
threshold, it gets blocks from HDDs instead.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, such as when all data is spilled to HDDs, there is no 
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more 
than *_x1.86_* (it could be higher; the CPU became the bottleneck in our test 
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because 
we support both RDD cache and shuffle and need no extra inter-process 
communication.

*Usage*
1. Configure spark.storage.hierarchyStore.
{code}
spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and 
all the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, such as 
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.


> Store blocks in storage devices with hierarchy way
> --------------------------------------------------
>
>                 Key: SPARK-12196
>                 URL: https://issues.apache.org/jira/browse/SPARK-12196
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: yucai



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
