[ https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanping Wang updated SPARK-13004:
---------------------------------
    Comment: was deleted

(was: Hi, Sean, in order for this model to work, we also developed a memory 
library to support this computation model. We are planning to donate this 
library to the Apache Incubator. If you are interested, I can send you a 
proposal draft.)

> Support Non-Volatile Data and Operations
> ----------------------------------------
>
>                 Key: SPARK-13004
>                 URL: https://issues.apache.org/jira/browse/SPARK-13004
>             Project: Spark
>          Issue Type: Epic
>          Components: Input/Output, Spark Core
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Wang, Gang
>              Labels: Non-VolatileRDD, Non-volatileComputing, RDD, performance
>
> Based on our experiments, SerDe-like operations have a significant negative 
> performance impact on the majority of industrial Spark workloads, especially 
> when the volume of the datasets is much larger than the system memory 
> available to the Spark cluster for caching, checkpointing, 
> shuffling/dispatching, data loading and storing. JVM on-heap management also 
> degrades performance under the pressure of large memory demand and frequent 
> memory allocation/free operations.
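> For reference, the minimal sketch below shows where this SerDe and GC overhead 
> enters a stock Spark 1.x job: a serialized storage level pushes every cached 
> partition through SerDe and keeps the resulting bytes on the JVM heap 
> (standard Spark APIs; the input path is a placeholder).
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.storage.StorageLevel
>
> // Stock Spark: a serialized storage level pushes every cached partition
> // through SerDe and keeps the resulting bytes on the JVM heap, which is
> // the overhead described above.
> val sc = new SparkContext(new SparkConf().setAppName("serde-baseline"))
> val records = sc.textFile("hdfs:///data/input")   // placeholder path
>   .map(_.split(","))
>   .persist(StorageLevel.MEMORY_ONLY_SER)          // SerDe on every cache write/read
>
> records.count()                                   // materializes the cache
> {code}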
> With the trend toward advanced server platform technologies, e.g. large-memory 
> servers, non-volatile memory, and NVMe/fast SSD array storage, this project 
> focuses on adopting the new features these platforms provide for Spark 
> applications and on retrofitting Spark to use hybrid addressable memory 
> resources wherever possible.
> *Data Object Management*
>   * Using our non-volatile generic object programming model (NVGOP) to avoid 
> SerDe as well as reduce GC overhead.
>   * Minimizing the memory footprint by loading data lazily.
>   * Fitting naturally with RDD schemas for non-volatile and off-heap RDDs.
>   * Using non-volatile/off-heap RDDs to transform Spark datasets (see the 
> sketch after this list).
>   * Avoiding the in-memory caching step through in-place non-volatile RDD 
> operations.
>   * Avoiding checkpoints for Spark computations.
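> As a rough illustration of the intended programming model (the DurableStore 
> trait and its method names below are hypothetical, not the actual NVGOP API; 
> the prototype linked at the end of this description has the real code):
> {code:scala}
> import org.apache.spark.rdd.RDD
>
> // Hypothetical sketch only: records of a "durable" RDD are laid out as
> // non-volatile generic objects in a named durable region, so downstream
> // stages reattach to that region instead of recomputing the lineage or
> // deserializing a heap-based cache block.
> trait DurableStore {
>   def save[T](rdd: RDD[T], region: String): Unit  // lay records out off-heap / in NVM
>   def load[T](region: String): RDD[T]             // reattach; no SerDe on access
> }
>
> // Intended usage pattern:
> //   durableStore.save(parsedRecords, "training-set")
> //   val reused = durableStore.load[Array[String]]("training-set")
> //   reused.map(_.length).count()   // operates in place on durable objects
> {code}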
> *Data Memory Management*
>   
>   * Managing heterogeneous memory devices as a unified hybrid memory cache 
> pool for Spark.
>   * Using non-volatile memory and similar devices for Spark checkpointing and 
> shuffle.
>   * Reclaiming allocated memory blocks automatically.
>   * Providing a unified memory block API for general-purpose memory usage (see 
> the sketch after this list).
>   
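> A minimal sketch of what such a unified memory block API could look like (the 
> names below are illustrative placeholders, not the API of the memory library 
> we plan to donate):
> {code:scala}
> // Hypothetical sketch: one allocate/reclaim interface fronting DRAM-, NVM-
> // and fast-SSD-backed pools, with blocks also reclaimed automatically once
> // no live reference to them remains.
> trait MemoryBlock {
>   def address: Long   // base address of the block
>   def size: Long      // capacity in bytes
> }
>
> trait HybridMemoryAllocator {
>   def allocate(size: Long, durable: Boolean): MemoryBlock  // pick DRAM or NVM tier
>   def reclaim(block: MemoryBlock): Unit                    // explicit release
> }
> {code}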
> *Computing Device Management*
>   * Exploiting AVX instructions, programmable FPGAs and GPUs.
>   
> Our customized Spark prototype has shown some potential improvements.
> [https://github.com/NonVolatileComputing/spark/tree/NonVolatileRDD]
> !http://bigdata-memory.github.io/images/Spark_mlib_kmeans.png|width=300!
> !http://bigdata-memory.github.io/images/total_GC_STW_pausetime.png|width=300!
>   
> This epic aims to further improve Spark performance with our non-volatile 
> solutions.



