[ 
https://issues.apache.org/jira/browse/SPARK-22229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199052#comment-16199052
 ] 

Yuval Degani commented on SPARK-22229:
--------------------------------------

[~srowen], [~viirya], thanks for your response.

Regarding whether RDMA requires specialized hardware:
RDMA is considered a commodity these days. You will find that most 10Gb/s+ 
network cards support it, and RDMA-capable NICs are sold by many vendors: 
Mellanox, Intel, Broadcom, Chelsio, Cavium, HP, Dell, Emulex and more. As a 
matter of fact, most people are not even aware that their existing setups 
already support RDMA, and this is where we come in and try to make this 
technology accessible and seamless.
Also, cloud provider support is growing fast: Microsoft Azure A and H nodes 
have supported RDMA for a while now.

Regarding the pluggable mechanism:
I think that we, as Spark advocates and enthusiasts, would like to keep Spark 
a framework that delivers uncontested performance.
Lower-level integration is reaching almost every mainstream framework, most 
recently with GPUs and ASICs, and RDMA is now taking its place alongside them.
RDMA is already supported natively in today's most popular distributed ML 
platforms: TensorFlow, Caffe2 and CNTK, and is being driven into others as well.

I think that in order for Spark to keep up with today's performance challenges, 
we must allow some lower-level integration, especially where mature and proven 
technologies such as RDMA are concerned.

> SPIP: RDMA Accelerated Shuffle Engine
> -------------------------------------
>
>                 Key: SPARK-22229
>                 URL: https://issues.apache.org/jira/browse/SPARK-22229
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Yuval Degani
>         Attachments: 
> SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits 
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin 
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O 
> processing overhead by bypassing the kernel and networking stack and by 
> avoiding memory copies entirely. The freed CPU cycles are then consumed 
> directly by the actual Spark workloads, which helps reduce job runtime 
> significantly. 
> This performance gain is demonstrated with both the industry-standard HiBench 
> TeraSort benchmark (a 1.5x speedup in sorting) and shuffle-intensive 
> customer applications. 
> SparkRDMA will be presented at Spark Summit 2017 in Dublin 
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.



