[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhang, Liye updated SPARK-4740:
-------------------------------
    Attachment: repartition test.7z

Hi [~rxin], [~adav], I ran several tests on HDDs and on ramdisk with 
*repartition(192)* to compare shuffle performance between NIO and Netty, 
using the same dataset as before (400GB). I have uploaded the archive 
*repartition test.7z*, which contains the results of 6 tests:
1. NIO on ramdisk
2. NIO on HDDs
3. Netty on ramdisk with connectionPerPeer set to 1
4. Netty on ramdisk with connectionPerPeer set to 8
5. Netty on HDDs with connectionPerPeer set to 1
6. Netty on HDDs with connectionPerPeer set to 8
P.S. In the attached HTML reports, the unit of IO throughput is requests, not bytes.
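
For reference, each run boils down to a driver along these lines. This is 
only a minimal sketch, not the exact harness: the input path is a 
placeholder, *spark.shuffle.blockTransferService* is the standard Spark 1.x 
switch between nio and netty, and I am assuming the connections-per-peer knob 
is exposed as *spark.shuffle.io.connectionsPerPeer* as in [~rxin]'s patch.
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of one test run; not the exact harness used here.
object RepartitionTest {
  def main(args: Array[String]): Unit = {
    val service = args(0) // "nio" or "netty"
    val conns   = args(1) // "1" or "8"; only meaningful for netty

    val conf = new SparkConf()
      .setAppName(s"repartition-test-$service-conns$conns")
      .set("spark.shuffle.blockTransferService", service)
      .set("spark.shuffle.io.connectionsPerPeer", conns) // assumed name, per the patch
    val sc = new SparkContext(conf)

    // Shuffle the whole input into 192 partitions and materialize it,
    // so the reduce-side fetch being timed actually runs.
    val start = System.nanoTime()
    sc.textFile("hdfs:///path/to/input") // placeholder for the 400GB dataset
      .repartition(192)
      .count()
    println(s"reduce took ${(System.nanoTime() - start) / 1e9} seconds")

    sc.stop()
  }
}
{code}
Each of the 6 tests then just varies the two arguments, e.g. {{nio 1}}, 
{{netty 1}}, {{netty 8}}.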

From the 6 tests, it is very obvious that reduce performance improves a lot 
when *connectionPerPeer* is raised from 1 to 8, on both ramdisk and HDDs.

For HDDs, the reduce time of Netty with *connectionPerPeer=8* is about the 
same as NIO's (about 6.7 minutes).

For ramdisk, Netty outperforms NIO even with *connectionPerPeer=1*. That is 
because NIO is memory-bandwidth bound in this case, which I have confirmed 
with other tools; it is also why the CPU utilization of NIO in the reduce 
phase is only about 50%. Netty, by contrast, can still gain performance by 
increasing the value of *connectionPerPeer*. This is expected, because NIO 
needs some extra memory copies compared to Netty.

Before these 6 tests, I monitored the IO of the HDDs case with *iostat*. With 
*connectionPerPeer* left at its default (=1), Netty's read request queue 
size, read requests per second, await, and %util are all smaller than NIO's, 
which indicates that Netty does not achieve good read parallelism.

So far we can confirm that Netty does not get good read concurrency on a 
small cluster with many disks (unless *connectionPerPeer* is set), but we 
still cannot conclude that Netty runs faster than NIO on HDDs.

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> ------------------------------------------------------------------------
>
>                 Key: SPARK-4740
>                 URL: https://issues.apache.org/jira/browse/SPARK-4740
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Zhang, Liye
>            Assignee: Reynold Xin
>         Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, repartition test.7z, 
> rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing the current Spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc.), the Netty based shuffle 
> transferService takes a much longer time than the NIO based shuffle 
> transferService. The network throughput of Netty is only about half of 
> that of NIO. 
> We tested in standalone mode; the dataset used for the test is 20 billion 
> records, about 400GB in total. The spark-perf test runs on a 4 node 
> cluster with 10G NIC, 48 cpu cores per node, and 64GB of memory per 
> executor. The number of reduce tasks is set to 1000. 


