Hi all, I have two questions about shuffle time and degree of parallelism.

Question 1: assume the cluster size is fixed, for example 16 EC2 nodes with 2 cores each.

Case 1: a total shuffle of 64 GB of data across 32 partitions.
Case 2: a total shuffle of 128 GB of data across 128 partitions.

Data is evenly distributed across the cluster, yet we found the shuffle time of case 2 to be 5x-6x longer than case 1. One reason I can guess at is the OS buffer cache: there is more data in case 2, so the cache is exhausted and we have to write to and read from disk directly. Can anyone give reasons from the perspective of network transfer?
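One thing worth counting besides the raw data volume is the number of shuffle files. Here is a back-of-the-envelope sketch (plain arithmetic, not a Spark API); it assumes the classic hash-based shuffle, where every map partition writes one file per reduce partition, so with M == R == partitions you get partitions^2 files:

```python
# Hypothetical model of hash-shuffle file counts; assumes one file per
# (map partition, reduce partition) pair, i.e. files = M * R.
GB = 1024 ** 3
MB = 1024 ** 2

def shuffle_files(total_bytes, partitions):
    """Return (number of shuffle files, bytes per file) for M == R == partitions."""
    files = partitions * partitions
    return files, total_bytes // files

files1, size1 = shuffle_files(64 * GB, 32)    # case 1
files2, size2 = shuffle_files(128 * GB, 128)  # case 2
print(files1, size1 // MB)  # 1024 files of 64 MB each
print(files2, size2 // MB)  # 16384 files of 8 MB each
```

Under this model, case 2 has 16x as many files, each 8x smaller, so on top of the 2x data you get many more small, random disk reads/writes and many more fetch streams over the network, which could plausibly account for a superlinear slowdown.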
Question 2: we all know that a higher degree of parallelism means lower computation time, but what is its impact on shuffle time? This time assume the data size is fixed, for example 64 GB, and the cluster is large enough.

Case 1: data evenly distributed among 32 partitions, then a total shuffle.
Case 2: data evenly distributed among 64 partitions, then a total shuffle.

Will the shuffle time be halved in case 2? Appreciate your help :)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/questions-about-shuffle-time-and-parallel-degree-tp8512.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
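P.S. A rough sketch of why I would not expect the shuffle time to halve: under an even all-to-all model, the bytes each node ships over the network do not depend on the partition count at all. This is pure arithmetic, not a Spark call, and the 16-node figure is just an illustrative assumption:

```python
# Rough network-volume model for a total (all-to-all) shuffle. Assumes
# data and partitions are spread evenly over `nodes` and destination
# partitions are uniform, so a record stays node-local with
# probability 1/nodes. The partition count never appears.
GB = 1024 ** 3

def outbound_per_node(total_bytes, nodes):
    """Bytes each node must send to other nodes during the shuffle."""
    return total_bytes // nodes * (nodes - 1) // nodes

# 64 GB on a hypothetical 16-node cluster: the wire volume per node is
# identical whether the data sits in 32 or in 64 partitions.
print(outbound_per_node(64 * GB, 16) / GB)  # 3.75
```

So doubling the partitions only changes the granularity of the transfers (more, smaller fetches, possibly better pipelining), not the total network volume, which is why I would not expect anything close to a 2x speedup from the network side.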