[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144060#comment-17144060 ]
Min Shen commented on SPARK-30602:
----------------------------------

Also want to share the production results we have so far. We have deployed Magnet, the new push-based shuffle service, to our biggest production cluster at LinkedIn. This is a very large-scale and busy cluster: we run 30K+ Spark applications daily, which shuffle ~5 PB of data. We have enabled a few fairly complex production flows in our cluster to start using the new push-based shuffle (see the configuration sketch at the end of this message). The result of using the new push-based shuffle mechanism is shown in the screenshot below. Here, we used an internal performance comparison tool to measure the effect of enabling push-based shuffle. We saw a huge reduction in shuffle read wait time, which also significantly brought down the total executor runtime.

!Screen Shot 2020-06-23 at 11.31.22 AM.jpg!

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --------------------------------------------------------------
>
>                 Key: SPARK-30602
>                 URL: https://issues.apache.org/jira/browse/SPARK-30602
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Min Shen
>            Priority: Major
>        Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, vldb_2020_magnet_shuffle.pdf
>
>
> In a large deployment of Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When running Spark on YARN at large scale, people usually enable the Spark external shuffle service and store the intermediate shuffle files on HDDs. Because the number of blocks generated for a particular shuffle grows quadratically with the size of the shuffled data (the numbers of mappers and reducers grow linearly with the size of the shuffled data, but the number of blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only tens of KBs. Because of the inefficiency of performing random reads on HDDs for small amounts of data, the overall efficiency of the Spark external shuffle service in serving shuffle blocks degrades as an increasing number of Spark applications process an increasing amount of data. In addition, because the Spark external shuffle service is a shared service in a multi-tenant cluster, the inefficiency of one Spark application can propagate to other applications as well.
>
> In this ticket, we propose a solution to improve Spark shuffle efficiency in the above-mentioned environments with push-based shuffle. With push-based shuffle, shuffle is performed at the end of the map tasks, and blocks get pre-merged and moved towards the reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. We take a Spark-native approach to achieve this, i.e., we extend Spark's existing Netty-based shuffle protocol and the behaviors of Spark mappers, reducers, and drivers. This way, we can bring the benefits of more efficient shuffle to Spark without incurring the dependency or overhead of either a specialized storage layer or external infrastructure pieces.
>
> Link to dev mailing list discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
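
As a back-of-envelope illustration of the quadratic block-count growth described in the ticket above: since # blocks = # mappers * # reducers while both mapper and reducer counts grow roughly linearly with the shuffled data size, the average block size shrinks as the data grows. The sketch below uses made-up numbers chosen only to show the trend, not measurements from our clusters.

{code:scala}
object ShuffleBlockMath {
  // Average shuffle block size in KB, assuming the shuffled data is spread
  // evenly across the (# mappers * # reducers) blocks of a single shuffle.
  def avgBlockSizeKB(shuffleBytes: Double, mappers: Long, reducers: Long): Double =
    shuffleBytes / (mappers * reducers) / 1024

  def main(args: Array[String]): Unit = {
    val TB = math.pow(2, 40)
    // 1 TB shuffled by 5,000 mappers x 5,000 reducers = 25M blocks, ~43 KB each.
    println(f"1 TB: ${avgBlockSizeKB(1 * TB, 5000, 5000)}%.1f KB per block")
    // 4x the data with 4x the mappers and reducers: 400M blocks, ~11 KB each.
    println(f"4 TB: ${avgBlockSizeKB(4 * TB, 20000, 20000)}%.1f KB per block")
  }
}
{code}

Growing the data 4x shrinks the average block size by 4x, which is how clusters end up with shuffle blocks of only tens of KBs and the resulting small random reads on HDDs.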
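For reference, this is roughly what turning on push-based shuffle looks like from the application side in our deployment. The configuration keys below are a sketch only: spark.shuffle.service.enabled is an existing Spark setting, while the push-specific key is an assumed name for the client-side switch proposed in this SPIP and may differ from what is finally merged.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("push-based-shuffle-example")
  // Push-based shuffle builds on the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  // Assumed client-side flag to opt a job into push-based shuffle;
  // the final config name is decided by the implementation, not this ticket.
  .config("spark.shuffle.push.enabled", "true")
  .getOrCreate()
{code}

On the node manager side, the Spark YARN shuffle service would additionally need to be deployed with the push/merge functionality, which is part of the server-side changes proposed here.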