Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20414

Talked to @yanboliang offline; the major use cases of RDD/DataFrame `repartition()` in ML workloads he has observed are:

1. When saving models, you may need `repartition()` to reduce the number of output files; a typical special case is `xxx.repartition(1)`.
2. You may use `repartition()` to give the original data set more partitions, to gain higher parallelism for the following operations.

For the first case, you should actually use `coalesce()` instead of `repartition()` to get a similar effect, without the need for another shuffle! Also, this scenario doesn't strictly require the data set to be distributed evenly, so the change from round-robin partitioning to hash partitioning should be fine (see the sketch below).

For the second case, if you have a bunch of rows with the same values, the change may lead to heavy data skew and bring a performance regression. Currently the best-effort approach we can choose is to perform a local sort if the data type is comparable (and that also requires a lot of work refactoring the `ExternalSorter`). Another approach is to let the `ShuffleBlockFetcherIterator` remember the sequence of block fetches and force the blocks to be fetched one by one. This actually involves more issues, because we may hit the memory limit and therefore have to spill the fetched blocks. IIUC this should resolve the issue for the general case.
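To make the two use cases concrete, here is a minimal sketch (the DataFrame `predictions` and the `/tmp/...` output paths are hypothetical, only for illustration) contrasting `repartition(1)`, which inserts a full shuffle, with `coalesce(1)`, which merges existing partitions without one:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RepartitionVsCoalesceSketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for model output / predictions.
    val predictions = Seq((1L, 0.9), (2L, 0.1), (3L, 0.7)).toDF("id", "score")

    // Case 1: reduce the number of output files.
    // repartition(1) triggers a full shuffle just to merge partitions...
    predictions.repartition(1)
      .write.mode("overwrite").parquet("/tmp/preds-repartition")
    // ...while coalesce(1) merges the existing partitions with a narrow
    // dependency, avoiding the shuffle entirely.
    predictions.coalesce(1)
      .write.mode("overwrite").parquet("/tmp/preds-coalesce")

    // Case 2: increase parallelism for downstream operations.
    // Round-robin partitioning spreads rows evenly regardless of content;
    // under hash partitioning, many identical rows would all hash to the
    // same partition, which is the skew concern raised above.
    val widened = predictions.repartition(8)
    println(s"partitions after repartition(8): ${widened.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

One caveat on the design trade-off: because `coalesce(1)` narrows the dependency instead of shuffling, it also collapses upstream work into a single task, so it fits the save-to-few-files case but not the parallelism case.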