Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    Talked to @yanboliang offline; he said the major use cases of RDD/DataFrame.repartition() he has observed in ML workloads are:
    1. When saving models, you may need `repartition()` to reduce the number of output files; a typical special case is `xxx.repartition(1)`;
    2. You may use `repartition()` to give the original data set more partitions, to gain higher parallelism for the following operations.
    
    Actually, for the first case you should use `coalesce()` instead of `repartition()` to get a similar effect, without the need for another shuffle! Also, this scenario doesn't strictly require the data to be distributed evenly across partitions, so the change from round-robin partitioning to hash partitioning should be fine.
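    As a minimal sketch of the save-model case (the DataFrame contents and output paths here are made-up placeholders):

    ```scala
    // coalesce(1) narrows the existing partitions into one without a shuffle,
    // while repartition(1) triggers a full shuffle for the same single output file.
    val df = spark.range(0, 1000000).toDF("id")

    // Cheaper: no shuffle, just merges the existing partitions into one.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/model-coalesce")

    // More expensive: a full shuffle for the same end result.
    df.repartition(1).write.mode("overwrite").parquet("/tmp/model-repartition")
    ```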
    For the latter case, if the data contains many rows with the same values, the change may lead to heavy data skew and bring a performance regression; currently the best-effort approach we can choose is to perform a local sort when the data type is comparable (and that also requires a lot of refactoring work on the `ExternalSorter`).
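    To illustrate the skew concern, here is a toy sketch (the data, column name, and partition counts are invented for illustration): hash-partitioning on a value sends every identical value to the same partition, while round-robin spreads rows evenly regardless of their values.

    ```scala
    import org.apache.spark.sql.functions.col

    // ~99% of the rows share a single key value (key = 0).
    val skewed = spark.range(0, 1000000)
      .selectExpr("IF(id % 100 = 0, id, 0L) AS key")

    // Round-robin repartition() spreads rows evenly regardless of their values.
    val roundRobin = skewed.repartition(200)

    // Hash-partitioning on the value sends every identical key to one partition,
    // so a single partition ends up holding ~99% of the data.
    val byHash = skewed.repartition(200, col("key"))

    def partitionSizes(df: org.apache.spark.sql.DataFrame): Array[Int] =
      df.rdd.mapPartitions(it => Iterator(it.size)).collect()

    println(s"round-robin max partition size: ${partitionSizes(roundRobin).max}")
    println(s"hash-on-key max partition size: ${partitionSizes(byHash).max}")
    ```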
    
    Another approach is that we could let the `ShuffleBlockFetcherIterator` remember the sequence of block fetches and force the blocks to be consumed one by one in that order. This actually involves more issues, because we may hit memory limits and therefore have to spill the fetched blocks. IIUC this should resolve the issue for the general case.
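    To make the idea concrete, here is a toy model of deterministic block ordering (this is not the actual `ShuffleBlockFetcherIterator` internals; all names are hypothetical): blocks may arrive in arbitrary order, but we buffer early arrivals and only emit the next block in the remembered fetch sequence.

    ```scala
    import scala.collection.mutable

    // Toy sketch: emit blocks strictly in `fetchOrder`, buffering out-of-order
    // arrivals. A real implementation would have to spill `pending` to disk
    // once it hits the memory limit.
    class OrderedBlockIterator[T](fetchOrder: Seq[String],
                                  arrivals: Iterator[(String, T)]) extends Iterator[T] {
      private val pending = mutable.Map.empty[String, T]
      private var pos = 0

      override def hasNext: Boolean = pos < fetchOrder.length

      override def next(): T = {
        val wanted = fetchOrder(pos)
        // Drain arrivals until the block we need next has shown up.
        while (!pending.contains(wanted) && arrivals.hasNext) {
          val (id, block) = arrivals.next()
          pending += id -> block
        }
        pos += 1
        pending.remove(wanted).getOrElse(
          throw new NoSuchElementException(s"block $wanted never arrived"))
      }
    }

    // Usage: arrival order differs from the remembered fetch order.
    val order = Seq("b0", "b1", "b2")
    val arrived = Iterator("b2" -> "dataC", "b0" -> "dataA", "b1" -> "dataB")
    new OrderedBlockIterator(order, arrived).foreach(println)  // dataA, dataB, dataC
    ```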

