[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

squito Fri, 10 Aug 2018 13:34:58 -0700

Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    Yeah, you'd have to sort on the entire record.  I think originally it
didn't seem like that would work, because you don't know that `T` is sortable
for `RDD[T]`.  But once the records are serialized, you've got bytes, so you
can at least sort the serialized bytes of `T`.  Note that you don't want to do
the entire repartitioning based on that byte order, since you won't get a good
distribution for skewed data.  But it can give you a deterministic order as an
*additional* sort, just within one partition.
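
A minimal sketch of the idea, in plain Python rather than Spark, in case it
helps: records that have no natural ordering are sorted by their serialized
bytes within one partition, and only then handed out round-robin.  Everything
here is an assumption for illustration: `pickle` stands in for whatever
serializer is actually in use, and `serialized_bytes`,
`deterministic_partition_order`, and `round_robin` are hypothetical names, not
Spark APIs.

```python
import pickle
from typing import Any, Iterable, List


def serialized_bytes(record: Any) -> bytes:
    # Stand-in serializer (assumption): a fixed pickle protocol, so the same
    # record always maps to the same byte string.
    return pickle.dumps(record, protocol=4)


def deterministic_partition_order(records: Iterable[Any]) -> List[Any]:
    # Sort by the serialized form, not by the record itself, so this works
    # even when the record type defines no ordering.
    return sorted(records, key=serialized_bytes)


def round_robin(records: List[Any], num_partitions: int) -> List[List[Any]]:
    # Hand out records one by one; the result depends on the input order,
    # which is why that order has to be deterministic first.
    out: List[List[Any]] = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        out[i % num_partitions].append(record)
    return out


# The same partition, as two task attempts might see it: same records,
# different arrival order.  Dicts cannot be compared with `<` in Python.
attempt_1 = [{"id": 2, "v": 20}, {"id": 1, "v": 10}, {"id": 3, "v": 30}]
attempt_2 = [{"id": 3, "v": 30}, {"id": 2, "v": 20}, {"id": 1, "v": 10}]

ordered_1 = deterministic_partition_order(attempt_1)
ordered_2 = deterministic_partition_order(attempt_2)

# Without the extra sort the two attempts can place records differently;
# with it, both attempts produce the same placement.
same_without_sort = round_robin(attempt_1, 2) == round_robin(attempt_2, 2)
same_with_sort = round_robin(ordered_1, 2) == round_robin(ordered_2, 2)
```

The byte order is only used to fix the order inside one partition.  The
distribution across output partitions still comes from the round-robin step,
so skewed data is not bucketed by its byte values.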



