I'm working on the fix for SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243> and should be able to push another commit in 1-2 days. More detailed discussion can go to the PR. Thanks for pushing this issue forward! I really appreciate everyone's efforts in submitting PRs and actively taking part in the discussions!
2018-08-13 22:50 GMT+08:00 Tom Graves <tgraves...@yahoo.com.invalid>:

> I agree with Imran, we need to fix SPARK-23243
> <https://issues.apache.org/jira/browse/SPARK-23243> and any correctness
> issues for that matter.
>
> Tom
>
> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid
> <iras...@cloudera.com.INVALID> wrote:
>
>
> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>:
> Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about
> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
> long-standing issue, not a regression.
>
>
> This is a really serious data loss bug. Yes, it's very complex, but we
> absolutely have to fix this, I really think it should be in 2.4.
> Has work on it stopped?
>