We fixed the repartition correctness bug earlier by sorting the data before doing round-robin partitioning. The remaining issue is that we need to propagate the isDeterministic property through SQL operators.
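To make the sort-before-round-robin idea concrete, here is a minimal sketch in plain Python rather than Spark code. The helper names (`round_robin`, `round_robin_sorted`) and the toy data are hypothetical and only model the behaviour: a retried task sees the same rows in a different order, and only some output partitions are re-fetched from the retry.

```python
def round_robin(rows, num_partitions):
    """Assign each row to a partition purely by its arrival position."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def round_robin_sorted(rows, num_partitions):
    """Sort first, so the assignment no longer depends on arrival order."""
    return round_robin(sorted(rows), num_partitions)


rows = list(range(10))
first_attempt = rows[:]            # order seen by the original task
retry = rows[1:] + rows[:1]        # same rows, different order on retry

# Simulate a fetch failure: partition 0 was already read from the first
# attempt, partitions 1 and 2 are read from the retried task.
plain_first = round_robin(first_attempt, 3)
plain_retry = round_robin(retry, 3)
mixed_plain = plain_first[0] + plain_retry[1] + plain_retry[2]

sorted_first = round_robin_sorted(first_attempt, 3)
sorted_retry = round_robin_sorted(retry, 3)
mixed_sorted = sorted_first[0] + sorted_retry[1] + sorted_retry[2]

# Without the sort, some rows are duplicated and others are lost.
print(sorted(mixed_plain) == rows)
# With the sort, both attempts produce the same partitions.
print(sorted(mixed_sorted) == rows)
```

Note that the sort only helps when the row values themselves are reproducible across attempts. If an upstream operator produces different values on each run, sorting cannot restore a stable assignment, so the planner has to know whether the input is deterministic.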
On Tue, Mar 15, 2022 at 1:50 AM Jason Xu <jasonxu.sp...@gmail.com> wrote:

> Hi Reynold, do you suggest removing RoundRobinPartitioning in
> repartition(numPartitions: Int) API implementation? If that's the direction
> we're considering, before we have a new implementation, should we suggest
> users avoid using the repartition(numPartitions: Int) API?
>
> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin <r...@databricks.com> wrote:
>
>> This is why RoundRobinPartitioning shouldn't be used ...
>>
>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu <jasonxu.sp...@gmail.com>
>> wrote:
>>
>>> Hi Spark community,
>>>
>>> I reported a data correctness issue in
>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>> non-deterministic data + Repartition + FetchFailure could result in
>>> incorrect data, this is an issue we run into in production pipelines, I
>>> have an example to reproduce the bug in the ticket.
>>>
>>> I report here to bring more attention, could you help confirm it's a bug
>>> and worth effort to further investigate and fix, thank you in advance for
>>> help!
>>>
>>> Thanks,
>>> Jason Xu