It would be great if you could help with it! Basically, we need to propagate column-level determinism information and sort the inputs if the partition key lineage has a nondeterministic part.
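
To illustrate why sorting matters for round-robin partitioning, here is a minimal sketch in plain Python. It is a toy model, not Spark code: `round_robin`, `first_attempt`, and `retry_attempt` are hypothetical names made up for this example, and the model assumes a retried upstream task returns the same rows in a different order.

```python
from typing import Any, List, Sequence


def round_robin(rows: Sequence[Any], num_partitions: int, sort_first: bool) -> List[List[Any]]:
    """Assign rows to partitions by position, optionally sorting them first."""
    ordered = sorted(rows) if sort_first else list(rows)
    partitions: List[List[Any]] = [[] for _ in range(num_partitions)]
    for i, row in enumerate(ordered):
        # The target partition depends only on the row's position in the input.
        partitions[i % num_partitions].append(row)
    return partitions


# Assumption: same rows, but the retried task produces them in another order.
first_attempt = [3, 1, 4, 1, 5, 9, 2, 6]
retry_attempt = [6, 2, 9, 5, 1, 4, 1, 3]

unsorted_a = round_robin(first_attempt, 3, sort_first=False)
unsorted_b = round_robin(retry_attempt, 3, sort_first=False)
sorted_a = round_robin(first_attempt, 3, sort_first=True)
sorted_b = round_robin(retry_attempt, 3, sort_first=True)

print(unsorted_a == unsorted_b)  # False: assignment depends on arrival order
print(sorted_a == sorted_b)      # True: sorting makes the assignment stable
```

Sorting only helps here because both attempts contain the same values. If the values themselves change on recomputation, the sorted order changes too, which is why the determinism information has to be propagated.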
On Wed, Mar 16, 2022 at 5:28 AM Jason Xu <jasonxu.sp...@gmail.com> wrote:

> Hi Wenchen, thanks for the insight. Agree, the previous fix for
> repartition works for deterministic data. With non-deterministic data, I
> didn't find an API to pass DeterministicLevel to underlying rdd.
> Do you plan to continue work on integration with SQL operators? If not,
> I'm available to take a stab.
>
> On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> We fixed the repartition correctness bug before, by sorting the data
>> before doing round-robin partitioning. But the issue is that we need to
>> propagate the isDeterministic property through SQL operators.
>>
>> On Tue, Mar 15, 2022 at 1:50 AM Jason Xu <jasonxu.sp...@gmail.com> wrote:
>>
>>> Hi Reynold, do you suggest removing RoundRobinPartitioning in
>>> repartition(numPartitions: Int) API implementation? If that's the direction
>>> we're considering, before we have a new implementation, should we suggest
>>> users avoid using the repartition(numPartitions: Int) API?
>>>
>>> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> This is why RoundRobinPartitioning shouldn't be used ...
>>>>
>>>>
>>>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu <jasonxu.sp...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Spark community,
>>>>>
>>>>> I reported a data correctness issue in
>>>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>>>> non-deterministic data + Repartition + FetchFailure could result in
>>>>> incorrect data, this is an issue we run into in production pipelines, I
>>>>> have an example to reproduce the bug in the ticket.
>>>>>
>>>>> I report here to bring more attention, could you help confirm it's a
>>>>> bug and worth effort to further investigate and fix, thank you in advance
>>>>> for help!
>>>>>
>>>>> Thanks,
>>>>> Jason Xu