Re: Data correctness issue with Repartition + FetchFailure

Wenchen Fan Mon, 14 Mar 2022 19:01:42 -0700

We fixed the repartition correctness bug before by sorting the data before
doing round-robin partitioning. The remaining issue is that we need to
propagate the isDeterministic property through SQL operators.
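
As a rough illustration of the point above, here is a toy model in plain
Python. It is not Spark's actual code, and the names round_robin and
map_task_attempt are made up for this sketch. Round-robin assignment depends
only on the order in which rows arrive, so a retried task that sees a
different order sends rows to different partitions. Sorting first removes
that dependence, but only if the rows themselves are identical on every
attempt, which is presumably why the determinism of upstream operators
needs to be known:

```python
def round_robin(rows, num_partitions):
    """Assign each row to a partition purely by its position in the input."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def map_task_attempt(arrival_order, num_partitions, sort_first):
    """One attempt of a hypothetical map task; optionally sort before partitioning."""
    rows = sorted(arrival_order) if sort_first else list(arrival_order)
    return round_robin(rows, num_partitions)


# Same rows, but the retry (e.g. after a fetch failure) sees them in another order.
first_order = [3, 1, 4, 0, 2, 5]
retry_order = [5, 2, 0, 4, 1, 3]

# Without sorting, the two attempts disagree on which partition holds which row.
unsorted_differs = (
    map_task_attempt(first_order, 3, sort_first=False)
    != map_task_attempt(retry_order, 3, sort_first=False)
)

# With sorting, both attempts produce identical partitions.
sorted_matches = (
    map_task_attempt(first_order, 3, sort_first=True)
    == map_task_attempt(retry_order, 3, sort_first=True)
)

# If the values themselves change between attempts (assumed here to come from
# something like a random expression upstream), sorting no longer helps.
first_values = [0.2, 0.7, 0.5]
retry_values = [0.9, 0.1, 0.4]
nondeterministic_still_differs = (
    map_task_attempt(first_values, 2, sort_first=True)
    != map_task_attempt(retry_values, 2, sort_first=True)
)

print(unsorted_differs, sorted_matches, nondeterministic_still_differs)
```

In this model, a reducer that already fetched partition 0 from the first
attempt and then fetches partition 1 from the retry can see some rows twice
and miss others whenever the two attempts disagree.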


On Tue, Mar 15, 2022 at 1:50 AM Jason Xu <[email protected]> wrote:

> Hi Reynold, do you suggest removing RoundRobinPartitioning from the
> repartition(numPartitions: Int) API implementation? If that's the direction
> we're considering, should we advise users to avoid the
> repartition(numPartitions: Int) API until we have a new implementation?
>
> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin <[email protected]> wrote:
>
>> This is why RoundRobinPartitioning shouldn't be used ...
>>
>>
>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu <[email protected]>
>> wrote:
>>
>>> Hi Spark community,
>>>
>>> I reported a data correctness issue in
>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>> non-deterministic data + Repartition + FetchFailure can result in
>>> incorrect data. This is an issue we run into in production pipelines, and
>>> the ticket includes an example that reproduces the bug.
>>>
>>> I am reporting it here to bring more attention to it. Could you help
>>> confirm that it is a bug and worth the effort to investigate further and
>>> fix? Thank you in advance for your help!
>>>
>>> Thanks,
>>> Jason Xu
>>>
>>
>>
