I'd argue it's strange and unexpected.
I understand there are precision issues here, and I'm fine with the result
being slightly different each time for that specific column.
What I'm not expecting (as an end user, for sure) is that a presumably trivial
computation might, under retry scenarios, cause a few hundred rows to be
duplicated and the same number to be dropped (since one precision shift can
move a few hundred rows in the local sort done by repartition(n)).
Maybe what I'm trying to say is that the repartition documentation doesn't
hint in any way that this can happen, and maybe it should.
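
To make the mechanism concrete, here is a minimal sketch (mine, not from the
thread; the literal values are arbitrary) of why the aggregated double can
differ between attempts: floating-point addition is not associative, so
combining partial sums in a different order changes the last bits of the
result, and that tiny difference is enough to reorder rows in the local sort
backing repartition(n).

// Illustration only: floating-point addition is not associative,
// so the order in which partial sums are combined matters.
val a = (0.1 + 0.2) + 0.3   // 0.6000000000000001
val b = 0.1 + (0.2 + 0.3)   // 0.6
println(a == b)             // false

// A SUM over a double column combines per-partition partial sums in whatever
// order the shuffle delivers them, so a retried task can end up with a value
// that differs in the last bit, and therefore sorts differently before the
// round-robin partitioning.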

* I'm aware of coalesce, but it has its own problems because it influences the
parallelism of all the transformations/filters up to the last shuffle/exchange
(see the sketch below)
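
For completeness, a rough sketch of that trade-off (my own; it reuses the
aggregation from the query quoted below, and the output paths are
hypothetical):

// coalesce(n) avoids a shuffle by merging partitions, but it is planned into
// the same stage as the work before it, so everything up to the last
// shuffle/exchange then runs with only n tasks.
val aggregated = sqlContext.sql(
  """SELECT integerColumn, SUM(someDoubleTypeValue) AS value
     FROM data
     GROUP BY integerColumn""")

// The post-shuffle side of the aggregation now runs with only 3 tasks:
aggregated.coalesce(3).write.parquet("/tmp/out-coalesce")

// repartition(3) keeps the upstream parallelism but adds a round-robin
// shuffle, which is exactly where the retry problem discussed here lives:
aggregated.repartition(3).write.parquet("/tmp/out-repartition")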





On Wed, 22 Jun 2022 at 20:43, Sean Owen <sro...@gmail.com> wrote:

> Eh, there is a huge caveat - you are making your input non-deterministic,
> where determinism is assumed. I don't think that supports such a drastic
> statement.
>
> On Wed, Jun 22, 2022 at 12:39 PM Igor Berman <igor.ber...@gmail.com>
> wrote:
>
>> Hi All
>> tl;dr: IMHO repartition(n) should be deprecated or red-flagged, so that
>> everybody understands the consequences of using this method.
>>
>> Following the conversation in
>> https://issues.apache.org/jira/browse/SPARK-38388 (still relevant for
>> recent versions of Spark), I think it's very important to mark this function
>> somehow and to alert end users about the consequences of such usage.
>>
>> Basically, it may produce duplicates and data loss under retries for
>> several kinds of input: among them non-deterministic input, but more
>> importantly input that is deterministic yet may not produce exactly the
>> same results due to the precision of doubles (and floats), even in very
>> simple queries like the following:
>>
>> sqlContext.sql(
>>   """SELECT integerColumn, SUM(someDoubleTypeValue) AS value
>>      FROM data
>>      GROUP BY integerColumn"""
>> ).repartition(3)
>>
>> (see Tom's comment in the ticket)
>>
>> As an end user, I'd expect the retry mechanism to work in a consistent
>> way and not to drop data silently (nor to produce duplicates).
>>
>> Any thoughts?
>> thanks in advance
>> Igor
>>
>>
