Eh, there is a huge caveat - you are making your input non-deterministic,
where determinism is assumed. I don't think that supports such a drastic
statement.

On Wed, Jun 22, 2022 at 12:39 PM Igor Berman <igor.ber...@gmail.com> wrote:

> Hi All
> TL;DR: IMHO repartition(n) should be deprecated or red-flagged, so that
> everybody understands the consequences of using this method.
>
> Following the conversation in
> https://issues.apache.org/jira/browse/SPARK-38388 (still relevant for
> recent versions of Spark), I think it's very important to mark this function
> somehow and to alert end users to the consequences of using it.
>
> Basically, it may produce duplicates and data loss under retries for
> several kinds of input: among them non-deterministic input, but more
> importantly input that is deterministic but might not produce exactly the
> same results due to the precision of doubles (and floats) in very simple
> queries like the following:
>
> sqlContext.sql(
> """SELECT integerColumn, SUM(someDoubleTypeValue) AS value
>    FROM data
>    GROUP BY integerColumn"""
> ).repartition(3)
>
> (see comment from Tom in ticket)
>
> As an end user, I'd expect the retry mechanism to work consistently
> and not to drop data silently (or produce duplicates).
>
> Any thoughts?
> thanks in advance
> Igor
>
>
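
For anyone following along, the floating-point point in the quoted message is
easy to reproduce without Spark. Below is a minimal plain-Python sketch; the
helper name sum_in_order and the sample values are made up for illustration
and are not taken from the ticket. It only shows that summing the same
doubles in a different order can give a different result, which is why a
retried SUM over doubles may not be bit-for-bit identical to the first
attempt. It does not model repartition itself.

```python
import math


def sum_in_order(values):
    """Add doubles strictly left to right, as one task attempt might."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three inputs, seen in two different orders (for example, because
# shuffle blocks arrived in a different order on a retry).
values = [0.1, 0.2, 0.3]

first_attempt = sum_in_order(values)            # (0.1 + 0.2) + 0.3
retry_attempt = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

print(first_attempt)   # 0.6000000000000001
print(retry_attempt)   # 0.6

# Mathematically equal, but not the same double.
print(first_attempt == retry_attempt)              # False
print(math.isclose(first_attempt, retry_attempt))  # True
```

The two results differ only in the last bit, but anything that places or
orders rows based on their exact contents would see them as two different
rows across attempts.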
