Eh, there is a huge caveat - you are making your input non-deterministic, where determinism is assumed. I don't think that supports such a drastic statement.
On Wed, Jun 22, 2022 at 12:39 PM Igor Berman <igor.ber...@gmail.com> wrote:

> Hi All
> tldr; IMHO repartition(n) should be deprecated or red-flagged, so that
> everybody understands the consequences of using this method
>
> Following the conversation in
> https://issues.apache.org/jira/browse/SPARK-38388 (still relevant for
> recent versions of Spark), I think it's very important to mark this
> function somehow and to alert the end user to the consequences of such usage
>
> Basically it may produce duplicates and data loss under retries for
> several kinds of input: among them non-deterministic input, but more
> importantly input that is deterministic but might not produce exactly the
> same results due to the precision of doubles (and floats) in very simple
> queries like the following
>
> sqlContext.sql(
>   """ SELECT integerColumn, SUM(someDoubleTypeValue) AS value
>       FROM data
>       GROUP BY integerColumn """
> ).repartition(3)
>
> (see comment from Tom in ticket)
>
> As an end user I'd expect the retries mechanism to work in a consistent
> way and not to drop data silently (nor to produce duplicates)
>
> Any thoughts?
> thanks in advance
> Igor
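
To make the double-precision point concrete, here is a minimal plain-Python sketch (no Spark involved). It assumes a toy model in which repartition(n) sorts rows locally and then deals them out round-robin; that model, and the helper names sum_in_order, round_robin and rows, are illustrative assumptions, not Spark's actual implementation. It shows that a SUM over the same doubles can differ in the last bit when the addition order changes, and how that could shift rows between partitions when only one partition is recomputed.

```python
def sum_in_order(values):
    """Add doubles left to right, as one particular task attempt might."""
    total = 0.0
    for v in values:
        total += v
    return total


def round_robin(rows, n):
    """Toy stand-in for repartition(n): sort rows, then deal them out in turn.

    This is an assumed simplification, not Spark's real code path.
    """
    parts = [[] for _ in range(n)]
    for i, row in enumerate(sorted(rows)):
        parts[i % n].append(row)
    return parts


doubles = [0.1, 0.2, 0.3]
first_attempt = sum_in_order(doubles)          # (0.1 + 0.2) + 0.3
retry = sum_in_order(list(reversed(doubles)))  # (0.3 + 0.2) + 0.1


def rows(sum_for_key_1):
    # (value, integerColumn) rows; only key 1's sum depends on addition order.
    return [(sum_for_key_1, 1), (0.6, 2), (0.7, 3)]


before = round_robin(rows(first_attempt), 3)
after = round_robin(rows(retry), 3)

# Suppose partition 1 is lost and recomputed, while partitions 0 and 2
# are kept from the first attempt.
surviving = before[0] + after[1] + before[2]
surviving_keys = sorted(key for _, key in surviving)

print(first_attempt == retry)  # False: the two sums differ in the last bit
print(surviving_keys)          # [2, 2, 3]: key 2 duplicated, key 1 lost
```

In this toy run the input data is identical on both attempts; only the order of the floating-point additions differs, which is enough to change the sort order and therefore the partition each row lands in.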