Correctness and data loss issues

Dongjoon Hyun Sun, 19 Jan 2020 22:07:57 -0800

Hi, All.

According to our policy, "Correctness and data loss issues should be
considered Blockers".


    - http://spark.apache.org/contributing.html

Since we are close to the branch-3.0 cut,
I want to ask your opinions on the following correctness and data loss
issues.

    SPARK-30218 Columns used in inequality conditions for joins not
resolved correctly in case of common lineage
    SPARK-29701 Different answers when empty input given in GROUPING SETS
    SPARK-29699 Different answers in nested aggregates with window functions
    SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
    SPARK-28125 dataframes created by randomSplit have overlapping rows
    SPARK-28067 Incorrect results in decimal aggregation with whole-stage
code gen enabled
    SPARK-28024 Incorrect numeric values when out of range
    SPARK-27784 Alias ID reuse can break correctness when substituting
foldable expressions
    SPARK-27619 MapType should be prohibited in hash expressions
    SPARK-27298 Dataset except operation gives different results(dataset
count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
    SPARK-27282 Spark incorrect results when using UNION with GROUP BY
clause
    SPARK-27213 Unexpected results when filter is used after distinct
    SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive
table if schema evolves
    SPARK-25150 Joining DataFrames derived from the same source yields
confusing/incorrect results
    SPARK-21774 The rule PromoteStrings cast string to a wrong data type
    SPARK-19248 Regex_replace works in 1.6 but not in 2.0

Some of them are targeted at 3.0.0, but the others are not.
Although we will work on them until 3.0.0,
I'm not sure we can reach a state with no known correctness or data loss
issues.

What do you think about the above issues?

Bests,
Dongjoon.
