From: Dongjoon Hyun <dongjoon.h...@gmail.com>
Date: Wednesday, January 22, 2020 at 1:57 AM
To: Wenchen Fan <cloud0...@gmail.com>
Cc: dev <dev@spark.apache.org>
Subject: Re: Correctness and data loss issues
Thank you for checking, Wenchen! Sure, we need to do that.
Another question is: what can we do for the 2.4.5 release?
Some of the fixes cannot be backported because of technical difficulty, such as
the following.
1. https://issues.apache.org/jira/browse/SPARK-26154
Stream-stream joins - left outer join gives inconsistent output
(Like this one, there are eight correctness fixes that land only in 3.0.0.)
2. https://github.com/apache/spark/pull/27233
[SPARK-29701][SQL] Correct behaviours of group analytical queries when
empty input given
(This is an ongoing PR that is currently blocking 2.4.5 RC2.)
Bests,
Dongjoon.
On Tue, Jan 21, 2020 at 11:10 PM Wenchen Fan <cloud0...@gmail.com> wrote:
I think we need to go through them during the 3.0 QA period, and try to fix the
valid ones.
For example, the first ticket should be fixed already in
https://issues.apache.org/jira/browse/SPARK-28344
On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, All.
According to our policy, "Correctness and data loss issues should be considered
Blockers".
- http://spark.apache.org/contributing.html
Since we are close to the branch-3.0 cut,
I want to ask for your opinions on the following correctness and data loss issues.
SPARK-30218 Columns used in inequality conditions for joins not resolved
correctly in case of common lineage
SPARK-29701 Different answers when empty input given in GROUPING SETS
SPARK-29699 Different answers in nested aggregates with window functions
SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
SPARK-28125 dataframes created by randomSplit have overlapping rows
SPARK-28067 Incorrect results in decimal aggregation with whole-stage code
gen enabled
SPARK-28024 Incorrect numeric values when out of range
SPARK-27784 Alias ID reuse can break correctness when substituting foldable
expressions
SPARK-27619 MapType should be prohibited in hash expressions
SPARK-27298 Dataset except operation gives different results(dataset count)
on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
SPARK-27213 Unexpected results when filter is used after distinct
SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table
if schema evolves
SPARK-25150 Joining DataFrames derived from the same source yields
confusing/incorrect results
SPARK-21774 The rule PromoteStrings cast string to a wrong data type
SPARK-19248 Regex_replace works in 1.6 but not in 2.0
Some of them are targeted at 3.0.0, but the others are not.
Although we will work on them until 3.0.0,
I'm not sure we can reach a state with no known correctness or data loss
issues.
What do you think about the above issues?
Bests,
Dongjoon.