Hi, Tom. Then, along with the following, do you think we need to hold the 2.4.5 release, too?
> If it's really a correctness issue we should hold 3.0 for it.

Recently,

(1) 2.4.4 delivered 9 correctness patches.
(2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches, too.

    SPARK-29101 CSV datasource returns incorrect .count() from file with
malformed records
    SPARK-30447 Constant propagation nullability issue
    SPARK-29708 Different answers in aggregates of duplicate grouping sets
    SPARK-29651 Incorrect parsing of interval seconds fraction
    SPARK-29918 RecordBinaryComparator should check endianness when
compared by long
    SPARK-29042 Sampling-based RDD with unordered input should be
INDETERMINATE
    SPARK-30082 Zeros are being treated as NaNs
    SPARK-29743 sample should set needCopyResult to true if its child is
    SPARK-26985 Test "access only some column of the all of columns" fails
on big endian

Without the official Apache Spark 2.4.5 binaries, there is no official way
to deliver the 9 correctness fixes in (2) to the users. In addition, the
correctness fixes are usually independent of each other.

Bests,
Dongjoon.

On Wed, Jan 22, 2020 at 7:02 AM Tom Graves <tgraves...@yahoo.com> wrote:

> I agree, I think we just need to go through all of them and individually
> assess each one. If it's really a correctness issue we should hold 3.0 for
> it.
>
> On the 2.4 release I didn't see an explanation on
> https://issues.apache.org/jira/browse/SPARK-26154 of why it can't be
> backported; at the very least we need that in each JIRA comment.
>
> SPARK-29701 looks more like compatibility with Postgres than a purely
> wrong answer to me. If Spark has been consistent about that, it feels like
> it can wait for 3.0, but it would be good to get others' input; I'm not an
> expert on the SQL standard or on what the other SQL engines do in this
> case.
>
> Tom
>
> On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
> According to our policy, "Correctness and data loss issues should be
> considered Blockers".
>
>     - http://spark.apache.org/contributing.html
>
> Since we are close to the branch-3.0 cut,
> I want to ask your opinions on the following correctness and data loss
> issues.
>
>     SPARK-30218 Columns used in inequality conditions for joins not
> resolved correctly in case of common lineage
>     SPARK-29701 Different answers when empty input given in GROUPING SETS
>     SPARK-29699 Different answers in nested aggregates with window
> functions
>     SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
>     SPARK-28125 dataframes created by randomSplit have overlapping rows
>     SPARK-28067 Incorrect results in decimal aggregation with whole-stage
> code gen enabled
>     SPARK-28024 Incorrect numeric values when out of range
>     SPARK-27784 Alias ID reuse can break correctness when substituting
> foldable expressions
>     SPARK-27619 MapType should be prohibited in hash expressions
>     SPARK-27298 Dataset except operation gives different results (dataset
> count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environments
>     SPARK-27282 Spark incorrect results when using UNION with GROUP BY
> clause
>     SPARK-27213 Unexpected results when filter is used after distinct
>     SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive
> table if schema evolves
>     SPARK-25150 Joining DataFrames derived from the same source yields
> confusing/incorrect results
>     SPARK-21774 The rule PromoteStrings casts string to a wrong data type
>     SPARK-19248 Regex_replace works in 1.6 but not in 2.0
>
> Some of them are targeted for 3.0.0, but the others are not.
> Although we will work on them until 3.0.0,
> I'm not sure we can reach a status with no known correctness and data loss
> issues.
>
> What do you think about the above issues?
>
> Bests,
> Dongjoon.
>
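P.S. To illustrate how subtle these issues can be for users, here is a
minimal sketch of the symptom reported in SPARK-29101 (the path, schema,
and data are hypothetical; this only illustrates the reported behavior,
not the actual patch): with mode=DROPMALFORMED, .count() could disagree
with the number of collected rows, because column pruning let the reader
skip parsing, so malformed lines were never dropped.

    // Hypothetical reproduction sketch for SPARK-29101 (illustrative only).
    // Assume /tmp/malformed.csv contains some lines that do not match the schema.
    val df = spark.read
      .schema("a INT, b STRING")
      .option("mode", "DROPMALFORMED")
      .csv("/tmp/malformed.csv")

    df.collect().length  // malformed lines are dropped while parsing
    df.count()           // before the fix, this could still count them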