Hi, All.
BTW, based on the feedback so far,
I updated all open `correctness` and `dataloss` issues as follows.
1. Raised the issue priority to `Blocker`.
2. Set the target version to `3.0.0`.
It's time to give those issues more visibility so that we can resolve or
close them.
The remaining questions are the following:
1. Should we revisit the `3.0.0`-only correctness patches?
2. Should we also set their target version to `2.4.5`? (Specifically, is
that feasible in terms of the timeline?)
Bests,
Dongjoon.
On Wed, Jan 22, 2020 at 9:43 AM Dongjoon Hyun <[email protected]>
wrote:
> Hi, Tom.
>
> Then, in light of the following, do you think we need to hold the 2.4.5
> release, too?
>
> > If it's really a correctness issue we should hold 3.0 for it.
>
> Recently,
>
> (1) 2.4.4 delivered 9 correctness patches.
> (2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches,
> too.
>
> SPARK-29101 CSV datasource returns incorrect .count() from file with malformed records (sketched below)
> SPARK-30447 Constant propagation nullability issue
> SPARK-29708 Different answers in aggregates of duplicate grouping sets
> SPARK-29651 Incorrect parsing of interval seconds fraction
> SPARK-29918 RecordBinaryComparator should check endianness when compared by long
> SPARK-29042 Sampling-based RDD with unordered input should be INDETERMINATE
> SPARK-30082 Zeros are being treated as NaNs
> SPARK-29743 sample should set needCopyResult to true if its child is
> SPARK-26985 Test "access only some column of the all of columns " fails on big endian
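>
> To make SPARK-29101 concrete, here is a rough, hypothetical sketch of the
> shape of that kind of issue. The file path, schema, and data below are
> made up and are not the JIRA's exact reproduction steps:
>
>     // Run in spark-shell, where `spark` is predefined.
>     // Suppose /tmp/data.csv contains one well-formed and one malformed line:
>     //   1,a
>     //   oops-only-one-field
>     val df = spark.read
>       .schema("id INT, name STRING")
>       .option("mode", "DROPMALFORMED")
>       .csv("/tmp/data.csv")
>
>     // For a correctness issue of this kind, the concern is that the two
>     // numbers below can disagree, because count() may skip full parsing
>     // and therefore not drop the malformed line.
>     println(df.count())
>     println(df.collect().length)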
>
> Without official Apache Spark 2.4.5 binaries,
> there is no official way to deliver the 9 correctness fixes in (2) to
> users.
> In addition, correctness fixes are usually independent of each other.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jan 22, 2020 at 7:02 AM Tom Graves <[email protected]> wrote:
>
>> I agree. I think we just need to go through all of them and individually
>> assess each one. If it's really a correctness issue, we should hold 3.0 for
>> it.
>>
>> On the 2.4 release, I didn't see an explanation on
>> https://issues.apache.org/jira/browse/SPARK-26154 of why it can't be
>> backported. At the very least, I think we need that in a comment on each
>> JIRA.
>>
>> SPARK-29701 looks more like compatibility with Postgres than a purely
>> wrong answer to me. If Spark has been consistent about that, it feels like
>> it can wait for 3.0, but it would be good to get others' input; I'm not an
>> expert on the SQL standard or on what the other SQL engines do in this case.
>>
>> Tom
>>
>> On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun <
>> [email protected]> wrote:
>>
>>
>> Hi, All.
>>
>> According to our policy, "Correctness and data loss issues should be
>> considered Blockers".
>>
>> - http://spark.apache.org/contributing.html
>>
>> Since we are close to the branch-3.0 cut,
>> I want to ask your opinions on the following correctness and data loss
>> issues.
>>
>> SPARK-30218 Columns used in inequality conditions for joins not resolved correctly in case of common lineage
>> SPARK-29701 Different answers when empty input given in GROUPING SETS
>> SPARK-29699 Different answers in nested aggregates with window functions
>> SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
>> SPARK-28125 dataframes created by randomSplit have overlapping rows
>> SPARK-28067 Incorrect results in decimal aggregation with whole-stage code gen enabled
>> SPARK-28024 Incorrect numeric values when out of range
>> SPARK-27784 Alias ID reuse can break correctness when substituting foldable expressions
>> SPARK-27619 MapType should be prohibited in hash expressions
>> SPARK-27298 Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
>> SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
>> SPARK-27213 Unexpected results when filter is used after distinct
>> SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
>> SPARK-25150 Joining DataFrames derived from the same source yields confusing/incorrect results (see the sketch after this list)
>> SPARK-21774 The rule PromoteStrings cast string to a wrong data type
>> SPARK-19248 Regex_replace works in 1.6 but not in 2.0
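>>
>> For anyone less familiar with the "common lineage" issues above
>> (SPARK-30218 and SPARK-25150, sketched here), the general shape is a
>> self-join on DataFrames that trace back to the same source. The data
>> below is hypothetical and only illustrates the shape, not the JIRAs'
>> exact reproductions:
>>
>>     // Run in spark-shell, where `spark` and its implicits are available.
>>     import spark.implicits._
>>
>>     val base     = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
>>     val filtered = base.filter($"id" > 1)
>>
>>     // Because both DataFrames share lineage, base("id") and filtered("id")
>>     // can resolve to the same attribute, so the join condition may not
>>     // mean what it appears to mean.
>>     val joined = base.join(filtered, base("id") === filtered("id"))
>>     joined.show()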
>>
>> Some of them are targeted at 3.0.0, but the others are not.
>> Although we will keep working on them until 3.0.0,
>> I'm not sure we can reach a state with no known correctness or data
>> loss issues.
>>
>> What do you think about the above issues?
>>
>> Bests,
>> Dongjoon.
>>
>