Re: `Target Version` management on correctness/data-loss Issues

2020-01-28 Thread Dongjoon Hyun
Thanks, Tom.

I agree that emails are good for urgent announcements and for reaching
agreement quickly. They are also more visible over a short time period.

However, some correctness issues are long-standing, and sometimes they
change their faces under different JIRA IDs. We can see the relationship
easily in JIRA, but it's difficult in an email thread. Also, email search
is not so helpful because each individual email is read-only and not
itemized.

BTW, to all:
I'm continuing these correctness threads from multiple perspectives because
our RC process seems to be flaky. If we have a flaky test, we try to fix
it. Why not do the same for a flaky RC process? An RC is designed to be
allowed to fail, but that doesn't mean we don't need an efficient RC process.

The main root cause of RC failures is our insufficient management of, and
agreement on, `Target Version`.

Bests,
Dongjoon.

On Tue, Jan 28, 2020 at 5:47 AM Tom Graves  wrote:

> I was just thinking an info email  (perhaps tagged with
> correctness/dataloss) to dev rather than an official vote, that way its
> more visible and if anyone sees it and disagrees with the targeting it can
> be discussed on that thread.  It might also just bring more visibility to
> those important issues and get people interesting in working on them sooner.
>
> Tom
>
> On Monday, January 27, 2020, 02:31:03 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Yes. That is what I pointed in `Unfortunately, we didn't build a consensus
> on what is really blocked by that.` If you are suggesting a vote, do you
> mean a majority-win vote or an unanimous decision? Will it be a permanent
> decision?
>
> > I think the other interesting thing here is how exactly to come to
> agreement on whether it needs to be fixed in a particular release. Like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if its not going to be back ported to an affected release?
>
> The following seems to happen when the committers initially think like
> "Seems behavioral to me and its been consistent so seems ok to skip for
> 2.4.5"
> For example, SPARK-27619 MapType should be prohibited in hash expressions.
>
> > A) I'm not clear on this one as to why affected and target would be
> different initially,
>
> BTW, in this email thread, I'm focusing on the `Target Version` management.
> That is the only way to detect the community decision change.
>
> Bests,
> Dongjoon.
>
> On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:
>
> thanks for bringing this up.
>
> A) I'm not clear on this one as to why affected and target would be
> different initially, other then the reasons target versions != fixed
> versions.  Is the intention here just to say, if its already been discussed
> and came to consensus not needed in certain release?  The only other
> obvious time is in spark releases that are no longer maintained.
>
> I think the other interesting thing here is how exactly to come to
> agreement on whether it needs to be fixed in a particular release. Like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if its not going to be back ported to an affected release?
>
> Tom
> On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
> After 2.4.5 RC1 vote failure, I asked your opinions about
> correctness/dataloss issues (at mailing lists/JIRAs/PRs) in order to
> collect the current status and public opinion widely in the community to
> build a consensus on this at this time.
>
> Before talking about those issues, please remind that
>
> - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and
> 3.0.0 is not released.
> - Apache Spark community has the following rule: "Correctness and data
> loss issues should be considered Blockers."
>
> Unfortunately, we didn't build a consensus on what is really blocked by
> that. In reality, it was just our resolution for the quality and it works a
> little differently.
>
> In this email, I want to talk about correctness/dataloss issues and
> observed public opinions. They fall into the following categories roughly.
>
> 1. Resolved in both 3.0.0 and 2.4.x
>    - ex) SPARK-30447 Constant propagation nullability issue
>    - No problem. However, this case sometimes goes to (2)
>
> 2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
>    - ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match
>      Hive
>    - "We don't want to change the behavior in the maintenance release"
>
> 3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
>    - ex) SPARK-29906 Reading of csv file fails with adaptive execution
>      turned on
>    - No problem.
>
> 4. Resolved in 3.0.0 and not backported due to technical difficulty.
>- ex) SPARK-26154 Stream-stream joins - left 

Re: `Target Version` management on correctness/data-loss Issues

2020-01-28 Thread Tom Graves
I was just thinking an info email (perhaps tagged with correctness/dataloss)
to dev rather than an official vote; that way it's more visible, and if anyone
sees it and disagrees with the targeting, it can be discussed on that thread.
It might also just bring more visibility to those important issues and get
people interested in working on them sooner.

Tom

On Monday, January 27, 2020, 02:31:03 PM CST, Dongjoon Hyun wrote:

Yes. That is what I pointed out in `Unfortunately, we didn't build a consensus
on what is really blocked by that.` If you are suggesting a vote, do you mean a
majority-win vote or a unanimous decision? Will it be a permanent decision?

> I think the other interesting thing here is how exactly to come to agreement 
> on whether it needs to be fixed in a particular release. Like we have been 
> discussing on SPARK-29701. This could be a matter of opinion, so should we do 
> something like mail the dev list whenever one of these issues is tagged if 
> its not going to be back ported to an affected release?

The following seems to happen when the committers initially think like "Seems 
behavioral to me and its been consistent so seems ok to skip for 2.4.5"
For example, SPARK-27619 MapType should be prohibited in hash expressions.

> A) I'm not clear on this one as to why affected and target would be different 
> initially, 

BTW, in this email thread, I'm focusing on the `Target Version` management.
That is the only way to detect the community decision change.

Bests,
Dongjoon.

On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:

thanks for bringing this up.

A) I'm not clear on this one as to why affected and target would be different
initially, other than the reasons target versions != fixed versions.  Is the
intention here just to say, if it's already been discussed and came to consensus
not needed in certain release?  The only other obvious time is in spark
releases that are no longer maintained.

I think the other interesting thing here is how exactly to come to agreement on
whether it needs to be fixed in a particular release. Like we have been
discussing on SPARK-29701. This could be a matter of opinion, so should we do
something like mail the dev list whenever one of these issues is tagged if it's
not going to be back ported to an affected release?

Tom

On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun wrote:

Hi, All.

After 2.4.5 RC1 vote failure, I asked your opinions about correctness/dataloss
issues (at mailing lists/JIRAs/PRs) in order to collect the current status and
public opinion widely in the community to build a consensus on this at this
time.

Before talking about those issues, please remember that

    - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and
      3.0.0 is not released.
    - Apache Spark community has the following rule: "Correctness and data
      loss issues should be considered Blockers."

Unfortunately, we didn't build a consensus on what is really blocked by that.
In reality, it was just our resolution for the quality and it works a little
differently.

In this email, I want to talk about correctness/dataloss issues and observed
public opinions. They fall into the following categories roughly.

1. Resolved in both 3.0.0 and 2.4.x
   - ex) SPARK-30447 Constant propagation nullability issue
   - No problem. However, this case sometimes goes to (2)

2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
   - ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match Hive
   - "We don't want to change the behavior in the maintenance release"

3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
   - ex) SPARK-29906 Reading of csv file fails with adaptive execution turned on
   - No problem.

4. Resolved in 3.0.0 and not backported due to technical difficulty.
   - ex) SPARK-26154 Stream-stream joins - left outer join gives inconsistent
     output
   - "This is not backported due to the technical difficulty"

5. Resolved in 3.0.0 and not backported because this is not public API.
   - ex) SPARK-29503 MapObjects doesn't copy Unsafe data when nested under
     Safe data
   - "Since `catalyst` is not public, it's less worth backporting this."

6. Resolved in 3.0.0 and not backported because we forgot, since there was
   no Target Version.
   - ex) SPARK-28375 Make pullupCorrelatedPredicate idempotent
   - "Adding the 'correctness' label so we remember to backport this fix to
     2.4.x."
   - "This is possible, if users add the rule into
     postHocOptimizationBatches"

7. Open with Target Version 3.0.0.
   - ex) SPARK-29701 Correct behaviours of group analytical queries when
     empty input given
   - "We aren't fully SQL compliant there and I think that has been true
     since the beginning of spark sql"
   - "This is not a regression"

8. Open without Target Version.
   - I removed this case last week to give more visibility on them.

Here, I want to focus that Apache Spark is a very healthy community because we 

Re: `Target Version` management on correctness/data-loss Issues

2020-01-27 Thread Dongjoon Hyun
Hi, All.

Currently, there is only one correctness issue which is targeting 2.4.5.

SPARK-28344 Fail the query if detect ambiguous self join
-> Duplicated by
 SPARK-10892 Join with Data Frame returns wrong results
 SPARK-27547 fix DataFrame self-join problems
 SPARK-30218 Columns used in inequality conditions for joins not
resolved correctly in case of common lineage

As I sent yesterday, we revisited the correctness/dataloss issues and
reinitiated further discussion. Also, the use of `Target Version` was
proposed. So, please set `Target Version` explicitly if you think there is
any other correctness/dataloss issue which is blocking 2.4.5 RC2.
Otherwise, it's very hard for the release manager to notice it in the
haystacks of JIRA comments and PR comments.
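
To illustrate the kind of filter a release manager could apply once
`Target Version` is set explicitly, here is a minimal Python sketch. It
assumes issues have already been exported as plain dictionaries; the field
names (`key`, `status`, `labels`, `target_versions`) and the `EXAMPLE-N`
keys are hypothetical and do not reflect the actual JIRA schema or real
Spark tickets.

```python
from typing import Dict, List

# Labels named in this thread for correctness/data-loss issues.
CORRECTNESS_LABELS = {"correctness", "data-loss"}


def is_open_correctness_issue(issue: Dict) -> bool:
    """True if the issue is open and carries a correctness/data-loss label."""
    return issue["status"] == "Open" and bool(
        CORRECTNESS_LABELS & set(issue["labels"])
    )


def blockers_for(issues: List[Dict], release: str) -> List[str]:
    """Keys of open correctness issues explicitly targeted at `release`."""
    return [
        issue["key"]
        for issue in issues
        if is_open_correctness_issue(issue)
        and release in issue["target_versions"]
    ]


def untargeted(issues: List[Dict]) -> List[str]:
    """Keys of open correctness issues with no Target Version at all."""
    return [
        issue["key"]
        for issue in issues
        if is_open_correctness_issue(issue) and not issue["target_versions"]
    ]


# Hypothetical sample data, for illustration only.
issues = [
    {"key": "EXAMPLE-1", "status": "Open", "labels": ["correctness"],
     "target_versions": ["2.4.5", "3.0.0"]},
    {"key": "EXAMPLE-2", "status": "Open", "labels": ["correctness"],
     "target_versions": ["3.0.0"]},
    {"key": "EXAMPLE-3", "status": "Open", "labels": ["data-loss"],
     "target_versions": []},
    {"key": "EXAMPLE-4", "status": "Resolved", "labels": ["correctness"],
     "target_versions": ["2.4.5"]},
]

print(blockers_for(issues, "2.4.5"))  # ['EXAMPLE-1']
print(untargeted(issues))             # ['EXAMPLE-3']
```

With an explicit `Target Version`, the blocker list is a simple lookup;
without it, an issue like the hypothetical `EXAMPLE-3` can only be found by
reading comments.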

Bests,
Dongjoon.


On Mon, Jan 27, 2020 at 12:30 PM Dongjoon Hyun wrote:

> Yes. That is what I pointed in `Unfortunately, we didn't build a consensus
> on what is really blocked by that.` If you are suggesting a vote, do you
> mean a majority-win vote or an unanimous decision? Will it be a permanent
> decision?
>
> > I think the other interesting thing here is how exactly to come to
> agreement on whether it needs to be fixed in a particular release. Like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if its not going to be back ported to an affected release?
>
> The following seems to happen when the committers initially think like
> "Seems behavioral to me and its been consistent so seems ok to skip for
> 2.4.5"
> For example, SPARK-27619 MapType should be prohibited in hash expressions.
>
> > A) I'm not clear on this one as to why affected and target would be
> different initially,
>
> BTW, in this email thread, I'm focusing on the `Target Version` management.
> That is the only way to detect the community decision change.
>
> Bests,
> Dongjoon.
>
> On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:
>
>> thanks for bringing this up.
>>
>> A) I'm not clear on this one as to why affected and target would be
>> different initially, other then the reasons target versions != fixed
>> versions.  Is the intention here just to say, if its already been discussed
>> and came to consensus not needed in certain release?  The only other
>> obvious time is in spark releases that are no longer maintained.
>>
>> I think the other interesting thing here is how exactly to come to
>> agreement on whether it needs to be fixed in a particular release. Like we
>> have been discussing on SPARK-29701. This could be a matter of opinion, so
>> should we do something like mail the dev list whenever one of these issues
>> is tagged if its not going to be back ported to an affected release?
>>
>> Tom
>> On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>
>> Hi, All.
>>
>> After 2.4.5 RC1 vote failure, I asked your opinions about
>> correctness/dataloss issues (at mailing lists/JIRAs/PRs) in order to
>> collect the current status and public opinion widely in the community to
>> build a consensus on this at this time.
>>
>> Before talking about those issues, please remind that
>>
>> - Apache Spark 2.4.x is the only live version because 2.3.x is EOL
>> and 3.0.0 is not released.
>> - Apache Spark community has the following rule: "Correctness and
>> data loss issues should be considered Blockers."
>>
>> Unfortunately, we didn't build a consensus on what is really blocked by
>> that. In reality, it was just our resolution for the quality and it works a
>> little differently.
>>
>> In this email, I want to talk about correctness/dataloss issues and
>> observed public opinions. They fall into the following categories roughly.
>>
>> 1. Resolved in both 3.0.0 and 2.4.x
>>    - ex) SPARK-30447 Constant propagation nullability issue
>>    - No problem. However, this case sometimes goes to (2)
>>
>> 2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
>>    - ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match
>>      Hive
>>    - "We don't want to change the behavior in the maintenance release"
>>
>> 3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
>>    - ex) SPARK-29906 Reading of csv file fails with adaptive execution
>>      turned on
>>    - No problem.
>>
>> 4. Resolved in 3.0.0 and not backported due to technical difficulty.
>>    - ex) SPARK-26154 Stream-stream joins - left outer join gives
>>      inconsistent output
>>    - "This is not backported due to the technical difficulty"
>>
>> 5. Resolved in 3.0.0 and not backported because this is not public API.
>>    - ex) SPARK-29503 MapObjects doesn't copy Unsafe data when nested
>>      under Safe data
>>    - "Since `catalyst` is not public, it's less worth backporting this."
>>
>> 6. Resolved in 3.0.0 and not backported because we forget since there was
>> a no Target 

Re: `Target Version` management on correctness/data-loss Issues

2020-01-27 Thread Dongjoon Hyun
Yes. That is what I pointed out in `Unfortunately, we didn't build a consensus
on what is really blocked by that.` If you are suggesting a vote, do you
mean a majority-win vote or a unanimous decision? Will it be a permanent
decision?

> I think the other interesting thing here is how exactly to come to
agreement on whether it needs to be fixed in a particular release. Like we
have been discussing on SPARK-29701. This could be a matter of opinion, so
should we do something like mail the dev list whenever one of these issues
is tagged if its not going to be back ported to an affected release?

The following seems to happen when the committers initially think like
"Seems behavioral to me and its been consistent so seems ok to skip for
2.4.5"
For example, SPARK-27619 MapType should be prohibited in hash expressions.

> A) I'm not clear on this one as to why affected and target would be
different initially,

BTW, in this email thread, I'm focusing on the `Target Version` management.
That is the only way to detect the community decision change.

Bests,
Dongjoon.

On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:

> thanks for bringing this up.
>
> A) I'm not clear on this one as to why affected and target would be
> different initially, other then the reasons target versions != fixed
> versions.  Is the intention here just to say, if its already been discussed
> and came to consensus not needed in certain release?  The only other
> obvious time is in spark releases that are no longer maintained.
>
> I think the other interesting thing here is how exactly to come to
> agreement on whether it needs to be fixed in a particular release. Like we
> have been discussing on SPARK-29701. This could be a matter of opinion, so
> should we do something like mail the dev list whenever one of these issues
> is tagged if its not going to be back ported to an affected release?
>
> Tom
> On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
> After 2.4.5 RC1 vote failure, I asked your opinions about
> correctness/dataloss issues (at mailing lists/JIRAs/PRs) in order to
> collect the current status and public opinion widely in the community to
> build a consensus on this at this time.
>
> Before talking about those issues, please remind that
>
> - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and
> 3.0.0 is not released.
> - Apache Spark community has the following rule: "Correctness and data
> loss issues should be considered Blockers."
>
> Unfortunately, we didn't build a consensus on what is really blocked by
> that. In reality, it was just our resolution for the quality and it works a
> little differently.
>
> In this email, I want to talk about correctness/dataloss issues and
> observed public opinions. They fall into the following categories roughly.
>
> 1. Resolved in both 3.0.0 and 2.4.x
>    - ex) SPARK-30447 Constant propagation nullability issue
>    - No problem. However, this case sometimes goes to (2)
>
> 2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
>    - ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match
>      Hive
>    - "We don't want to change the behavior in the maintenance release"
>
> 3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
>    - ex) SPARK-29906 Reading of csv file fails with adaptive execution
>      turned on
>    - No problem.
>
> 4. Resolved in 3.0.0 and not backported due to technical difficulty.
>    - ex) SPARK-26154 Stream-stream joins - left outer join gives
>      inconsistent output
>    - "This is not backported due to the technical difficulty"
>
> 5. Resolved in 3.0.0 and not backported because this is not public API.
>    - ex) SPARK-29503 MapObjects doesn't copy Unsafe data when nested under
>      Safe data
>    - "Since `catalyst` is not public, it's less worth backporting this."
>
> 6. Resolved in 3.0.0 and not backported because we forgot, since there was
>    no Target Version.
>    - ex) SPARK-28375 Make pullupCorrelatedPredicate idempotent
>    - "Adding the 'correctness' label so we remember to backport this fix
>      to 2.4.x."
>    - "This is possible, if users add the rule into
>      postHocOptimizationBatches"
>
> 7. Open with Target Version 3.0.0.
>    - ex) SPARK-29701 Correct behaviours of group analytical queries when
>      empty input given
>    - "We aren't fully SQL compliant there and I think that has been true
>      since the beginning of spark sql"
>    - "This is not a regression"
>
> 8. Open without Target Version.
>    - I removed this case last week to give more visibility on them.
>
> Here, I want to focus that Apache Spark is a very healthy community
> because we have diverse opinions and reevaluating JIRA issues are the
> results of the community decision based on the discusson. I believe that it
> will go well eventually. In the above, I added those example JIRA IDs and
> the collected reasons just to give some colors to illustrate all cases 

Re: `Target Version` management on correctness/data-loss Issues

2020-01-27 Thread Tom Graves
thanks for bringing this up.

A) I'm not clear on this one as to why affected and target would be different
initially, other than the reasons target versions != fixed versions.  Is the
intention here just to say, if it's already been discussed and came to consensus
not needed in certain release?  The only other obvious time is in spark
releases that are no longer maintained.

I think the other interesting thing here is how exactly to come to agreement on
whether it needs to be fixed in a particular release. Like we have been
discussing on SPARK-29701. This could be a matter of opinion, so should we do
something like mail the dev list whenever one of these issues is tagged if it's
not going to be back ported to an affected release?

Tom

On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun wrote:

Hi, All.

After 2.4.5 RC1 vote failure, I asked your opinions about correctness/dataloss
issues (at mailing lists/JIRAs/PRs) in order to collect the current status and
public opinion widely in the community to build a consensus on this at this
time.

Before talking about those issues, please remember that

    - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and
      3.0.0 is not released.
    - Apache Spark community has the following rule: "Correctness and data
      loss issues should be considered Blockers."

Unfortunately, we didn't build a consensus on what is really blocked by that.
In reality, it was just our resolution for the quality and it works a little
differently.

In this email, I want to talk about correctness/dataloss issues and observed
public opinions. They fall into the following categories roughly.

1. Resolved in both 3.0.0 and 2.4.x
   - ex) SPARK-30447 Constant propagation nullability issue
   - No problem. However, this case sometimes goes to (2)

2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.
   - ex) SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match Hive
   - "We don't want to change the behavior in the maintenance release"

3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.
   - ex) SPARK-29906 Reading of csv file fails with adaptive execution turned on
   - No problem.

4. Resolved in 3.0.0 and not backported due to technical difficulty.
   - ex) SPARK-26154 Stream-stream joins - left outer join gives inconsistent
     output
   - "This is not backported due to the technical difficulty"

5. Resolved in 3.0.0 and not backported because this is not public API.
   - ex) SPARK-29503 MapObjects doesn't copy Unsafe data when nested under
     Safe data
   - "Since `catalyst` is not public, it's less worth backporting this."

6. Resolved in 3.0.0 and not backported because we forgot, since there was
   no Target Version.
   - ex) SPARK-28375 Make pullupCorrelatedPredicate idempotent
   - "Adding the 'correctness' label so we remember to backport this fix to
     2.4.x."
   - "This is possible, if users add the rule into
     postHocOptimizationBatches"

7. Open with Target Version 3.0.0.
   - ex) SPARK-29701 Correct behaviours of group analytical queries when
     empty input given
   - "We aren't fully SQL compliant there and I think that has been true
     since the beginning of spark sql"
   - "This is not a regression"

8. Open without Target Version.
   - I removed this case last week to give more visibility on them.

Here, I want to emphasize that Apache Spark is a very healthy community because
we have diverse opinions, and reevaluating JIRA issues is the result of
community decisions based on discussion. I believe that it will go well
eventually. In the above, I added those example JIRA IDs and the collected
reasons just to give some color and to illustrate that all of these are real
cases. No case above is to be blamed.

Although some JIRA issues will jump from one category into another from time
to time, the categories will remain. I want to propose a small piece of
additional work on `Target Version` to distinguish the above categories easily
and to communicate clearly in the community. This should be done by committers
because we have the following policy on `Target Version`.

    "Target Version. This is assigned by committers to indicate a PR has been
    accepted for possible fix by the target version."

Proposed Idea:

    A. To reduce the mismatch between `Target Version` vs `Affected Version`:
       When a committer sets the `correctness` or `data-loss` label, `Target
       Version` should be set together according to the `Affected Versions`.
       In case of an insufficient `Target Version` (e.g. `Target
       Version`=`3.0.0` for `Affected Version`=`2.4.4,3.0.0`), he/she needs to
       add a comment on the JIRA.
       For example, "This is 3.0.0-specific issue"

    B. To reduce the mismatch between `Target Version` vs `Fixed Version`:
       When a committer resolves a `correctness` or `data-loss` labeled issue,
       `Target Version` should be compared with `Fixed Version`.
       In case of the insufficient `Fixed Version` (e.g. `Target