Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-22 Thread Hyukjin Kwon
There's another PR open to expose this more publicity in Python side (
https://github.com/apache/spark/pull/27331).

To sum up, I would like to make sure we know these below:
- Is this expression only for partition or do we plan to expose this to
replace other existing expressions as some kind of public DSv2 expression
API?
- Do we want to support other expressions here?
  - If so, why do we need partition-specific expressions?
  - If not, why don't we use a different syntax and class for this API?
- What about we expose a native function to allow transform like a UDF?

Ryan and Wenchen, do you mind if I ask answers for these questions?

2020년 1월 17일 (금) 오전 10:25, Hyukjin Kwon 님이 작성:

> Thanks for giving me some context and clarification, Ryan.
>
> I think I was rather trying to propose to revert because I don't see the
> explicit plan here and it was just left half-done for a long while.
> From reading the PR description and codes, I could not guess in which way
> we should fix this API (e.g., is this expression only for partition or
> replacement of all expressions?). Also, if you take a look at the commit
> log, it has not been fixed for 10 months except moving around or minor
> fixes.
>
> Do you mind if I ask how we plan to extend this feature? For example,
> - if we want to replace existing expressions at the end
> - if we want to add a copy of expressions for some reasons.
> - How will we handle ambiguity of supported expressions between other
> datasource implementations and Spark.
> - If we can't tell other expressions are supported here, why don't we just
> use different syntax to clarify?
>
> If we have this plan or doc, and people can fix accordingly with
> incremental improvements, I am good to keep it.
>
>
> Here are some of followup questions and answers:
>
> > I don't think there is reason to revert this simply because of some of
> the early choices, like deciding to start a public expression API. If you'd
> like to extend this to "fix" areas where you find it confusing, then please
> do.
>
> If it's clear that we should redesign the API, or there is no more plan
> about that API at this moment, I think it can be an option to revert, in
> particular, considering that code freeze is being close. For example, why
> did we try UDF-like way by exposing a function interface only.
>
>
> > The idea was that Spark needs a public expression API anyway for other
> uses
>
> I was wondering why we should we a public expression API in DSv2. Is there
> some places where UDFs can't cover?
> As said above, it requires a duplication of existing expressions is
> required and wonder if this is worthwhile.
> With the stub of Transform API, it looks we want this but I don't know why.
>
>
> > None of this has been confusing or misleading for our users, who caught
> on quickly.
>
> Maybe using simple case wouldn't bring so much confusions if they were
> already told about it.
> However, if we think about the difference and subtleties, I doubt if the
> users already know the answers:
>
> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>
>   - It looks expressions and allowing other expressions / combinations
>   - Since the expressions are handled by DSv2, the behaviours are
> dependent on DSv2 e.g., using *transform* against Datasource
> implementation A and B are different.
>  - Likewise, if Spark supports *transform* here, the behaviour will be
> different.
>
>
> 2020년 1월 17일 (금) 오전 2:36, Ryan Blue 님이 작성:
>
>> Hi everyone,
>>
>> Let me recap some of the discussions that got us to where we are with
>> this today. Hopefully that will provide some clarity.
>>
>> The purpose of partition transforms is to allow source implementations to
>> internally handle partitioning. Right now, users are responsible for this.
>> For example, users will transform timestamps into date strings when writing
>> and other people will provide a filter on those date strings when scanning.
>> This is error-prone: users commonly forget to add partition filters in
>> addition to data filters, if anyone uses the wrong format or transformation
>> queries will silently return incorrect results, etc. But sources can (and
>> should) automatically handle storing and retrieving data internally because
>> it is much easier for users.
>>
>> When we first proposed transforms, I wanted to use Expression. But
>> Reynold rightly pointed out that Expression is an internal API that should
>> not be exposed. So we decided to compromise by building a public
>> expressions API like the public Filter API for the initial purpose of
>> passing transform expressions to sources. The idea was that Spark needs a
>> public expression API anyway for other uses, like requesting a distribution
>> and ordering for a writer. To keep things simple, we chose to build a
>> minimal public expression API and expand it incrementally as we need more
>> features.
>>
>> We also considered whether to parse all expressions and convert only
>> transformatio

Re: Correctness and data loss issues

2020-01-22 Thread Tom Graves
 My thoughts on your list, would be good to get people who worked on these 
issues input. Obviously we can weigh the importance of these vs getting 2.4.5 
out that has a bunch of other correctness fixes you mention as well.  I think 
you have already pinged on most of the jira to get feedback.

 SPARK-30218 Columns used in inequality conditions for joins not resolved 
correctly in case of common lineageYou already linked to SPARK-28344 and asked 
the question about back port
    SPARK-29701 Different answers when empty input given in GROUPING SETSThis 
seems like Postgres compatibility thing again not a correctness issue
    SPARK-29699 Different answers in nested aggregates with window 
functionsThis seems like Postgres compatibility thing again not a correctness 
issue
    SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe This is 
currently listed as an improvement and I can see an argument user has to 
explicitly do this in separate threads so seems less critical to me though 
definitely nice to fix. personally think its ok to not have in 2.4.5
    SPARK-28125 dataframes created by randomSplit have overlapping rowsSeems 
like something we should fix
    SPARK-28067 Incorrect results in decimal aggregation with whole-stage code 
gen enabledSeems like we should fix
    SPARK-28024 Incorrect numeric values when out of rangeSeems like we could 
skip for 2.4.5 and some overflow exceptions fixed in 3.0
    SPARK-27784 Alias ID reuse can break correctness when substituting foldable 
expressionsWould be good to understand what fixed in 3.0 to see if can back port
    SPARK-27619 MapType should be prohibited in hash expressionsSeems 
behavioral to me and its been consistent so seems ok to skip for 2.4.5
    SPARK-27298 Dataset except operation gives different results(dataset count) 
on Spark 2.3.0 Windows and Spark 2.3.0 Linux environmentSeems to be a windows 
vs linux issue and seems like we should investigate
    SPARK-27282 Spark incorrect results when using UNION with GROUP BY 
clauseSimilar seems to be fixed in spark 3.0 so need to see if we can back port 
if we can find what fixed
    SPARK-27213 Unexpected results when filter is used after distinctNeed to 
try to reproduce on 2.4.X
    SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table 
if schema evolvesSeems like we should investigate further for 2.4.x fix
    SPARK-25150 Joining DataFrames derived from the same source yields 
confusing/incorrect resultsSeems like we should investigate further for 2.4.x 
fix
    SPARK-21774 The rule PromoteStrings cast string to a wrong data typeSeems 
like we should investigate further for 2.4.x fix
    SPARK-19248 Regex_replace works in 1.6 but not in 2.0
Seems wrong but if its been consistent for the entire 2.0 may be ok to skip for 
2.4.x
TomOn Wednesday, January 22, 2020, 11:43:30 AM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, Tom.
Then, along with the following, do you think we need to hold on 2.4.5 release, 
too?
> If it's really a correctness issue we should hold 3.0 for it.
Recently,
    (1) 2.4.4 delivered 9 correctness patches.
    (2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches, too.
        SPARK-29101 CSV datasource returns incorrect .count() from file with 
malformed records
        SPARK-30447 Constant propagation nullability issue
        SPARK-29708 Different answers in aggregates of duplicate grouping sets
        SPARK-29651 Incorrect parsing of interval seconds fraction
        SPARK-29918 RecordBinaryComparator should check endianness when 
compared by long
        SPARK-29042 Sampling-based RDD with unordered input should be 
INDETERMINATE
        SPARK-30082 Zeros are being treated as NaNs
        SPARK-29743 sample should set needCopyResult to true if its child is
        SPARK-26985 Test "access only some column of the all of columns " fails 
on big endian

Without the official Apache Spark 2.4.5 binaries,there is no official way to 
deliver the 9 correctness fixes in (2) to the users.
In addition, usually, the correctness fixes are independent to each other.
Bests,
Dongjoon.

On Wed, Jan 22, 2020 at 7:02 AM Tom Graves  wrote:

 I agree, I think we just need to go through all of them and individual assess 
each one. If it's really a correctness issue we should hold 3.0 for it.
On the 2.4 release I didn't see an explanation on  
https://issues.apache.org/jira/browse/SPARK-26154 why it can't be back ported, 
I think in the very least we need that in each jira comment.
spark-29701 looks more like compatibility with Postgres then a purely wrong 
answer to me, if Spark has been consistent about that it feels like it can wait 
for 3.0 but would be good to get others input and I'm not an expert on SQL 
standard and what do the other sql engines do in this case.
Tom
On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
According to our policy, "Correctness and data loss issues should be considered 
Blockers".

    

Re: Correctness and data loss issues

2020-01-22 Thread Dongjoon Hyun
Hi, All.

BTW, based on the AS-IS feedbacks,
I updated all open `correctness` and `dataloss` issues like the followings.

1. Raised the issue priority into `Blocker`.
2. Set the target version to `3.0.0`.

It's a time to give more visibility to those issues in order to close or
resolve.

The remaining things are the followings:

1. Revisit `3.0.0`-only correctness patches?
2. Set the target version to `2.4.5`? (Specifically, is this feasible
in terms of timeline?)

Bests,
Dongjoon.


On Wed, Jan 22, 2020 at 9:43 AM Dongjoon Hyun 
wrote:

> Hi, Tom.
>
> Then, along with the following, do you think we need to hold on 2.4.5
> release, too?
>
> > If it's really a correctness issue we should hold 3.0 for it.
>
> Recently,
>
> (1) 2.4.4 delivered 9 correctness patches.
> (2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches,
> too.
>
> SPARK-29101 CSV datasource returns incorrect .count() from file
> with malformed records
> SPARK-30447 Constant propagation nullability issue
> SPARK-29708 Different answers in aggregates of duplicate grouping
> sets
> SPARK-29651 Incorrect parsing of interval seconds fraction
> SPARK-29918 RecordBinaryComparator should check endianness when
> compared by long
> SPARK-29042 Sampling-based RDD with unordered input should be
> INDETERMINATE
> SPARK-30082 Zeros are being treated as NaNs
> SPARK-29743 sample should set needCopyResult to true if its child
> is
> SPARK-26985 Test "access only some column of the all of columns "
> fails on big endian
>
> Without the official Apache Spark 2.4.5 binaries,
> there is no official way to deliver the 9 correctness fixes in (2) to the
> users.
> In addition, usually, the correctness fixes are independent to each other.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jan 22, 2020 at 7:02 AM Tom Graves  wrote:
>
>> I agree, I think we just need to go through all of them and individual
>> assess each one. If it's really a correctness issue we should hold 3.0 for
>> it.
>>
>> On the 2.4 release I didn't see an explanation on
>> https://issues.apache.org/jira/browse/SPARK-26154 why it can't be back
>> ported, I think in the very least we need that in each jira comment.
>>
>> spark-29701 looks more like compatibility with Postgres then a purely
>> wrong answer to me, if Spark has been consistent about that it feels like
>> it can wait for 3.0 but would be good to get others input and I'm not an
>> expert on SQL standard and what do the other sql engines do in this case.
>>
>> Tom
>>
>> On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>
>> Hi, All.
>>
>> According to our policy, "Correctness and data loss issues should be
>> considered Blockers".
>>
>> - http://spark.apache.org/contributing.html
>>
>> Since we are close to branch-3.0 cut,
>> I want to ask your opinions on the following correctness and data loss
>> issues.
>>
>> SPARK-30218 Columns used in inequality conditions for joins not
>> resolved correctly in case of common lineage
>> SPARK-29701 Different answers when empty input given in GROUPING SETS
>> SPARK-29699 Different answers in nested aggregates with window
>> functions
>> SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
>> SPARK-28125 dataframes created by randomSplit have overlapping rows
>> SPARK-28067 Incorrect results in decimal aggregation with whole-stage
>> code gen enabled
>> SPARK-28024 Incorrect numeric values when out of range
>> SPARK-27784 Alias ID reuse can break correctness when substituting
>> foldable expressions
>> SPARK-27619 MapType should be prohibited in hash expressions
>> SPARK-27298 Dataset except operation gives different results(dataset
>> count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
>> SPARK-27282 Spark incorrect results when using UNION with GROUP BY
>> clause
>> SPARK-27213 Unexpected results when filter is used after distinct
>> SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive
>> table if schema evolves
>> SPARK-25150 Joining DataFrames derived from the same source yields
>> confusing/incorrect results
>> SPARK-21774 The rule PromoteStrings cast string to a wrong data type
>> SPARK-19248 Regex_replace works in 1.6 but not in 2.0
>>
>> Some of them are targeted on 3.0.0, but the others are not.
>> Although we will work on them until 3.0.0,
>> I'm not sure we can reach a status with no known correctness and data
>> loss issue.
>>
>> How do you think about the above issues?
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Correctness and data loss issues

2020-01-22 Thread Dongjoon Hyun
Hi, Tom.

Then, along with the following, do you think we need to hold on 2.4.5
release, too?

> If it's really a correctness issue we should hold 3.0 for it.

Recently,

(1) 2.4.4 delivered 9 correctness patches.
(2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches, too.

SPARK-29101 CSV datasource returns incorrect .count() from file
with malformed records
SPARK-30447 Constant propagation nullability issue
SPARK-29708 Different answers in aggregates of duplicate grouping
sets
SPARK-29651 Incorrect parsing of interval seconds fraction
SPARK-29918 RecordBinaryComparator should check endianness when
compared by long
SPARK-29042 Sampling-based RDD with unordered input should be
INDETERMINATE
SPARK-30082 Zeros are being treated as NaNs
SPARK-29743 sample should set needCopyResult to true if its child is
SPARK-26985 Test "access only some column of the all of columns "
fails on big endian

Without the official Apache Spark 2.4.5 binaries,
there is no official way to deliver the 9 correctness fixes in (2) to the
users.
In addition, usually, the correctness fixes are independent to each other.

Bests,
Dongjoon.


On Wed, Jan 22, 2020 at 7:02 AM Tom Graves  wrote:

> I agree, I think we just need to go through all of them and individual
> assess each one. If it's really a correctness issue we should hold 3.0 for
> it.
>
> On the 2.4 release I didn't see an explanation on
> https://issues.apache.org/jira/browse/SPARK-26154 why it can't be back
> ported, I think in the very least we need that in each jira comment.
>
> spark-29701 looks more like compatibility with Postgres then a purely
> wrong answer to me, if Spark has been consistent about that it feels like
> it can wait for 3.0 but would be good to get others input and I'm not an
> expert on SQL standard and what do the other sql engines do in this case.
>
> Tom
>
> On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
> According to our policy, "Correctness and data loss issues should be
> considered Blockers".
>
> - http://spark.apache.org/contributing.html
>
> Since we are close to branch-3.0 cut,
> I want to ask your opinions on the following correctness and data loss
> issues.
>
> SPARK-30218 Columns used in inequality conditions for joins not
> resolved correctly in case of common lineage
> SPARK-29701 Different answers when empty input given in GROUPING SETS
> SPARK-29699 Different answers in nested aggregates with window
> functions
> SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
> SPARK-28125 dataframes created by randomSplit have overlapping rows
> SPARK-28067 Incorrect results in decimal aggregation with whole-stage
> code gen enabled
> SPARK-28024 Incorrect numeric values when out of range
> SPARK-27784 Alias ID reuse can break correctness when substituting
> foldable expressions
> SPARK-27619 MapType should be prohibited in hash expressions
> SPARK-27298 Dataset except operation gives different results(dataset
> count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
> SPARK-27282 Spark incorrect results when using UNION with GROUP BY
> clause
> SPARK-27213 Unexpected results when filter is used after distinct
> SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive
> table if schema evolves
> SPARK-25150 Joining DataFrames derived from the same source yields
> confusing/incorrect results
> SPARK-21774 The rule PromoteStrings cast string to a wrong data type
> SPARK-19248 Regex_replace works in 1.6 but not in 2.0
>
> Some of them are targeted on 3.0.0, but the others are not.
> Although we will work on them until 3.0.0,
> I'm not sure we can reach a status with no known correctness and data loss
> issue.
>
> How do you think about the above issues?
>
> Bests,
> Dongjoon.
>


Re: Adding Maven Central mirror from Google to the build?

2020-01-22 Thread Tom Graves
 +1 for proposal.
Tom
On Tuesday, January 21, 2020, 04:37:04 PM CST, Sean Owen  
wrote:  
 
 See https://github.com/apache/spark/pull/27307 for some context. We've
had to add, in at least one place, some settings to resolve artifacts
from a mirror besides Maven Central to work around some build
problems.

Now, we find it might be simpler to just use this mirror as the
primary repo in the build, falling back to Central if needed.

The question is: any objections to that?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

  

Re: Correctness and data loss issues

2020-01-22 Thread Tom Graves
 I agree, I think we just need to go through all of them and individual assess 
each one. If it's really a correctness issue we should hold 3.0 for it.
On the 2.4 release I didn't see an explanation on  
https://issues.apache.org/jira/browse/SPARK-26154 why it can't be back ported, 
I think in the very least we need that in each jira comment.
spark-29701 looks more like compatibility with Postgres then a purely wrong 
answer to me, if Spark has been consistent about that it feels like it can wait 
for 3.0 but would be good to get others input and I'm not an expert on SQL 
standard and what do the other sql engines do in this case.
Tom
On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
According to our policy, "Correctness and data loss issues should be considered 
Blockers".

    - http://spark.apache.org/contributing.html
Since we are close to branch-3.0 cut,
I want to ask your opinions on the following correctness and data loss issues.

    SPARK-30218 Columns used in inequality conditions for joins not resolved 
correctly in case of common lineage
    SPARK-29701 Different answers when empty input given in GROUPING SETS
    SPARK-29699 Different answers in nested aggregates with window functions
    SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
    SPARK-28125 dataframes created by randomSplit have overlapping rows
    SPARK-28067 Incorrect results in decimal aggregation with whole-stage code 
gen enabled
    SPARK-28024 Incorrect numeric values when out of range
    SPARK-27784 Alias ID reuse can break correctness when substituting foldable 
expressions
    SPARK-27619 MapType should be prohibited in hash expressions
    SPARK-27298 Dataset except operation gives different results(dataset count) 
on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
    SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
    SPARK-27213 Unexpected results when filter is used after distinct
    SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table 
if schema evolves
    SPARK-25150 Joining DataFrames derived from the same source yields 
confusing/incorrect results
    SPARK-21774 The rule PromoteStrings cast string to a wrong data type
    SPARK-19248 Regex_replace works in 1.6 but not in 2.0

Some of them are targeted on 3.0.0, but the others are not.
Although we will work on them until 3.0.0,I'm not sure we can reach a status 
with no known correctness and data loss issue.
How do you think about the above issues?
Bests,Dongjoon.  

Unsubscribe

2020-01-22 Thread sadhana avasarala
 

 

From: Dongjoon Hyun 
Date: Wednesday, January 22, 2020 at 1:57 AM
To: Wenchen Fan 
Cc: dev 
Subject: Re: Correctness and data loss issues

 

Thank you for checking, Wenchen! Sure, we need to do that.

 

Another question is "What can we do for 2.4.5 release"?
Some of the fixes cannot be backported due to the technical difficulty like the 
followings.

1. https://issues.apache.org/jira/browse/SPARK-26154
Stream-stream joins - left outer join gives inconsistent output

(Like this, there are eight correctness fixes which lands only at 3.0.0)


2. https://github.com/apache/spark/pull/27233
[SPARK-29701][SQL] Correct behaviours of group analytical queries when 
empty input given
(This is on-going PR which is currently blocking 2.4.5 RC2).

Bests,
Dongjoon.

 

On Tue, Jan 21, 2020 at 11:10 PM Wenchen Fan  wrote:

I think we need to go through them during the 3.0 QA period, and try to fix the 
valid ones.

 

For example, the first ticket should be fixed already in 
https://issues.apache.org/jira/browse/SPARK-28344

 

On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun  wrote:

Hi, All.

 

According to our policy, "Correctness and data loss issues should be considered 
Blockers".

- http://spark.apache.org/contributing.html


Since we are close to branch-3.0 cut,
I want to ask your opinions on the following correctness and data loss issues.

SPARK-30218 Columns used in inequality conditions for joins not resolved 
correctly in case of common lineage
SPARK-29701 Different answers when empty input given in GROUPING SETS
SPARK-29699 Different answers in nested aggregates with window functions
SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
SPARK-28125 dataframes created by randomSplit have overlapping rows
SPARK-28067 Incorrect results in decimal aggregation with whole-stage code 
gen enabled
SPARK-28024 Incorrect numeric values when out of range
SPARK-27784 Alias ID reuse can break correctness when substituting foldable 
expressions
SPARK-27619 MapType should be prohibited in hash expressions
SPARK-27298 Dataset except operation gives different results(dataset count) 
on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
SPARK-27213 Unexpected results when filter is used after distinct
SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table 
if schema evolves
SPARK-25150 Joining DataFrames derived from the same source yields 
confusing/incorrect results
SPARK-21774 The rule PromoteStrings cast string to a wrong data type
SPARK-19248 Regex_replace works in 1.6 but not in 2.0

Some of them are targeted on 3.0.0, but the others are not.
Although we will work on them until 3.0.0,

I'm not sure we can reach a status with no known correctness and data loss 
issue.

 

How do you think about the above issues?

 

Bests,

Dongjoon.



[Dataset API] SPARK-27249

2020-01-22 Thread Nick Afshartous



Hello,

I'm looking into starting work on this ticket

  https://issues.apache.org/jira/browse/SPARK-27249

which involves adding an API for transforming Datasets.  In the comments 
I have a question about whether or not this ticket is still necessary.


Could someone please review and advise.

Cheers,
--
   Nick

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org