Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Reynold Xin
+1! Long overdue for a preview release.

On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> 
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to make calls around this.
> 
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzhuge@apache.org> wrote:
> 
> 
>> +1  Like the idea as a user and a DSv2 contributor.
>> 
>> 
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabhwan@gmail.com> wrote:
>> 
>> 
>>> +1 (as a contributor) from me to have preview release on Spark 3 as it
>>> would help to test the feature. When to cut preview release is
>>> questionable, as major works are ideally to be done before that - if we
>>> are intended to introduce new features before official release, that
>>> should work regardless of this, but if we are intended to have opportunity
>>> to test earlier, ideally it should.
>>> 
>>> 
>>> As a one of contributors in structured streaming area, I'd like to add
>>> some items for Spark 3.0, both "must be done" and "better to have". For
>>> "better to have", I pick some items for new features which committers
>>> reviewed couple of rounds and dropped off without soft-reject (No valid
>>> reason to stop). For Spark 2.4 users, only added feature for structured
>>> streaming is Kafka delegation token. (given we assume revising Kafka
>>> consumer pool as improvement) I hope we provide some gifts for structured
>>> streaming users in Spark 3.0 envelope.
>>> 
>>> 
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>>> output
>>> 
>>> It's a correctness issue with multiple users reported, being reported at
>>> Nov. 2018. There's a way to reproduce it consistently, and we have a patch
>>> submitted at Jan. 2019 to fix it.
>>> 
>>> 
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>>> start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>> 
>>> 
>>> There're some more new features/improvements items in SS, but given we're
>>> talking about ramping-down, above list might be realistic one.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jgp@jgp.net> wrote:
>>> 
>>> 
 As a user/non committer, +1
 
 
 I love the idea of an early 3.0.0 so we can test current dev against it, I
 know the final 3.x will probably need another round of testing when it
 gets out, but less for sure... I know I could checkout and compile, but
 having a “packaged” preversion is great if it does not take too much time
 to the team...
 
 jg
 
 
 
 On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls223@gmail.com> wrote:
 
 
 
> +1 from me too but I would like to know what other people think too.
> 
> 
> On Thu, Sep 12, 2019 at 9:07 AM Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
> 
> 
>> Thank you, Sean.
>> 
>> 
>> I'm also +1 for the following three.
>> 
>> 
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>> 
>> 
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps 
>> it
>> a lot.
>> 
>> 
>> After this discussion, can we have some timeline for `Spark 3.0 Release
>> Window` in our versioning-policy page?
>> 
>> 
>> - https://spark.apache.org/versioning-policy.html
>> 
>> 
>> Bests,
>> Dongjoon.
>> 
>> 
>> 
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heuermh@gmail.com> wrote:
>> 
>> 
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility 
>>> problems
>>> resolved, e.g.
>>> 
>>> 
>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> https://issues.apache.org/jira/browse/SPARK-27781
>>> 
>>> 
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far 
>>> as
>>> I know, Parquet has not cut a release based on this new version.
>>> 
>>> 
>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>> 
>>> 
>>> https://github.com/apache/spark/pull/24851
>>> https://github.com/apache/spark/pull/24297

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Holden Karau
I like the idea from the PoV of giving folks something to start testing
against and exploring so they can raise issues with us earlier in the
process and we have more time to make calls around this.

On Thu, Sep 12, 2019 at 4:15 PM John Zhuge  wrote:

> +1  Like the idea as a user and a DSv2 contributor.
>
> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim  wrote:
>
>> +1 (as a contributor) from me to have preview release on Spark 3 as it
>> would help to test the feature. When to cut preview release is
>> questionable, as major works are ideally to be done before that - if we are
>> intended to introduce new features before official release, that should
>> work regardless of this, but if we are intended to have opportunity to test
>> earlier, ideally it should.
>>
>> As a one of contributors in structured streaming area, I'd like to add
>> some items for Spark 3.0, both "must be done" and "better to have". For
>> "better to have", I pick some items for new features which committers
>> reviewed couple of rounds and dropped off without soft-reject (No valid
>> reason to stop). For Spark 2.4 users, only added feature for structured
>> streaming is Kafka delegation token. (given we assume revising Kafka
>> consumer pool as improvement) I hope we provide some gifts for structured
>> streaming users in Spark 3.0 envelope.
>>
>> > must be done
>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>> output
>> It's a correctness issue with multiple users reported, being reported at
>> Nov. 2018. There's a way to reproduce it consistently, and we have a patch
>> submitted at Jan. 2019 to fix it.
>>
>> > better to have
>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>> start and end offset
>> * SPARK-20568 Delete files after processing in structured streaming
>>
>> There're some more new features/improvements items in SS, but given we're
>> talking about ramping-down, above list might be realistic one.
>>
>>
>>
>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin  wrote:
>>
>>> As a user/non committer, +1
>>>
>>> I love the idea of an early 3.0.0 so we can test current dev against it,
>>> I know the final 3.x will probably need another round of testing when it
>>> gets out, but less for sure... I know I could checkout and compile, but
>>> having a “packaged” preversion is great if it does not take too much time
>>> to the team...
>>>
>>> jg
>>>
>>>
>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:
>>>
>>> +1 from me too but I would like to know what other people think too.
>>>
>>> On Thu, Sep 12, 2019 at 9:07 AM Dongjoon Hyun wrote:
>>>
 Thank you, Sean.

 I'm also +1 for the following three.

 1. Start to ramp down (by the official branch-3.0 cut)
 2. Apache Spark 3.0.0-preview in 2019
 3. Apache Spark 3.0.0 in early 2020

 For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
 it a lot.

 After this discussion, can we have some timeline for `Spark 3.0 Release
 Window` in our versioning-policy page?

 - https://spark.apache.org/versioning-policy.html

 Bests,
 Dongjoon.


 On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer 
 wrote:

> I would love to see Spark + Hadoop + Parquet + Avro compatibility
> problems resolved, e.g.
>
> https://issues.apache.org/jira/browse/SPARK-25588
> https://issues.apache.org/jira/browse/SPARK-27781
>
> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As
> far as I know, Parquet has not cut a release based on this new version.
>
> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>
> https://github.com/apache/spark/pull/24851
> https://github.com/apache/spark/pull/24297
>
>michael
>
>
> On Sep 11, 2019, at 1:37 PM, Sean Owen  wrote:
>
> I'm curious what current feelings are about ramping down towards a
> Spark 3 release. It feels close to ready. There is no fixed date,
> though in the past we had informally tossed around "back end of 2019".
> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
> due.
>
> What are the few major items that must get done for Spark 3, in your
> opinion? Below are all of the open JIRAs for 3.0 (which everyone
> should feel free to update with things that aren't really needed for
> Spark 3; I already triaged some).
>
> For me, it's:
> - DSv2?
> - Finishing touches on the Hive, JDK 11 update
>
> What about considering a preview release earlier, as happened for
> Spark 2, to get feedback much earlier than the RC cycle? Could that
> even happen ... about now?
>
> I'm also wondering what a realistic estimate of Spark 3 release is. My
> guess is quite early 

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Matt Cheah
+1 as both a contributor and a user.

 

From: John Zhuge 
Date: Thursday, September 12, 2019 at 4:15 PM
To: Jungtaek Lim 
Cc: Jean Georges Perrin , Hyukjin Kwon , 
Dongjoon Hyun , dev 
Subject: Re: Thoughts on Spark 3 release, or a preview release

 

+1  Like the idea as a user and a DSv2 contributor.

 

On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim  wrote:

+1 (as a contributor) from me to have preview release on Spark 3 as it would 
help to test the feature. When to cut preview release is questionable, as major 
works are ideally to be done before that - if we are intended to introduce new 
features before official release, that should work regardless of this, but if 
we are intended to have opportunity to test earlier, ideally it should.

 

As a one of contributors in structured streaming area, I'd like to add some 
items for Spark 3.0, both "must be done" and "better to have". For "better to 
have", I pick some items for new features which committers reviewed couple of 
rounds and dropped off without soft-reject (No valid reason to stop). For Spark 
2.4 users, only added feature for structured streaming is Kafka delegation 
token. (given we assume revising Kafka consumer pool as improvement) I hope we 
provide some gifts for structured streaming users in Spark 3.0 envelope.

 

> must be done

* SPARK-26154 Stream-stream joins - left outer join gives inconsistent output

It's a correctness issue with multiple users reported, being reported at Nov. 
2018. There's a way to reproduce it consistently, and we have a patch submitted 
at Jan. 2019 to fix it.

 

> better to have

* SPARK-23539 Add support for Kafka headers in Structured Streaming

* SPARK-26848 Introduce new option to Kafka source - specify timestamp to start 
and end offset

* SPARK-20568 Delete files after processing in structured streaming

 

There're some more new features/improvements items in SS, but given we're 
talking about ramping-down, above list might be realistic one.

 

 

 

On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin  wrote:

As a user/non committer, +1 

 

I love the idea of an early 3.0.0 so we can test current dev against it, I know 
the final 3.x will probably need another round of testing when it gets out, but 
less for sure... I know I could checkout and compile, but having a “packaged” 
preversion is great if it does not take too much time to the team...

 

jg 

 


On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:

+1 from me too but I would like to know what other people think too.

 

On Thu, Sep 12, 2019 at 9:07 AM Dongjoon Hyun wrote:

Thank you, Sean. 

 

I'm also +1 for the following three.

 

1. Start to ramp down (by the official branch-3.0 cut)

2. Apache Spark 3.0.0-preview in 2019

3. Apache Spark 3.0.0 in early 2020

 

For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps it a 
lot.

 

After this discussion, can we have some timeline for `Spark 3.0 Release Window` 
in our versioning-policy page?

 

- https://spark.apache.org/versioning-policy.html

 

Bests,

Dongjoon.

 

 

On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer  wrote:

I would love to see Spark + Hadoop + Parquet + Avro compatibility problems 
resolved, e.g. 

 

https://issues.apache.org/jira/browse/SPARK-25588

https://issues.apache.org/jira/browse/SPARK-27781

 

Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far as I 
know, Parquet has not cut a release based on this new version.

 

Then out of curiosity, are the new Spark Graph APIs targeting 3.0?

 

https://github.com/apache/spark/pull/24851

https://github.com/apache/spark/pull/24297

 

   michael

 



On Sep 11, 2019, at 1:37 PM, Sean Owen  wrote:

 

I'm curious what current feelings are about ramping down towards a
Spark 3 release. It feels close to ready. There is no fixed date,
though in the past we had informally tossed around "back end of 2019".
For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
due.

What are the few major items that must get done for Spark 3, in your
opinion? Below are all of the open JIRAs for 3.0 (which everyone
should feel free to update with things that aren't really needed for
Spark 3; I already triaged some).

For me, it's:
- DSv2?
- Finishing touches on the Hive, JDK 11 update

What about considering a preview release earlier, as happened for
Spark 2, to get feedback much earlier than the RC cycle? Could that
even happen ... about now?

I'm also wondering what a realistic estimate of Spark 3 release is. My
guess is quite early 2020, from here.



SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
SPARK-28588 Build a SQL 

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread John Zhuge
+1  Like the idea as a user and a DSv2 contributor.

On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim  wrote:

> +1 (as a contributor) from me to have preview release on Spark 3 as it
> would help to test the feature. When to cut preview release is
> questionable, as major works are ideally to be done before that - if we are
> intended to introduce new features before official release, that should
> work regardless of this, but if we are intended to have opportunity to test
> earlier, ideally it should.
>
> As a one of contributors in structured streaming area, I'd like to add
> some items for Spark 3.0, both "must be done" and "better to have". For
> "better to have", I pick some items for new features which committers
> reviewed couple of rounds and dropped off without soft-reject (No valid
> reason to stop). For Spark 2.4 users, only added feature for structured
> streaming is Kafka delegation token. (given we assume revising Kafka
> consumer pool as improvement) I hope we provide some gifts for structured
> streaming users in Spark 3.0 envelope.
>
> > must be done
> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
> output
> It's a correctness issue with multiple users reported, being reported at
> Nov. 2018. There's a way to reproduce it consistently, and we have a patch
> submitted at Jan. 2019 to fix it.
>
> > better to have
> * SPARK-23539 Add support for Kafka headers in Structured Streaming
> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
> start and end offset
> * SPARK-20568 Delete files after processing in structured streaming
>
> There're some more new features/improvements items in SS, but given we're
> talking about ramping-down, above list might be realistic one.
>
>
>
> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin  wrote:
>
>> As a user/non committer, +1
>>
>> I love the idea of an early 3.0.0 so we can test current dev against it,
>> I know the final 3.x will probably need another round of testing when it
>> gets out, but less for sure... I know I could checkout and compile, but
>> having a “packaged” preversion is great if it does not take too much time
>> to the team...
>>
>> jg
>>
>>
>> On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:
>>
>> +1 from me too but I would like to know what other people think too.
>>
>> On Thu, Sep 12, 2019 at 9:07 AM Dongjoon Hyun wrote:
>>
>>> Thank you, Sean.
>>>
>>> I'm also +1 for the following three.
>>>
>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>> 2. Apache Spark 3.0.0-preview in 2019
>>> 3. Apache Spark 3.0.0 in early 2020
>>>
>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
>>> it a lot.
>>>
>>> After this discussion, can we have some timeline for `Spark 3.0 Release
>>> Window` in our versioning-policy page?
>>>
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer 
>>> wrote:
>>>
 I would love to see Spark + Hadoop + Parquet + Avro compatibility
 problems resolved, e.g.

 https://issues.apache.org/jira/browse/SPARK-25588
 https://issues.apache.org/jira/browse/SPARK-27781

 Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far
 as I know, Parquet has not cut a release based on this new version.

 Then out of curiosity, are the new Spark Graph APIs targeting 3.0?

 https://github.com/apache/spark/pull/24851
 https://github.com/apache/spark/pull/24297

michael


 On Sep 11, 2019, at 1:37 PM, Sean Owen  wrote:

 I'm curious what current feelings are about ramping down towards a
 Spark 3 release. It feels close to ready. There is no fixed date,
 though in the past we had informally tossed around "back end of 2019".
 For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
 Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
 due.

 What are the few major items that must get done for Spark 3, in your
 opinion? Below are all of the open JIRAs for 3.0 (which everyone
 should feel free to update with things that aren't really needed for
 Spark 3; I already triaged some).

 For me, it's:
 - DSv2?
 - Finishing touches on the Hive, JDK 11 update

 What about considering a preview release earlier, as happened for
 Spark 2, to get feedback much earlier than the RC cycle? Could that
 even happen ... about now?

 I'm also wondering what a realistic estimate of Spark 3 release is. My
 guess is quite early 2020, from here.



 SPARK-29014 DataSourceV2: Clean up current, default, and session
 catalog uses
 SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
 SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
 SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
 SPARK-28588 Build a SQL reference doc
 SPARK-28629 

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Jungtaek Lim
+1 (as a contributor) from me on having a preview release of Spark 3, as it
would help with testing the new features. When to cut the preview release is
the open question, since the major work should ideally land before it - if
the intent is to introduce new features before the official release, that
works regardless of a preview, but if the intent is to give an earlier
opportunity to test, the major work ideally should be in by then.

As one of the contributors in the structured streaming area, I'd like to add
some items for Spark 3.0, both "must be done" and "better to have". For
"better to have", I picked new-feature items which committers reviewed for a
couple of rounds and which then stalled without a soft reject (no valid
reason to stop). For Spark 2.4 users, the only feature added to structured
streaming was Kafka delegation token (counting the Kafka consumer pool rework
as an improvement). I hope we can put some gifts for structured streaming
users in the Spark 3.0 envelope.

> must be done
* SPARK-26154 Stream-stream joins - left outer join gives inconsistent
output
It's a correctness issue reported by multiple users, first reported in Nov.
2018. There's a way to reproduce it consistently, and we have a patch
submitted in Jan. 2019 to fix it (a sketch of the affected query shape
follows below).

> better to have
* SPARK-23539 Add support for Kafka headers in Structured Streaming
* SPARK-26848 Introduce new option to Kafka source - specify timestamp to
start and end offset
* SPARK-20568 Delete files after processing in structured streaming

There are some more new feature/improvement items in SS, but given we're
talking about ramping down, the list above is probably the realistic one.
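
For readers outside SS, here is a minimal Scala sketch of the query shape
SPARK-26154 is about - a watermarked stream-stream left outer join over the
built-in rate source. The column names, rates, watermark delays, and join
bounds are illustrative assumptions, not the actual reproduction attached to
the JIRA:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object StreamStreamLeftOuterJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ss-left-outer-join-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Two illustrative event streams built from the built-in rate source.
    val impressions = spark.readStream.format("rate")
      .option("rowsPerSecond", "5").load()
      .select($"value".as("impressionAdId"), $"timestamp".as("impressionTime"))
      .withWatermark("impressionTime", "10 seconds")

    val clicks = spark.readStream.format("rate")
      .option("rowsPerSecond", "5").load()
      .select(($"value" % 100).as("clickAdId"), $"timestamp".as("clickTime"))
      .withWatermark("clickTime", "10 seconds")

    // Left outer join with the event-time bound that outer stream-stream
    // joins require; unmatched impressions should eventually be emitted
    // with NULL click columns once the watermark passes the bound.
    val joined = impressions.join(
      clicks,
      expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 10 seconds
      """),
      "leftOuter")

    joined.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}

The report is that such a query can give inconsistent output across
micro-batches; the sketch above only illustrates the query shape, not the
reproduction attached to the JIRA.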



On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin  wrote:

> As a user/non committer, +1
>
> I love the idea of an early 3.0.0 so we can test current dev against it, I
> know the final 3.x will probably need another round of testing when it gets
> out, but less for sure... I know I could checkout and compile, but having a
> “packaged” preversion is great if it does not take too much time to the
> team...
>
> jg
>
>
> On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:
>
> +1 from me too but I would like to know what other people think too.
>
> On Thu, Sep 12, 2019 at 9:07 AM Dongjoon Hyun wrote:
>
>> Thank you, Sean.
>>
>> I'm also +1 for the following three.
>>
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>>
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
>> it a lot.
>>
>> After this discussion, can we have some timeline for `Spark 3.0 Release
>> Window` in our versioning-policy page?
>>
>> - https://spark.apache.org/versioning-policy.html
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer  wrote:
>>
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>>> problems resolved, e.g.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far
>>> as I know, Parquet has not cut a release based on this new version.
>>>
>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>>
>>> https://github.com/apache/spark/pull/24851
>>> https://github.com/apache/spark/pull/24297
>>>
>>>michael
>>>
>>>
>>> On Sep 11, 2019, at 1:37 PM, Sean Owen  wrote:
>>>
>>> I'm curious what current feelings are about ramping down towards a
>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>> though in the past we had informally tossed around "back end of 2019".
>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>> due.
>>>
>>> What are the few major items that must get done for Spark 3, in your
>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>> should feel free to update with things that aren't really needed for
>>> Spark 3; I already triaged some).
>>>
>>> For me, it's:
>>> - DSv2?
>>> - Finishing touches on the Hive, JDK 11 update
>>>
>>> What about considering a preview release earlier, as happened for
>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>> even happen ... about now?
>>>
>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>> guess is quite early 2020, from here.
>>>
>>>
>>>
>>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog
>>> uses
>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>> SPARK-28588 Build a SQL reference doc
>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>> SPARK-28684 Hive module support JDK 11
>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>> after some operations
>>> SPARK-28372 Document Spark WEB UI
>>> SPARK-28476 Support ALTER 

Re: Ask for ARM CI for spark

2019-09-12 Thread Sean Owen
I don't know what's involved in actually accepting or operating those
machines, so can't comment there, but in the meantime it's good that you
are running these tests and can help report changes needed to keep it
working with ARM. I would continue with that for now.

On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang 
wrote:

> Hi all,
>
> Regarding the overall workflow of the Spark ARM CI, we want to make two things clear.
>
> The first thing is:
> About Spark ARM CI, we now have two periodic jobs: one job[1] based on
> commit[2] (which already fixes the failed replay tests issue[3]; we made a
> new test branch based on date 09-09-2019), and the other job[4] based on
> Spark master.
>
> The first job tests the specified branch to prove that our ARM CI is good
> and stable.
> The second job checks Spark master every day, so we can find out whether
> the latest commits affect the ARM CI. The build history and results show
> that some problems are easier to find on ARM, like SPARK-28770, and also
> that we put in the effort to trace and figure them out; so far we have
> found and fixed several problems[5][6][7], thanks to everyone in the
> community :). And we believe that ARM CI is very necessary, right?
>
> The second thing is:
> We plan to run the jobs for a period of time, and you can see the results
> and logs in the 'build history' of the jobs console. If everything goes
> well for one or two weeks, could the community accept the ARM CI? Or how
> long should the periodic jobs run before the community has enough
> confidence to accept the ARM CI? As you suggested before, it would be good
> to integrate the ARM CI into the amplab Jenkins; we agree with that, and we
> can donate the ARM instances and then maintain the ARM-related test jobs
> together with the community. Any thoughts?
>
> Thank you all!
>
> [1]
> http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
> [2]
> https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
> [3] https://issues.apache.org/jira/browse/SPARK-28770
> [4]
> http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
> [5] https://github.com/apache/spark/pull/25186
> [6] https://github.com/apache/spark/pull/25279
> [7] https://github.com/apache/spark/pull/25673
>
>
>
> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen  wrote:
>
>> Yes, I think it's just local caching. After you run the build you should
>> find lots of stuff cached at ~/.m2/repository and it won't download every
>> time.
>>
>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo 
>> wrote:
>>
>>> Hi Sean,
>>> Thanks for the reply, and apologies for the confusion.
>>> I know the dependencies will be downloaded by SBT or Maven. But the
>>> Spark QA job also executes "mvn clean package", so why doesn't its
>>> log[1] show any jars being downloaded from Maven Central, and why does
>>> it build so fast? Is the reason that Spark Jenkins builds the Spark jars
>>> on physical machines and doesn't destroy the test env after a job
>>> finishes? Then the next job that builds Spark gets the dependency jars
>>> from the local cache, since the previous jobs ran "mvn package" and those
>>> dependencies were already downloaded on the local worker machine. Am I
>>> right? Is that the reason the job log[1] doesn't print any downloading
>>> information from Maven Central?
>>>
>>> Thank you very much.
>>>
>>> [1]
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>
>>>
>>> Best regards
>>>
>>> ZhaoBo
>>>
>>>
>>> Sean Owen wrote on Fri, Aug 16, 2019 at 10:38 AM:
>>>
 I'm not sure what you mean. The dependencies are downloaded by SBT and
 Maven like in any other project, and nothing about it is specific to Spark.
 The worker machines cache artifacts that are downloaded from these, but
 this is a function of Maven and SBT, not Spark. You may find that the
 initial download takes a long time.

 On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo 
 wrote:

> Hi Sean,
>
> Thanks very much for pointing out the roadmap. ;-). Then I think we
> will continue to focus on our test environment.
>
> For the networking problems, I mean that we can access Maven Central,
> and jobs could download the required jar packages at a high network speed.
> What we want to know is why the Spark QA test jobs[1] log shows that the
> job script/maven build doesn't seem to download the jar packages? Could you
> tell us the reason for that? Thank you. The reason we raise the "networking
> problems" is that we found a phenomenon during our tests: if we 

Re: Welcoming some new committers and PMC members

2019-09-12 Thread Jacek Laskowski
Hi,

What a great news! Congrats to all awarded and the community for voting
them in!

p.s. I think it should go to the user mailing list too.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers and one PMC
> member. Join me in welcoming them to their new roles!
>
> New PMC member: Dongjoon Hyun
>
> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
> Weichen Xu, Ruifeng Zheng
>
> The new committers cover lots of important areas including ML, SQL, and
> data sources, so it’s great to have them here. All the best,
>
> Matei and the Spark PMC
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming some new committers and PMC members

2019-09-12 Thread Furkan KAMACI
Hi,

Congrats!

Kind Regards,
Furkan KAMACI



On Tue, Sep 10, 2019 at 2:48 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Congrats! Well deserved.
>
> On Tue, Sep 10, 2019 at 1:20 PM Driesprong, Fokko 
> wrote:
>
>> Congrats all, well deserved!
>>
>>
>> Cheers, Fokko
>>
>> Op di 10 sep. 2019 om 10:21 schreef Gabor Somogyi <
>> gabor.g.somo...@gmail.com>:
>>
>>> Congrats Guys!
>>>
>>> G
>>>
>>>
>>> On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
>>> wrote:
>>>
 Hi all,

 The Spark PMC recently voted to add several new committers and one PMC
 member. Join me in welcoming them to their new roles!

 New PMC member: Dongjoon Hyun

 New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming
 Wang, Weichen Xu, Ruifeng Zheng

 The new committers cover lots of important areas including ML, SQL, and
 data sources, so it’s great to have them here. All the best,

 Matei and the Spark PMC


 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>
> --
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Gengliang Wang
Thanks for the great suggestions from Ryan, Russell, and Wenchen.
As there are -1 votes from Ryan and Felix, this vote doesn't pass.

As per the SQL standard, data rounding or truncation is allowed when
assigning a value to a numeric/datetime type. So, I think we can discuss
whether data rounding/truncation should be allowed in strict mode, as Spark
doesn't produce invalid null values with data rounding/truncation.
For example, the conversion between `Date` and `Timestamp` should be
allowed as per the SQL standard. For another example, converting `Decimal`
to `Double` should be allowed as well, since in Spark SQL the max precision
of Decimal type is 38, while the range of `Double` is
[-1.7976931348623157E308, 1.7976931348623157E308]. Converting `Decimal` to
`Double` won't cause an overflow in Spark.

After refining the ANSI mode and strict modes, we can vote for the default
table insertion behavior for both V1 and V2.
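
To make the comparison concrete, here is a hedged spark-shell sketch (where
`spark` is the predefined SparkSession). It assumes the policy ends up exposed
as a `spark.sql.storeAssignmentPolicy` configuration with LEGACY / ANSI /
STRICT values, following the SPARK-28885 discussion; the configuration name,
values, and table are illustrative, not a settled API:

// Hedged sketch: spark.sql.storeAssignmentPolicy and its values are an
// assumption about how the proposal gets exposed, not a settled API.
spark.sql("CREATE TABLE target (d DATE, v DOUBLE) USING parquet")

spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

// Timestamp -> Date and Decimal -> Double are assignable under the ANSI
// store assignment rules discussed above, so this insert should pass the
// analysis-time check (with any needed truncation/rounding applied).
spark.sql("""
  INSERT INTO target
  SELECT TIMESTAMP '2019-09-12 10:11:12', CAST(1.23 AS DECIMAL(38, 18))
""")

// String -> Double is not assignable under ANSI, so an insert like the one
// below should be rejected at analysis time instead of silently writing
// NULL as the legacy behavior would:
// spark.sql("INSERT INTO target SELECT DATE '2019-09-12', 'not a number'")

Under a strict policy the first insert would likely require explicit casts
(which is the objection raised against strict mode above), and under the
legacy policy even the second would be accepted; that trade-off is what the
vote is about.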



On Thu, Sep 12, 2019 at 2:09 PM Wenchen Fan  wrote:

> I think it's too risky to enable the "runtime exception" mode by default
> in the next release. We don't even have a spec to describe when Spark would
> throw runtime exceptions. Currently the "runtime exception" mode works for
> overflow, but I believe there are more places that need to be considered
> (e.g. divide by zero).
>
> However, Ryan has a good point that if we use the ANSI store assignment
> policy, we should make sure the table insertion behavior completely follows
> the SQL spec. After reading the related section in the SQL spec, the rule
> is to throw runtime exception for value out of range, which is the overflow
> check we already have in Spark. I think we should enable the overflow
> check during table insertion, when ANSI policy is picked. This should be
> done no matter which policy becomes the default eventually.
>
> On Mon, Sep 9, 2019 at 8:00 AM Felix Cheung 
> wrote:
>
>> I’d prefer strict mode and fail fast (analysis check)
>>
>> Also I like what Alastair suggested about standard clarification.
>>
>> I think we can re-visit this proposal and restart the vote
>>
>> --
>> *From:* Ryan Blue 
>> *Sent:* Friday, September 6, 2019 5:28 PM
>> *To:* Alastair Green
>> *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang
>> *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in
>> table insertion by default
>>
>>
>> We discussed this thread quite a bit in the DSv2 sync up and Russell
>> brought up a really good point about this.
>>
>> The ANSI rule used here specifies how to store a specific value, V, so
>> this is a runtime rule — an earlier case covers when V is NULL, so it is
>> definitely referring to a specific value. The rule requires that if the
>> type doesn’t match or if the value cannot be truncated, an exception is
>> thrown for “numeric value out of range”.
>>
>> That runtime error guarantees that even though the cast is introduced at
>> analysis time, unexpected NULL values aren’t inserted into a table in place
>> of data values that are out of range. Unexpected NULL values are the
>> problem that was concerning to many of us in the discussion thread, but it
>> turns out that real ANSI behavior doesn’t have the problem. (In the sync,
>> we validated this by checking Postgres and MySQL behavior, too.)
>>
>> In Spark, the runtime check is a separate configuration property from
>> this one, but in order to actually implement ANSI semantics, both need to
>> be set. So I think it makes sense to *change both defaults to be ANSI*.
>> The analysis check alone does not implement the ANSI standard.
>>
>> In the sync, we also agreed that it makes sense to be able to turn off
>> the runtime check in order to avoid job failures. Another, safer way to
>> avoid job failures is to require an explicit cast, i.e., strict mode.
>>
>> I think that we should amend this proposal to change the default for both
>> the runtime check and the analysis check to ANSI.
>>
>> As this stands now, I vote -1. But I would support this if the vote were
>> to set both runtime and analysis checks to ANSI mode.
>>
>> rb
>>
>> On Fri, Sep 6, 2019 at 3:12 AM Alastair Green
>>  wrote:
>>
>>> Makes sense.
>>>
>>> While the ISO SQL standard automatically becomes an American national
>>>  (ANSI) standard, changes are only made to the International (ISO/IEC)
>>> Standard, which is the authoritative specification.
>>>
>>> These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2),
>>> section 9.2.
>>>
>>> Could we rename the proposed default to “ISO/IEC (ANSI)”?
>>>
>>> — Alastair
>>>
>>> On Thu, Sep 5, 2019 at 17:17, Reynold Xin  wrote:
>>>
>>> Having three modes is a lot. Why not just use ansi mode as default, and
>>> legacy for backward compatibility? Then over time there's only the ANSI
>>> mode, which is standard compliant and easy to understand. We also don't
>>> need to invent a standard just for Spark.
>>>
>>>
>>> On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan 
>>> wrote:
>>>
 +1

 To be 

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Wenchen Fan
I think it's too risky to enable the "runtime exception" mode by default in
the next release. We don't even have a spec to describe when Spark would
throw runtime exceptions. Currently the "runtime exception" mode works for
overflow, but I believe there are more places that need to be considered
(e.g. divide by zero).

However, Ryan has a good point that if we use the ANSI store assignment
policy, we should make sure the table insertion behavior completely follows
the SQL spec. After reading the related section in the SQL spec, the rule
is to throw runtime exception for value out of range, which is the overflow
check we already have in Spark. I think we should enable the overflow
check during table insertion, when ANSI policy is picked. This should be
done no matter which policy becomes the default eventually.
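
As a concrete illustration of the runtime check being discussed, a hedged
spark-shell sketch (`spark` being the predefined SparkSession). The flag name
`spark.sql.ansi.enabled` is an assumption about how the runtime check ends up
being exposed - the exact configuration was still being settled in this
thread:

// Hedged sketch: spark.sql.ansi.enabled is an assumed flag name for the
// runtime "throw on out-of-range" behavior discussed above.

// Today's default: integer arithmetic silently wraps on overflow.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 2147483647 + 1 AS wrapped").show()  // -2147483648, no error

// With the runtime check enabled, the same expression is expected to fail
// with an ArithmeticException ("value out of range") instead of wrapping.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT 2147483647 + 1").show()          // expected to throw

// The point above: when the ANSI store assignment policy is picked, the same
// out-of-range check should also fire on table insertion, e.g. inserting
// CAST(3000000000 AS BIGINT) into an INT column should fail at runtime
// rather than silently wrapping or writing NULL.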

On Mon, Sep 9, 2019 at 8:00 AM Felix Cheung 
wrote:

> I’d prefer strict mode and fail fast (analysis check)
>
> Also I like what Alastair suggested about standard clarification.
>
> I think we can re-visit this proposal and restart the vote
>
> --
> *From:* Ryan Blue 
> *Sent:* Friday, September 6, 2019 5:28 PM
> *To:* Alastair Green
> *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang
> *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in
> table insertion by default
>
>
> We discussed this thread quite a bit in the DSv2 sync up and Russell
> brought up a really good point about this.
>
> The ANSI rule used here specifies how to store a specific value, V, so
> this is a runtime rule — an earlier case covers when V is NULL, so it is
> definitely referring to a specific value. The rule requires that if the
> type doesn’t match or if the value cannot be truncated, an exception is
> thrown for “numeric value out of range”.
>
> That runtime error guarantees that even though the cast is introduced at
> analysis time, unexpected NULL values aren’t inserted into a table in place
> of data values that are out of range. Unexpected NULL values are the
> problem that was concerning to many of us in the discussion thread, but it
> turns out that real ANSI behavior doesn’t have the problem. (In the sync,
> we validated this by checking Postgres and MySQL behavior, too.)
>
> In Spark, the runtime check is a separate configuration property from this
> one, but in order to actually implement ANSI semantics, both need to be
> set. So I think it makes sense to *change both defaults to be ANSI*. The
> analysis check alone does not implement the ANSI standard.
>
> In the sync, we also agreed that it makes sense to be able to turn off the
> runtime check in order to avoid job failures. Another, safer way to avoid
> job failures is to require an explicit cast, i.e., strict mode.
>
> I think that we should amend this proposal to change the default for both
> the runtime check and the analysis check to ANSI.
>
> As this stands now, I vote -1. But I would support this if the vote were
> to set both runtime and analysis checks to ANSI mode.
>
> rb
>
> On Fri, Sep 6, 2019 at 3:12 AM Alastair Green
>  wrote:
>
>> Makes sense.
>>
>> While the ISO SQL standard automatically becomes an American national
>>  (ANSI) standard, changes are only made to the International (ISO/IEC)
>> Standard, which is the authoritative specification.
>>
>> These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section
>> 9.2.
>>
>> Could we rename the proposed default to “ISO/IEC (ANSI)”?
>>
>> — Alastair
>>
>> On Thu, Sep 5, 2019 at 17:17, Reynold Xin  wrote:
>>
>> Having three modes is a lot. Why not just use ansi mode as default, and
>> legacy for backward compatibility? Then over time there's only the ANSI
>> mode, which is standard compliant and easy to understand. We also don't
>> need to invent a standard just for Spark.
>>
>>
>> On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan 
>> wrote:
>>
>>> +1
>>>
>>> To be honest I don't like the legacy policy. It's too loose and easy for
>>> users to make mistakes, especially when Spark returns null if a function
>>> hit errors like overflow.
>>>
>>> The strict policy is not good either. It's too strict and stops valid
>>> use cases like writing timestamp values to a date type column. Users do
>>> expect truncation to happen without adding cast manually in this case. It's
>>> also weird to use a spark specific policy that no other database is using.
>>>
>>> The ANSI policy is better. It stops invalid use cases like writing
>>> string values to an int type column, while keeping valid use cases like
>>> timestamp -> date.
>>>
>>> I think it's no doubt that we should use ANSI policy instead of legacy
>>> policy for v1 tables. Except for backward compatibility, ANSI policy is
>>> literally better than the legacy policy.
>>>
>>> The v2 table is arguable here. Although the ANSI policy is better than
>>> strict policy to me, this is just the store assignment policy, which only
>>> partially controls the table insertion behavior. With Spark's