Re: Spark 3.0 preview release 2?

2019-12-12 Thread Xiao Li
Hi, Yuming,

Thank you, Yuming! It sounds like everyone is fine with releasing a new
Spark 3.0 preview. Could you start working on it?

Thanks,

Xiao

On Tue, Dec 10, 2019 at 2:14 PM Dongjoon Hyun 
wrote:

> [quoted messages trimmed; they appear in full below]


Re: Spark 3.0 preview release 2?

2019-12-10 Thread Dongjoon Hyun
BTW, our Jenkins seems to be behind.

1. For the first item, `Support JDK 11 with Hadoop 2.7`:
At least, we need a new Jenkins job
`spark-master-test-maven-hadoop-2.7-jdk-11/`.
2. https://issues.apache.org/jira/browse/SPARK-28900 (Test Pyspark, SparkR
on JDK 11 with run-tests)
3. https://issues.apache.org/jira/browse/SPARK-29988 (Adjust Jenkins jobs
for `hive-1.2/2.3` combination)

It would be great if we could finish the above three tasks before mentioning
them in the release notes of the next preview.

Bests,
Dongjoon.


On Tue, Dec 10, 2019 at 6:29 AM Tom Graves 
wrote:

> [quoted messages trimmed; they appear in full below]


Re: Spark 3.0 preview release 2?

2019-12-10 Thread Tom Graves
+1 for another preview

Tom

On Monday, December 9, 2019, 12:32:29 AM CST, Xiao Li  wrote:
[quoted message trimmed; the original appears in full at the bottom of the thread]

Re: Spark 3.0 preview release 2?

2019-12-09 Thread Matei Zaharia
Yup, it would be great to release these more often.

> On Dec 9, 2019, at 4:25 PM, Takeshi Yamamuro  wrote:
> [quoted messages trimmed; they appear in full below]



Re: Spark 3.0 preview release 2?

2019-12-09 Thread Takeshi Yamamuro
+1; another preview would be great for gathering more user feedback.

Bests,
Takeshi

On Tue, Dec 10, 2019 at 3:14 AM Dongjoon Hyun 
wrote:

> [quoted messages trimmed; they appear in full below]

-- 
---
Takeshi Yamamuro


Re: Spark 3.0 preview release 2?

2019-12-09 Thread Dongjoon Hyun
Thank you, All.

+1 for another `3.0-preview`.

Also, thank you Yuming for volunteering for that!

Bests,
Dongjoon.


On Mon, Dec 9, 2019 at 9:39 AM Xiao Li  wrote:

> [quoted messages trimmed; they appear in full below]


Re: Spark 3.0 preview release 2?

2019-12-09 Thread Xiao Li
Once we enter the official release candidates, new features have to be
disabled, or even reverted if no conf is available to turn them off, whenever
the fixes are not trivial; otherwise, we might need 10+ RCs to make the final
release. Based on the previous discussions, new features should not block the
release.

I agree we should have a code freeze at the beginning of 2020. The preview
releases should not block the official releases; a preview is just to collect
more feedback about these new features and behavior changes.

Also, for the release of Spark 3.0, we still need the Hive community to do us
a favor and release 2.3.7 with HIVE-22190. Before asking the Hive community
for a 2.3.7 release, if possible, we want our Spark community to do more
testing, especially of JDK 11 support on Hadoop 2.7 and 3.2, which is based
on the Hive 2.3 execution JAR. During the preview stage, we might find more
issues that are not covered by our test cases.



On Mon, Dec 9, 2019 at 4:55 AM Sean Owen  wrote:

> [quoted messages trimmed; they appear in full below]



Re: Spark 3.0 preview release 2?

2019-12-09 Thread Sean Owen
Seems fine to me, of course. Honestly, that wouldn't be a bad result for
a release candidate, though we would probably roll another one now.
How about simply moving to a release candidate? If not now, then at
least move to a code freeze from the start of 2020. There is also some
downside in pushing the 3.0 release out further with previews.

On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  wrote:
>
> [quoted message trimmed; the original appears in full at the bottom of the thread]




Re: Spark 3.0 preview release 2?

2019-12-08 Thread Reynold Xin
If the cost is low, why don't we just do monthly previews until code freeze?
If it is high, maybe we should discuss it and do one when there are people
who volunteer.

On Sun, Dec 08, 2019 at 10:32 PM, Xiao Li < gatorsm...@gmail.com > wrote:

> I got a lot of great feedback from the community about the recent 3.0
> preview release. Since the last 3.0 preview release, we already have 353
> commits [https://github.com/apache/spark/compare/v3.0.0-preview...master].
> There are various important features and behavior changes we want the
> community to try before entering the official release candidates of Spark
> 3.0.
>
> Below are the items I selected that are not part of the last 3.0 preview
> but are already available in the upstream master branch (sketches for a
> few of these follow after this message):
>
> * Support JDK 11 with Hadoop 2.7
> * Spark SQL will respect its own default format (i.e., parquet) when users
> do CREATE TABLE without USING or STORED AS clauses
> * Enable Parquet nested schema pruning and nested pruning on expressions
> by default
> * Add observable metrics for streaming queries
> * Column pruning through nondeterministic expressions
> * RecordBinaryComparator should check endianness when compared by long
> * Improve parallelism for the local shuffle reader in adaptive query
> execution
> * Upgrade Apache Arrow to version 0.15.1
> * Various interval-related SQL support
> * Add a mode to pin Python threads to JVM threads
> * Provide an option to clean up completed files in a streaming query
>
> I am wondering if we can have another preview release for Spark 3.0. This
> can help us find design/API defects as early as possible and avoid a
> significant delay of the upcoming Spark 3.0 release.
>
> Also, is any committer willing to volunteer as the release manager of the
> next preview release of Spark 3.0, if we have such a release?
>
> Cheers,
>
> Xiao
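
For anyone who wants to try the CREATE TABLE behavior change listed above,
here is a minimal sketch of what a test could look like, assuming the new
default is active as described for the master branch; the table and column
names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes a build where CREATE TABLE without USING/STORED AS
// falls back to spark.sql.sources.default (parquet) instead of a Hive serde.
val spark = SparkSession.builder()
  .appName("create-table-default")
  .enableHiveSupport()
  .getOrCreate()

// No USING or STORED AS clause:
spark.sql("CREATE TABLE events (id BIGINT, ts TIMESTAMP)")

// ...which should now behave like the explicit form:
spark.sql("CREATE TABLE events_explicit (id BIGINT, ts TIMESTAMP) USING parquet")

// The chosen provider is visible in the table metadata; with the new
// behavior, both tables should report parquet.
spark.sql("DESCRIBE TABLE EXTENDED events").show(100, truncate = false)
```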
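
The Parquet nested schema pruning item can be checked from spark-shell with a
sketch like the following, assuming `spark.sql.optimizer.nestedSchemaPruning.enabled`
keeps its new default of true; the case classes and paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-pruning").getOrCreate()
import spark.implicits._

case class User(name: String, address: String)
case class Event(id: Long, user: User)

// Write a tiny nested dataset, then read back a single nested field.
Seq(Event(1L, User("alice", "somewhere")))
  .toDF()
  .write.mode("overwrite").parquet("/tmp/events")

// With pruning enabled (the new default), the scan's ReadSchema should
// contain only user.name, e.g. struct<user:struct<name:string>>,
// rather than the full user struct.
spark.read.parquet("/tmp/events").select($"user.name").explain()
```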
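
For the observable metrics item, here is a sketch of one way to exercise it,
assuming the `Dataset.observe` API and the built-in rate source; the metric
name `rate_metrics` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("observe-sketch").getOrCreate()
import spark.implicits._

// Attach named metrics to a streaming Dataset; they are reported with each
// micro-batch's progress instead of requiring a second aggregation query.
val observed = spark.readStream.format("rate").load()
  .observe("rate_metrics", count(lit(1)).as("rows"), max($"value").as("max_value"))

// Pull the metrics back out of the progress events.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val row = event.progress.observedMetrics.get("rate_metrics")
    if (row != null) {
      println(s"rows=${row.getAs[Long]("rows")}, max=${row.getAs[Long]("max_value")}")
    }
  }
})

observed.writeStream.format("console").start()
```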
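
And for the option to clean up completed files in a streaming query, a sketch
assuming the file source options are named `cleanSource` and
`sourceArchiveDir`; all paths here are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clean-source-sketch").getOrCreate()

// Assumed option values: "archive" (move finished files aside),
// "delete", or "off" (the default, i.e. leave files in place).
val input = spark.readStream
  .format("csv")
  .schema("id INT, name STRING")
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/data/archived") // required in archive mode
  .load("/data/incoming")

input.writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/data/checkpoints")
  .start()
```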