Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
I think we just need to provide two options, Hadoop 3.2 or Hadoop 2.7, and
let end users choose the one they need. Thus, SPARK-32017 (Make Pyspark
Hadoop 3.2+ Variant available in PyPI) is a high-priority task for the Spark
3.1 release to me.

I do not know how to track the popularity of Hadoop 2 vs. Hadoop 3. Based on
this link, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs,
it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
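
As a concrete illustration of why this variant choice matters to PyPI users,
here is a minimal sketch (not from this thread) that reports which Hadoop
version a pip-installed PySpark bundles. It assumes `pyspark` is installed
with a local JVM available, and it reaches through the internal Py4J gateway
purely for illustration:

    # Minimal sketch: report the Hadoop version bundled with a pip-installed
    # PySpark. `_jvm` is an internal Py4J gateway; this is illustrative only.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("hadoop-check").getOrCreate()
    )
    hadoop_version = (
        spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
    )
    print(f"PySpark {spark.version} bundles Hadoop {hadoop_version}")
    spark.stop()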


On Tue, Jun 23, 2020 at 8:08 PM Dongjoon Hyun 
wrote:

> I fully understand your concern, but we cannot live with Hadoop 2.7.4
> forever, Xiao. Like Hadoop 2.6, we should let it go.
>
> So, are you saying that CRAN/PyPI should have all combinations of Apache
> Spark, including the Hive 1.2 distribution?
>
> What is your suggestion as a PMC member on the Hadoop 3.2 migration path?
> I'd love to remove the roadblocks for that.
>
> As a side note, Homebrew is not an official Apache Spark channel, but it's
> also a popular distribution channel in the community. And it's already using
> the Hadoop 3.2 distribution. Hadoop 2.7 is too old for year 2021 (Apache
> Spark 3.1), isn't it?
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Jun 23, 2020 at 7:55 PM Xiao Li  wrote:
>
>> Then, it will be a little complex after this PR. It might make the
>> community more confused.
>>
>> In PyPI and CRAN, we are using Hadoop 2.7 as the default profile;
>> however, in the other distributions, we are using Hadoop 3.2 as the
>> default?
>>
>> How do we explain this to the community? For consistency, I would not
>> change the default.
>>
>> Xiao
>>
>>
>>
>> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thanks. Uploading PySpark to PyPI is a simple manual step and
>>> our release script is able to build PySpark with Hadoop 2.7 still if we
>>> want.
>>> So, `No` for the following question. I updated my PR according to your
>>> comment.
>>>
>>> > If we change the default, will it impact them? If YES,...
>>>
>>> From the comment on the PR, the following become irrelevant to the
>>> current PR.
>>>
>>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li  wrote:
>>>

 Our monthly pypi downloads of PySpark have reached 5.4 million. We
 should avoid forcing the current PySpark users to upgrade their Hadoop
 versions. If we change the default, will it impact them? If YES, I think we
 should not do it until it is ready and they have a workaround. So far, our
 pypi downloads are still relying on our default version.

 Please correct me if my concern is not valid.

 Xiao


 On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> I bump up this thread again with the title "Use Hadoop-3.2 as a
> default Hadoop profile in 3.1.0?"
> There exists some recent discussion on the following PR. Please let us
> know your thoughts.
>
> https://github.com/apache/spark/pull/28897
>
>
> Bests,
> Dongjoon.
>
>
> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:
>
>> Hi, Steve,
>>
>> Thanks for your comments! My major quality concern is not against
>> Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
>> 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
>> 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 
>> profile
>> is more risky due to these changes.
>>
>> To speed up the adoption of Spark 3.0, which has many other highly
>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>> default.
>>
>> Cheers,
>>
>> Xiao.
>>
>>
>>
>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
>> wrote:
>>
>>> What is the current default value? as the 2.x releases are becoming
>>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>> there will inevitably be surprises.
>>>
>>> One issue about using older versions is that any problem reported
>>> -especially at stack traces you can blame me for- will generally be met by
>>> a response of "does it go away when you upgrade?" The other issue is how
>>> much test coverage are things getting?
>>>
>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The
>>> ABFS client is there, and the big guava update (HADOOP-16213) went in.
>>> People will either love or hate that.
>>>
>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>> backport planned though, including changes to better handle AWS caching 
>>> of
>>> 404s generated from HEAD requests before an object was actually created.
>>>
>>> It would be really good if the spark distributions shipped with
>>> later versions of the hadoop artifacts.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Dongjoon Hyun
I fully understand your concern, but we cannot live with Hadoop 2.7.4
forever, Xiao. Like Hadoop 2.6, we should let it go.

So, are you saying that CRAN/PyPI should have all combinations of Apache
Spark, including the Hive 1.2 distribution?

What is your suggestion as a PMC member on the Hadoop 3.2 migration path? I'd
love to remove the roadblocks for that.

As a side note, Homebrew is not an official Apache Spark channel, but it's
also a popular distribution channel in the community. And it's already using
the Hadoop 3.2 distribution. Hadoop 2.7 is too old for year 2021 (Apache Spark
3.1), isn't it?

Bests,
Dongjoon.



On Tue, Jun 23, 2020 at 7:55 PM Xiao Li  wrote:

> Then, it will be a little complex after this PR. It might make the
> community more confused.
>
> In PyPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
> in the other distributions, we are using Hadoop 3.2 as the default?
>
> How do we explain this to the community? For consistency, I would not
> change the default.
>
> Xiao
>
>
>
> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun 
> wrote:
>
>> Thanks. Uploading PySpark to PyPI is a simple manual step and our release
>> script is able to build PySpark with Hadoop 2.7 still if we want.
>> So, `No` for the following question. I updated my PR according to your
>> comment.
>>
>> > If we change the default, will it impact them? If YES,...
>>
>> From the comment on the PR, the following become irrelevant to the
>> current PR.
>>
>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li  wrote:
>>
>>>
>>> Our monthly pypi downloads of PySpark have reached 5.4 million. We
>>> should avoid forcing the current PySpark users to upgrade their Hadoop
>>> versions. If we change the default, will it impact them? If YES, I think we
>>> should not do it until it is ready and they have a workaround. So far, our
>>> pypi downloads are still relying on our default version.
>>>
>>> Please correct me if my concern is not valid.
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 I bump up this thread again with the title "Use Hadoop-3.2 as a default
 Hadoop profile in 3.1.0?"
 There exists some recent discussion on the following PR. Please let us
 know your thoughts.

 https://github.com/apache/spark/pull/28897


 Bests,
 Dongjoon.


 On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:

> Hi, Steve,
>
> Thanks for your comments! My major quality concern is not against
> Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
> 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
> 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile
> is more risky due to these changes.
>
> To speed up the adoption of Spark 3.0, which has many other highly
> desirable features, I am proposing to keep Hadoop 2.x profile as the
> default.
>
> Cheers,
>
> Xiao.
>
>
>
> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
> wrote:
>
>> What is the current default value? as the 2.x releases are becoming
>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>> there will inevitably be surprises.
>>
>> One issue about using older versions is that any problem reported
>> -especially at stack traces you can blame me for- will generally be met by
>> a response of "does it go away when you upgrade?" The other issue is how
>> much test coverage are things getting?
>>
>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>> client is there, and the big guava update (HADOOP-16213) went in.
>> People
>> will either love or hate that.
>>
>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>> backport planned though, including changes to better handle AWS caching 
>> of
>> 404s generated from HEAD requests before an object was actually created.
>>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li 
>> wrote:
>>
>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>> changes are massive, including Hive execution and a new version of Hive
>>> thriftserver.
>>>
>>> To reduce the risk, I would like to keep the current default version
>>> unchanged. When it becomes stable, we can change the default profile to
>>> Hadoop-3.2.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>>>
 I'm OK with that, but don't have a strong opinion nor info about the
 implications. That said my guess is we're close to the point where we don't
 need to support Hadoop 2.x anyway, so, yeah.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Sean Owen
So, we also release Spark binary distros with Hadoop 2.7, 3.2, and no
Hadoop -- all of the options. Picking one profile or the other to release
with pypi etc isn't more or less consistent with those releases, as all
exist.

Is this change only about the source code default, with no effect on
planned releases for 3.1.x, etc.? I get that this affects what you get if
you build from source, but the concern wasn't about that audience; it was
about what PyPI users get, which does not change, right?

Although you could also say, why bother -- who cares what the default is --
I do think we need to be moving away from multiple Hadoop and Hive
profiles, and for the audience this would impact at all, developers, it's
probably OK to start lightly pushing by changing defaults?

I don't feel strongly about it at this point; we're not debating changing
any mass-consumption artifacts. So, I'd not object to it either.



On Tue, Jun 23, 2020 at 9:55 PM Xiao Li  wrote:

> Then, it will be a little complex after this PR. It might make the
> community more confused.
>
> In PyPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
> in the other distributions, we are using Hadoop 3.2 as the default?
>
> How do we explain this to the community? For consistency, I would not
> change the default.
>
> Xiao
>
>
>
> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun 
> wrote:
>
>> Thanks. Uploading PySpark to PyPI is a simple manual step and our release
>> script is able to build PySpark with Hadoop 2.7 still if we want.
>> So, `No` for the following question. I updated my PR according to your
>> comment.
>>
>> > If we change the default, will it impact them? If YES,...
>>
>> From the comment on the PR, the following become irrelevant to the
>> current PR.
>>
>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li  wrote:
>>
>>>
>>> Our monthly pypi downloads of PySpark have reached 5.4 million. We
>>> should avoid forcing the current PySpark users to upgrade their Hadoop
>>> versions. If we change the default, will it impact them? If YES, I think we
>>> should not do it until it is ready and they have a workaround. So far, our
>>> pypi downloads are still relying on our default version.
>>>
>>> Please correct me if my concern is not valid.
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 I bump up this thread again with the title "Use Hadoop-3.2 as a default
 Hadoop profile in 3.1.0?"
 There exists some recent discussion on the following PR. Please let us
 know your thoughts.

 https://github.com/apache/spark/pull/28897


 Bests,
 Dongjoon.


 On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:

> Hi, Steve,
>
> Thanks for your comments! My major quality concern is not against
> Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
> 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
> 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile
> is more risky due to these changes.
>
> To speed up the adoption of Spark 3.0, which has many other highly
> desirable features, I am proposing to keep Hadoop 2.x profile as the
> default.
>
> Cheers,
>
> Xiao.
>
>
>
> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
> wrote:
>
>> What is the current default value? as the 2.x releases are becoming
>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>> there will inevitably be surprises.
>>
>> One issue about using older versions is that any problem reported
>> -especially at stack traces you can blame me for- will generally be met by
>> a response of "does it go away when you upgrade?" The other issue is how
>> much test coverage are things getting?
>>
>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>> client is there, and the big guava update (HADOOP-16213) went in.
>> People
>> will either love or hate that.
>>
>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>> backport planned though, including changes to better handle AWS caching 
>> of
>> 404s generated from HEAD requests before an object was actually created.
>>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li 
>> wrote:
>>
>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>> changes are massive, including Hive execution and a new version of Hive
>>> thriftserver.
>>>
>>> To reduce the risk, I would like to keep the current default version
>>> unchanged. When it becomes stable, we can change the default profile to
>>> Hadoop-3.2.

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Hyukjin Kwon
+1.

Just as a note,
- SPARK-31918 is fixed now, and there's no blocker.
- When we build SparkR, we should use the latest R version, at least 4.0.0+.

On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:

> +1
>
> Bests,
> Dongjoon.
>
> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim 
> wrote:
>
>> +1 on a 3.0.1 soon.
>>
>> Probably it would be nice if some Scala experts can take a look at
>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
>> into 3.0.1 if possible.
>> Looks like APIs designed to work with Scala 2.11 & Java bring
>> ambiguity in Scala 2.12 & Java.
>>
>> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Sent from my iPhone
>>> Pardon the dumb thumb typos :)
>>>
>>> On Jun 23, 2020, at 11:36 AM, Holden Karau  wrote:
>>>
>>> 
>>> +1 on a patch release soon
>>>
>>> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
>>> wrote:
>>>
 +1 on doing a new patch release soon. I saw some of these issues when
 preparing the 3.0 release, and some of them are very serious.


 On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
 shiva...@eecs.berkeley.edu> wrote:

> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
> soon.
>
> Shivaram
>
> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
>
> Thanks for the heads-up, Yuanjian!
>
> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>
> wow, the updates are so quick. Anyway, +1 for the release.
>
> Bests,
> Takeshi
>
> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
> wrote:
>
> Hi dev-list,
>
> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
> since 4 blocker issues were found after Spark 3.0.0:
>
> [SPARK-31990] The state store compatibility broken will cause a
> correctness issue when Streaming query with `dropDuplicate` uses the
> checkpoint written by the old Spark version.
>
> [SPARK-32038] The regression bug in handling NaN values in
> COUNT(DISTINCT)
>
> [SPARK-31918][WIP] CRAN requires to make it working with the latest R
> 4.0. It makes the 3.0 release unavailable on CRAN, and only supports R
> [3.5, 4.0)
>
> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>
> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
> think it would be great if we have Spark 3.0.1 to deliver the critical
> fixes.
>
> Any comments are appreciated.
>
> Best,
>
> Yuanjian
>
> --
> ---
> Takeshi Yamamuro
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
Then, it will be a little complex after this PR. It might make the
community more confused.

In PyPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
in the other distributions, we are using Hadoop 3.2 as the default?

How do we explain this to the community? For consistency, I would not change
the default.

Xiao



On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun 
wrote:

> Thanks. Uploading PySpark to PyPI is a simple manual step and our release
> script is able to build PySpark with Hadoop 2.7 still if we want.
> So, `No` for the following question. I updated my PR according to your
> comment.
>
> > If we change the default, will it impact them? If YES,...
>
> From the comment on the PR, the following become irrelevant to the current
> PR.
>
> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>
> Bests,
> Dongjoon.
>
>
>
>
> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li  wrote:
>
>>
>> Our monthly pypi downloads of PySpark have reached 5.4 million. We should
>> avoid forcing the current PySpark users to upgrade their Hadoop versions.
>> If we change the default, will it impact them? If YES, I think we should
>> not do it until it is ready and they have a workaround. So far, our pypi
>> downloads are still relying on our default version.
>>
>> Please correct me if my concern is not valid.
>>
>> Xiao
>>
>>
>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> I bump up this thread again with the title "Use Hadoop-3.2 as a default
>>> Hadoop profile in 3.1.0?"
>>> There exists some recent discussion on the following PR. Please let us
>>> know your thoughts.
>>>
>>> https://github.com/apache/spark/pull/28897
>>>
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:
>>>
 Hi, Steve,

 Thanks for your comments! My major quality concern is not against
 Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile
 is more risky due to these changes.

 To speed up the adoption of Spark 3.0, which has many other highly
 desirable features, I am proposing to keep Hadoop 2.x profile as the
 default.

 Cheers,

 Xiao.



 On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
 wrote:

> What is the current default value? as the 2.x releases are becoming
> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
> there will inevitably be surprises.
>
> One issue about using older versions is that any problem reported
> -especially at stack traces you can blame me for- will generally be met by
> a response of "does it go away when you upgrade?" The other issue is how
> much test coverage are things getting?
>
> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
> client is there, and the big guava update (HADOOP-16213) went in. People
> will either love or hate that.
>
> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
> backport planned though, including changes to better handle AWS caching of
> 404s generated from HEAD requests before an object was actually created.
>
> It would be really good if the spark distributions shipped with later
> versions of the hadoop artifacts.
>
> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:
>
>> The stability and quality of Hadoop 3.2 profile are unknown. The
>> changes are massive, including Hive execution and a new version of Hive
>> thriftserver.
>>
>> To reduce the risk, I would like to keep the current default version
>> unchanged. When it becomes stable, we can change the default profile to
>> Hadoop-3.2.
>>
>> Cheers,
>>
>> Xiao
>>
>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>>
>>> I'm OK with that, but don't have a strong opinion nor info about the
>>> implications.
>>> That said my guess is we're close to the point where we don't need to
>>> support Hadoop 2.x anyway, so, yeah.
>>>
>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > There was a discussion on publishing artifacts built with Hadoop 3
>>> .
>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>> will be the same because we didn't change anything yet.
>>> >
>>> > Technically, we need to change two places for publishing.
>>> >
>>> > 1. Jenkins Snapshot Publishing
>>> >
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>> >
>>> > 2. Release Snapshot/Release Publishing
>>> >
>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Dongjoon Hyun
+1

Bests,
Dongjoon.

On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim 
wrote:

> +1 on a 3.0.1 soon.
>
> Probably it would be nice if some Scala experts can take a look at
> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
> into 3.0.1 if possible.
> Looks like APIs designed to work with Scala 2.11 & Java bring ambiguity in
> Scala 2.12 & Java.
>
> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji  wrote:
>
>> +1 (non-binding)
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Jun 23, 2020, at 11:36 AM, Holden Karau  wrote:
>>
>> 
>> +1 on a patch release soon
>>
>> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin  wrote:
>>
>>> +1 on doing a new patch release soon. I saw some of these issues when
>>> preparing the 3.0 release, and some of them are very serious.
>>>
>>>
>>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
 soon.

 Shivaram

 On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro 
 wrote:

 Thanks for the heads-up, Yuanjian!

 I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.

 wow, the updates are so quick. Anyway, +1 for the release.

 Bests,
 Takeshi

 On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
 wrote:

 Hi dev-list,

 I’m writing this to raise the discussion about Spark 3.0.1 feasibility
 since 4 blocker issues were found after Spark 3.0.0:

 [SPARK-31990] The state store compatibility broken will cause a
 correctness issue when Streaming query with `dropDuplicate` uses the
 checkpoint written by the old Spark version.

 [SPARK-32038] The regression bug in handling NaN values in
 COUNT(DISTINCT)

 [SPARK-31918][WIP] CRAN requires to make it working with the latest R
 4.0. It makes the 3.0 release unavailable on CRAN, and only supports R
 [3.5, 4.0)

 [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression

 I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
 think it would be great if we have Spark 3.0.1 to deliver the critical
 fixes.

 Any comments are appreciated.

 Best,

 Yuanjian

 --
 ---
 Takeshi Yamamuro

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Dongjoon Hyun
Thanks. Uploading PySpark to PyPI is a simple manual step and our release
script is able to build PySpark with Hadoop 2.7 still if we want.
So, `No` for the following question. I updated my PR according to your
comment.
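
As a rough sketch of what building the distribution per Hadoop profile can
look like (assuming a Spark source checkout; `./dev/make-distribution.sh` and
the profile names below follow the ones discussed in this thread, and the
real release script may use different flags):

    # Minimal sketch, not the actual release tooling: build a pip-installable
    # distribution once per Hadoop profile from a Spark source checkout.
    import subprocess

    def build_pyspark_distribution(hadoop_profile: str) -> None:
        """Build a Spark distribution (with PySpark) for the given Hadoop profile."""
        subprocess.run(
            [
                "./dev/make-distribution.sh",
                "--name", hadoop_profile,
                "--pip",
                "--tgz",
                f"-P{hadoop_profile}",
                "-Phive",
                "-Phive-thriftserver",
            ],
            check=True,
        )

    build_pyspark_distribution("hadoop-2.7")  # the variant PyPI ships today
    build_pyspark_distribution("hadoop-3.2")  # the variant under discussion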

> If we change the default, will it impact them? If YES,...

From the comment on the PR, the following becomes irrelevant to the current
PR.

> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)

Bests,
Dongjoon.




On Tue, Jun 23, 2020 at 12:09 AM Xiao Li  wrote:

>
> Our monthly pypi downloads of PySpark have reached 5.4 million. We should
> avoid forcing the current PySpark users to upgrade their Hadoop versions.
> If we change the default, will it impact them? If YES, I think we should
> not do it until it is ready and they have a workaround. So far, our pypi
> downloads are still relying on our default version.
>
> Please correct me if my concern is not valid.
>
> Xiao
>
>
> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> I bump up this thread again with the title "Use Hadoop-3.2 as a default
>> Hadoop profile in 3.1.0?"
>> There exists some recent discussion on the following PR. Please let us
>> know your thoughts.
>>
>> https://github.com/apache/spark/pull/28897
>>
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:
>>
>>> Hi, Steve,
>>>
>>> Thanks for your comments! My major quality concern is not against Hadoop
>>> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
>>> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
>>> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
>>> risky due to these changes.
>>>
>>> To speed up the adoption of Spark 3.0, which has many other highly
>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>> default.
>>>
>>> Cheers,
>>>
>>> Xiao.
>>>
>>>
>>>
>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
>>> wrote:
>>>
 What is the current default value? as the 2.x releases are becoming
 EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
 release getting attention. 2.10.0 shipped yesterday, but the ".0" means
 there will inevitably be surprises.

 One issue about using older versions is that any problem reported
 -especially at stack traces you can blame me for- will generally be met by
 a response of "does it go away when you upgrade?" The other issue is how
 much test coverage are things getting?

 w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
 client is there, and the big guava update (HADOOP-16213) went in. People
 will either love or hate that.

 No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
 backport planned though, including changes to better handle AWS caching of
 404s generated from HEAD requests before an object was actually created.

 It would be really good if the spark distributions shipped with later
 versions of the hadoop artifacts.

 On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:

> The stability and quality of Hadoop 3.2 profile are unknown. The
> changes are massive, including Hive execution and a new version of Hive
> thriftserver.
>
> To reduce the risk, I would like to keep the current default version
> unchanged. When it becomes stable, we can change the default profile to
> Hadoop-3.2.
>
> Cheers,
>
> Xiao
>
> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>
>> I'm OK with that, but don't have a strong opinion nor info about the
>> implications.
>> That said my guess is we're close to the point where we don't need to
>> support Hadoop 2.x anyway, so, yeah.
>>
>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> >
>> > Hi, All.
>> >
>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>> will be the same because we didn't change anything yet.
>> >
>> > Technically, we need to change two places for publishing.
>> >
>> > 1. Jenkins Snapshot Publishing
>> >
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>> >
>> > 2. Release Snapshot/Release Publishing
>> >
>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>> >
>> > To minimize the change, we need to switch our default Hadoop
>> profile.
>> >
>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>> `hadoop-3.2 (3.2.0)` is optional.
>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>> optionally.
>> >
>> > Note that this means we use Hive 2.3.6 by default. Only
>> `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.

m2 cache issues in Jenkins?

2020-06-23 Thread Holden Karau
Hi Folks,

I've been seeing some weird failures on Jenkins and it looks like it might be
from the m2 cache. Would it be OK to clean it out? Or is it important?

Cheers,

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Jungtaek Lim
+1 on a 3.0.1 soon.

Probably it would be nice if some Scala experts can take a look at
https://issues.apache.org/jira/browse/SPARK-32051 and include the fix into
3.0.1 if possible.
Looks like APIs designed to work with Scala 2.11 & Java bring ambiguity in
Scala 2.12 & Java.

On Wed, Jun 24, 2020 at 4:52 AM Jules Damji  wrote:

> +1 (non-binding)
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jun 23, 2020, at 11:36 AM, Holden Karau  wrote:
>
> 
> +1 on a patch release soon
>
> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin  wrote:
>
>> +1 on doing a new patch release soon. I saw some of these issues when
>> preparing the 3.0 release, and some of them are very serious.
>>
>>
>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
>>> soon.
>>>
>>> Shivaram
>>>
>>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro 
>>> wrote:
>>>
>>> Thanks for the heads-up, Yuanjian!
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>>
>>> wow, the updates are so quick. Anyway, +1 for the release.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
>>> wrote:
>>>
>>> Hi dev-list,
>>>
>>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
>>> since 4 blocker issues were found after Spark 3.0.0:
>>>
>>> [SPARK-31990] The state store compatibility broken will cause a
>>> correctness issue when Streaming query with `dropDuplicate` uses the
>>> checkpoint written by the old Spark version.
>>>
>>> [SPARK-32038] The regression bug in handling NaN values in
>>> COUNT(DISTINCT)
>>>
>>> [SPARK-31918][WIP] CRAN requires to make it working with the latest R
>>> 4.0. It makes the 3.0 release unavailable on CRAN, and only supports R
>>> [3.5, 4.0)
>>>
>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
>>> think it would be great if we have Spark 3.0.1 to deliver the critical
>>> fixes.
>>>
>>> Any comments are appreciated.
>>>
>>> Best,
>>>
>>> Yuanjian
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>> - To
>>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Jules Damji
+1 (non-binding)

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Jun 23, 2020, at 11:36 AM, Holden Karau  wrote:
> 
> 
> +1 on a patch release soon
> 
>> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin  wrote:
>> +1 on doing a new patch release soon. I saw some of these issues when 
>> preparing the 3.0 release, and some of them are very serious.
>> 
>> 
>>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman 
>>>  wrote:
>>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release soon.
>>> 
>>> Shivaram
>>> 
>>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro  
>>> wrote:
>>> 
>>> Thanks for the heads-up, Yuanjian!
>>> 
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>> 
>>> wow, the updates are so quick. Anyway, +1 for the release.
>>> 
>>> Bests, 
>>> Takeshi
>>> 
>>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li  wrote:
>>> 
>>> Hi dev-list,
>>> 
>>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility 
>>> since 4 blocker issues were found after Spark 3.0.0:
>>> 
>>> [SPARK-31990] The state store compatibility broken will cause a correctness 
>>> issue when Streaming query with `dropDuplicate` uses the checkpoint written 
>>> by the old Spark version.
>>> 
>>> [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT)
>>> 
>>> [SPARK-31918][WIP] CRAN requires to make it working with the latest R 4.0. 
>>> It makes the 3.0 release unavailable on CRAN, and only supports R [3.5, 4.0)
>>> 
>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>> 
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I think 
>>> it would be great if we have Spark 3.0.1 to deliver the critical fixes.
>>> 
>>> Any comments are appreciated.
>>> 
>>> Best,
>>> 
>>> Yuanjian
>>> 
>>> -- 
>>> --- 
>>> Takeshi Yamamuro
>>> 
>>> - To 
>>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>> 
> 
> 
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Holden Karau
+1 on a patch release soon

On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin  wrote:

> +1 on doing a new patch release soon. I saw some of these issues when
> preparing the 3.0 release, and some of them are very serious.
>
>
> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
>> soon.
>>
>> Shivaram
>>
>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro 
>> wrote:
>>
>> Thanks for the heads-up, Yuanjian!
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>
>> wow, the updates are so quick. Anyway, +1 for the release.
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
>> wrote:
>>
>> Hi dev-list,
>>
>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
>> since 4 blocker issues were found after Spark 3.0.0:
>>
>> [SPARK-31990] The state store compatibility broken will cause a
>> correctness issue when Streaming query with `dropDuplicate` uses the
>> checkpoint written by the old Spark version.
>>
>> [SPARK-32038] The regression bug in handling NaN values in
>> COUNT(DISTINCT)
>>
>> [SPARK-31918][WIP] CRAN requires to make it working with the latest R
>> 4.0. It makes the 3.0 release unavailable on CRAN, and only supports R
>> [3.5, 4.0)
>>
>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
>> think it would be great if we have Spark 3.0.1 to deliver the critical
>> fixes.
>>
>> Any comments are appreciated.
>>
>> Best,
>>
>> Yuanjian
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>> - To
>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Reynold Xin
+1 on doing a new patch release soon. I saw some of these issues when preparing 
the 3.0 release, and some of them are very serious.

On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman < 
shiva...@eecs.berkeley.edu > wrote:

> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release soon.
>
> Shivaram
>
> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro < linguin.m.s@gmail.com > wrote:
>
>> Thanks for the heads-up, Yuanjian!
>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>
>> wow, the updates are so quick. Anyway, +1 for the release.
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li < xyliyuanjian@gmail.com > wrote:
>>
>>> Hi dev-list,
>>>
>>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
>>> since 4 blocker issues were found after Spark 3.0.0:
>>>
>>> [SPARK-31990] The state store compatibility broken will cause a
>>> correctness issue when Streaming query with `dropDuplicate` uses the
>>> checkpoint written by the old Spark version.
>>>
>>> [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT)
>>>
>>> [SPARK-31918][WIP] CRAN requires to make it working with the latest R 4.0.
>>> It makes the 3.0 release unavailable on CRAN, and only supports R [3.5, 4.0)
>>>
>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
>>> think it would be great if we have Spark 3.0.1 to deliver the critical
>>> fixes.
>>>
>>> Any comments are appreciated.
>>>
>>> Best,
>>>
>>> Yuanjian
>>
>> --
>> ---
>> Takeshi Yamamuro
>
> -
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



Unsubscribe

2020-06-23 Thread Ankit Sinha
Unsubscribe
-- 
Ankit


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Shivaram Venkataraman
+1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release soon.

Shivaram

On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro  wrote:
>
> Thanks for the heads-up, Yuanjian!
>
> > I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
> wow, the updates are so quick. Anyway, +1 for the release.
>
> Bests,
> Takeshi
>
> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li  wrote:
>>
>> Hi dev-list,
>>
>>
>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility since 
>> 4 blocker issues were found after Spark 3.0.0:
>>
>>
>> [SPARK-31990] The state store compatibility broken will cause a correctness 
>> issue when Streaming query with `dropDuplicate` uses the checkpoint written 
>> by the old Spark version.
>>
>> [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT)
>>
>> [SPARK-31918][WIP] CRAN requires to make it working with the latest R 4.0. 
>> It makes the 3.0 release unavailable on CRAN, and only supports R [3.5, 4.0)
>>
>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I think 
>> it would be great if we have Spark 3.0.1 to deliver the critical fixes.
>>
>>
>> Any comments are appreciated.
>>
>>
>> Best,
>>
>> Yuanjian
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Unsubscribe

2020-06-23 Thread Rohit Mishra
I think you have sent this request to the wrong email address. I don’t want to
unsubscribe from the dev mailing list. I don’t remember sending any such
request.

Regards,
Rohit Mishra

On Tue, 23 Jun 2020 at 7:39 PM, Jeff Evans 
wrote:

> That is not how you unsubscribe.  See here:
> https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e
>
> On Tue, Jun 23, 2020 at 5:02 AM Kiran Kumar Dusi 
> wrote:
>
>> Unsubscribe
>>
>> On Tue, 23 Jun 2020 at 15:18 Akhil Anil  wrote:
>>
>
>> --
>>> Sent from Gmail Mobile
>>>
>>


Re: Unsubscribe

2020-06-23 Thread Jeff Evans
That is not how you unsubscribe.  See here:
https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e

On Tue, Jun 23, 2020 at 5:02 AM Kiran Kumar Dusi 
wrote:

> Unsubscribe
>
> On Tue, 23 Jun 2020 at 15:18 Akhil Anil  wrote:
>
>> --
>> Sent from Gmail Mobile
>>
>


Re: Unsubscribe

2020-06-23 Thread Kiran Kumar Dusi
Unsubscribe

On Tue, 23 Jun 2020 at 15:18 Akhil Anil  wrote:

> --
> Sent from Gmail Mobile
>


Unsubscribe

2020-06-23 Thread Akhil Anil
-- 
Sent from Gmail Mobile


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Takeshi Yamamuro
Thanks for the heads-up, Yuanjian!

> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
wow, the updates are so quick. Anyway, +1 for the release.

Bests,
Takeshi

On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li  wrote:

> Hi dev-list,
>
> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
> since 4 blocker issues were found after Spark 3.0.0:
>
>
>    1. [SPARK-31990] The state store compatibility broken will cause a
>       correctness issue when Streaming query with `dropDuplicate` uses the
>       checkpoint written by the old Spark version.
>    2. [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT)
>    3. [SPARK-31918][WIP] CRAN requires to make it working with the latest R
>       4.0. It makes the 3.0 release unavailable on CRAN, and only supports R
>       [3.5, 4.0)
>    4. [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>
> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I think
> it would be great if we have Spark 3.0.1 to deliver the critical fixes.
>
> Any comments are appreciated.
>
> Best,
>
> Yuanjian
>
>

-- 
---
Takeshi Yamamuro


[DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Yuanjian Li
Hi dev-list,

I’m writing this to raise the discussion about Spark 3.0.1 feasibility
since 4 blocker issues were found after Spark 3.0.0:


   1. [SPARK-31990] The break in state store compatibility will cause a
      correctness issue when a streaming query with `dropDuplicate` uses a
      checkpoint written by an older Spark version.
   2. [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT).
   3. [SPARK-31918][WIP] CRAN requires it to work with the latest R 4.0. This
      makes the 3.0 release unavailable on CRAN; it only supports R [3.5, 4.0).
   4. [SPARK-31967] Downgrade vis.js to fix the Jobs UI loading time regression.

I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I think
it would be great if we have Spark 3.0.1 to deliver the critical fixes.

Any comments are appreciated.

Best,

Yuanjian


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
Our monthly pypi downloads of PySpark have reached 5.4 million. We should
avoid forcing the current PySpark users to upgrade their Hadoop versions.
If we change the default, will it impact them? If YES, I think we should
not do it until it is ready and they have a workaround. So far, our pypi
downloads are still relying on our default version.

Please correct me if my concern is not valid.

Xiao


On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> I bump up this thread again with the title "Use Hadoop-3.2 as a default
> Hadoop profile in 3.1.0?"
> There exists some recent discussion on the following PR. Please let us
> know your thoughts.
>
> https://github.com/apache/spark/pull/28897
>
>
> Bests,
> Dongjoon.
>
>
> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:
>
>> Hi, Steve,
>>
>> Thanks for your comments! My major quality concern is not against Hadoop
>> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
>> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
>> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
>> risky due to these changes.
>>
>> To speed up the adoption of Spark 3.0, which has many other highly
>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>> default.
>>
>> Cheers,
>>
>> Xiao.
>>
>>
>>
>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran 
>> wrote:
>>
>>> What is the current default value? as the 2.x releases are becoming EOL;
>>> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
>>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>>> inevitably be surprises.
>>>
>>> One issue about using older versions is that any problem reported
>>> -especially at stack traces you can blame me for- will generally be met by
>>> a response of "does it go away when you upgrade?" The other issue is how
>>> much test coverage are things getting?
>>>
>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>> client is there, and the big guava update (HADOOP-16213) went in. People
>>> will either love or hate that.
>>>
>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>> backport planned though, including changes to better handle AWS caching of
>>> 404s generated from HEAD requests before an object was actually created.
>>>
>>> It would be really good if the spark distributions shipped with later
>>> versions of the hadoop artifacts.
>>>
>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:
>>>
 The stability and quality of Hadoop 3.2 profile are unknown. The
 changes are massive, including Hive execution and a new version of Hive
 thriftserver.

 To reduce the risk, I would like to keep the current default version
 unchanged. When it becomes stable, we can change the default profile to
 Hadoop-3.2.

 Cheers,

 Xiao

 On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:

> I'm OK with that, but don't have a strong opinion nor info about the
> implications.
> That said my guess is we're close to the point where we don't need to
> support Hadoop 2.x anyway, so, yeah.
>
> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > There was a discussion on publishing artifacts built with Hadoop 3 .
> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
> will be the same because we didn't change anything yet.
> >
> > Technically, we need to change two places for publishing.
> >
> > 1. Jenkins Snapshot Publishing
> >
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> >
> > 2. Release Snapshot/Release Publishing
> >
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
> >
> > To minimize the change, we need to switch our default Hadoop profile.
> >
> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
> `hadoop-3.2 (3.2.0)` is optional.
> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
> optionally.
> >
> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
> >
> > Bests,
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

 --
 [image: Databricks Summit - Watch the talks]
 

>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> 
>>
>

-- 



Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Dongjoon Hyun
Hi, All.

I bump up this thread again with the title "Use Hadoop-3.2 as a default
Hadoop profile in 3.1.0?"
There exists some recent discussion on the following PR. Please let us know
your thoughts.

https://github.com/apache/spark/pull/28897


Bests,
Dongjoon.


On Fri, Nov 1, 2019 at 9:41 AM Xiao Li  wrote:

> Hi, Steve,
>
> Thanks for your comments! My major quality concern is not against Hadoop
> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
> risky due to these changes.
>
> To speed up the adoption of Spark 3.0, which has many other highly
> desirable features, I am proposing to keep Hadoop 2.x profile as the
> default.
>
> Cheers,
>
> Xiao.
>
>
>
> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran  wrote:
>
>> What is the current default value? as the 2.x releases are becoming EOL;
>> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>> inevitably be surprises.
>>
>> One issue about using older versions is that any problem reported
>> -especially at stack traces you can blame me for- will generally be met by
>> a response of "does it go away when you upgrade?" The other issue is how
>> much test coverage are things getting?
>>
>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>> client is there, and the big guava update (HADOOP-16213) went in. People
>> will either love or hate that.
>>
>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>> backport planned though, including changes to better handle AWS caching of
>> 404s generated from HEAD requests before an object was actually created.
>>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li  wrote:
>>
>>> The stability and quality of Hadoop 3.2 profile are unknown. The changes
>>> are massive, including Hive execution and a new version of Hive
>>> thriftserver.
>>>
>>> To reduce the risk, I would like to keep the current default version
>>> unchanged. When it becomes stable, we can change the default profile to
>>> Hadoop-3.2.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen  wrote:
>>>
 I'm OK with that, but don't have a strong opinion nor info about the
 implications.
 That said my guess is we're close to the point where we don't need to
 support Hadoop 2.x anyway, so, yeah.

 On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun 
 wrote:
 >
 > Hi, All.
 >
 > There was a discussion on publishing artifacts built with Hadoop 3 .
 > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
 be the same because we didn't change anything yet.
 >
 > Technically, we need to change two places for publishing.
 >
 > 1. Jenkins Snapshot Publishing
 >
 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
 >
 > 2. Release Snapshot/Release Publishing
 >
 https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
 >
 > To minimize the change, we need to switch our default Hadoop profile.
 >
 > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
 `hadoop-3.2 (3.2.0)` is optional.
 > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
 optionally.
 >
 > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
 distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
 >
 > Bests,
 > Dongjoon.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>> --
>>> [image: Databricks Summit - Watch the talks]
>>> 
>>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> 
>
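
To make the profile switch described in the quoted messages concrete, here is
a minimal sketch (assuming a Spark source checkout and its `./build/mvn`
wrapper; the profile names and versions are the ones quoted above, and the
command line is illustrative rather than the actual release tooling):

    # Minimal sketch: package Spark against either Hadoop profile. Per the
    # thread, hadoop-2.7 (2.7.4) is the current default and hadoop-3.2 (3.2.0)
    # is optional; the proposal is to flip which profile is active by default.
    import subprocess

    def package_with_profile(hadoop_profile: str) -> None:
        """Package Spark with an explicitly selected Hadoop profile."""
        subprocess.run(
            ["./build/mvn", "-DskipTests", f"-P{hadoop_profile}", "clean", "package"],
            check=True,
        )

    package_with_profile("hadoop-2.7")  # today's default, made explicit
    package_with_profile("hadoop-3.2")  # what the proposed new default selects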