Then, it will be a little complex after this PR, and it might confuse the community further.
In PyPI and CRAN, we are using Hadoop 2.7 as the default profile; however, in the other distributions, we would be using Hadoop 3.2 as the default? How do we explain this to the community? For consistency, I would not change the default.

Xiao

On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Thanks. Uploading PySpark to PyPI is a simple manual step, and our release
> script can still build PySpark with Hadoop 2.7 if we want.
> So, `No` for the following question. I updated my PR according to your
> comment.
>
> > If we change the default, will it impact them? If YES,...
>
> From the comment on the PR, the following becomes irrelevant to the
> current PR.
>
> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>
> Bests,
> Dongjoon.
>
> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <lix...@databricks.com> wrote:
>
>> Our monthly PyPI downloads of PySpark have reached 5.4 million. We should
>> avoid forcing the current PySpark users to upgrade their Hadoop versions.
>> If we change the default, will it impact them? If YES, I think we should
>> not do it until it is ready and they have a workaround. So far, our PyPI
>> downloads are still relying on our default version.
>>
>> Please correct me if my concern is not valid.
>>
>> Xiao
>>
>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> I am bumping up this thread again with the title "Use Hadoop-3.2 as a
>>> default Hadoop profile in 3.1.0?"
>>> There is some recent discussion on the following PR. Please let us know
>>> your thoughts.
>>>
>>> https://github.com/apache/spark/pull/28897
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <lix...@databricks.com> wrote:
>>>
>>>> Hi, Steve,
>>>>
>>>> Thanks for your comments! My major quality concern is not with Hadoop
>>>> 3.2 itself. In this release, the Hive execution module upgrade (from 1.2
>>>> to 2.3), the Hive thrift-server upgrade, and JDK 11 support are added to
>>>> the Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the
>>>> Hadoop 3.2 profile is riskier due to these changes.
>>>>
>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>> desirable features, I am proposing to keep the Hadoop 2.x profile as the
>>>> default.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao.
>>>>
>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <ste...@cloudera.com>
>>>> wrote:
>>>>
>>>>> What is the current default value? The 2.x releases are reaching EOL:
>>>>> 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2
>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>> there will inevitably be surprises.
>>>>>
>>>>> One issue with using older versions is that any problem reported
>>>>> (especially stack traces you can blame me for) will generally be met by
>>>>> a response of "does it go away when you upgrade?" The other issue is
>>>>> how much test coverage things are getting.
>>>>>
>>>>> W.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>>> client is there, and the big Guava update (HADOOP-16213) went in.
>>>>> People will either love or hate that.
>>>>>
>>>>> There are no major changes in the s3a code between 3.2.0 and 3.2.1; I
>>>>> have a large backport planned though, including changes to better
>>>>> handle AWS caching of 404s generated from HEAD requests before an
>>>>> object was actually created.
>>>>>
>>>>> It would be really good if the Spark distributions shipped with later
>>>>> versions of the Hadoop artifacts.
>>>>>
>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <lix...@databricks.com> wrote:
>>>>>
>>>>>> The stability and quality of the Hadoop 3.2 profile are unknown. The
>>>>>> changes are massive, including Hive execution and a new version of
>>>>>> the Hive thriftserver.
>>>>>>
>>>>>> To reduce the risk, I would like to keep the current default version
>>>>>> unchanged. When it becomes stable, we can change the default profile
>>>>>> to Hadoop-3.2.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm OK with that, but I don't have a strong opinion or information
>>>>>>> about the implications.
>>>>>>> That said, my guess is we're close to the point where we don't need
>>>>>>> to support Hadoop 2.x anyway, so, yeah.
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi, All.
>>>>>>> >
>>>>>>> > There was a discussion on publishing artifacts built with Hadoop 3.
>>>>>>> > But we are still publishing with Hadoop 2.7.3, and `3.0-preview`
>>>>>>> > will be the same because we haven't changed anything yet.
>>>>>>> >
>>>>>>> > Technically, we need to change two places for publishing:
>>>>>>> >
>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>> > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>> >
>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>> > https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>> >
>>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>>> > profile.
>>>>>>> >
>>>>>>> > Currently, the default is the `hadoop-2.7 (2.7.4)` profile, and
>>>>>>> > `hadoop-3.2 (3.2.0)` is optional.
>>>>>>> > We had better use the `hadoop-3.2` profile by default and
>>>>>>> > `hadoop-2.7` optionally.
>>>>>>> >
>>>>>>> > Note that this means we use Hive 2.3.6 by default. Only the
>>>>>>> > `hadoop-2.7` distribution will use Hive 1.2.1, like Apache Spark
>>>>>>> > 2.4.x.
>>>>>>> >
>>>>>>> > Bests,
>>>>>>> > Dongjoon.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
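
For reference, the switch debated above is a build-time Maven profile choice. Below is a minimal sketch, assuming a Spark 3.0-era source tree, of producing a distribution (and its pip-installable PySpark package) against either profile with Spark's dev/make-distribution.sh; the --name labels are arbitrary.

    # Minimal sketch: build a distribution + PySpark package against the
    # hadoop-3.2 profile, which per the thread also pulls in the Hive 2.3.x
    # execution module and JDK 11 support.
    ./dev/make-distribution.sh --name hadoop3.2 --pip --tgz \
        -Phadoop-3.2 -Phive -Phive-thriftserver -Pyarn

    # The same build pinned to the hadoop-2.7 profile, which per the thread
    # keeps Hive 1.2.1, as in Apache Spark 2.4.x.
    ./dev/make-distribution.sh --name hadoop2.7 --pip --tgz \
        -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn

Changing the default profile would only change which of these two builds you get when no -Phadoop-* flag is passed; both remain buildable either way, which is why Dongjoon notes the release script can still produce a Hadoop 2.7 PySpark.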
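On the question of whether a default change would impact existing PyPI users, one way to check which Hadoop line a pip-installed PySpark actually bundles is to ask the JVM for its version string. This is a diagnostic sketch that leans on PySpark's internal _jvm gateway, not a stable public API:

    # Sketch: print the Hadoop version bundled with the installed PySpark.
    # Uses the internal _jvm py4j gateway; treat as a diagnostic only.
    python -c "from pyspark.sql import SparkSession; \
    spark = SparkSession.builder.master('local[1]').getOrCreate(); \
    print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()); \
    spark.stop()"

At the time of this thread, the PyPI artifact was built with the Hadoop 2.7 profile, so this would be expected to print a 2.7.x version string.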