So, we also release Spark binary distros with Hadoop 2.7, with Hadoop 3.2, and
without Hadoop -- all of the options. Picking one profile or the other to
release to PyPI etc. isn't more or less consistent with those releases, as all
of them exist.
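
For reference, a rough sketch of how those flavors map to build profiles. These
invocations are illustrative assumptions only (profile names like
-Phadoop-provided follow the usual Spark build); the canonical flags are in
dev/create-release/release-build.sh:

    # Hadoop 2.7 flavor
    ./dev/make-distribution.sh --name hadoop2.7 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
    # Hadoop 3.2 flavor
    ./dev/make-distribution.sh --name hadoop3.2 --tgz -Phadoop-3.2 -Phive -Phive-thriftserver -Pyarn
    # "without Hadoop" flavor, expecting Hadoop to be provided on the classpath
    ./dev/make-distribution.sh --name without-hadoop --tgz -Phadoop-provided -Pyarn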

Is this change only about the source-code default, with no effect on
planned releases for 3.1.x, etc.? I get that this affects what you get if
you build from source, but the concern wasn't about that audience; it was about
what PyPI users get, which does not change, right?

Although you could also ask why bother -- who cares what the default is --
I do think we need to be moving away from multiple Hadoop and Hive
profiles, and for the only audience this would affect at all, developers,
it's probably OK to start lightly pushing by changing defaults?

I don't feel strongly about it at this point; we're not debating changing
any mass-consumption artifacts. So I wouldn't object to it either.



On Tue, Jun 23, 2020 at 9:55 PM Xiao Li <lix...@databricks.com> wrote:

> Then it will be a little more complex after this PR. It might confuse the
> community.
>
> In PyPI and CRAN, we would be using Hadoop 2.7 as the default profile; however,
> in the other distributions, we would be using Hadoop 3.2 as the default?
>
> How do we explain this to the community? For consistency, I would not change
> the default.
>
> Xiao
>
>
>
> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Thanks. Uploading PySpark to PyPI is a simple manual step, and our release
>> script can still build PySpark with Hadoop 2.7 if we want.
>> So, `No` to the following question. I updated my PR according to your
>> comment.
>>
>> > If we change the default, will it impact them? If YES,...
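>>
>> A minimal sketch of such a build (the authoritative flags live in
>> dev/create-release/release-build.sh, so treat these options as an
>> assumption rather than the exact release command):
>>
>>   ./dev/make-distribution.sh --name hadoop2.7 --pip --tgz \
>>     -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
>>
>> The -Phadoop-2.7 profile pins the Hadoop dependency, and --pip builds the
>> Python package inside the distribution.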
>>
>> Per the comment on the PR, the following becomes irrelevant to the
>> current PR.
>>
>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <lix...@databricks.com> wrote:
>>
>>>
>>> Our monthly PyPI downloads of PySpark have reached 5.4 million. We
>>> should avoid forcing current PySpark users to upgrade their Hadoop
>>> versions. If we change the default, will it impact them? If YES, I think we
>>> should not do it until it is ready and they have a workaround. So far, our
>>> PyPI downloads still rely on our default version.
>>>
>>> Please correct me if my concern is not valid.
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> I'm bumping this thread again with the title "Use Hadoop-3.2 as a default
>>>> Hadoop profile in 3.1.0?"
>>>> There has been some recent discussion on the following PR. Please let us
>>>> know your thoughts.
>>>>
>>>> https://github.com/apache/spark/pull/28897
>>>>
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <lix...@databricks.com> wrote:
>>>>
>>>>> Hi, Steve,
>>>>>
>>>>> Thanks for your comments! My major quality concern is not about
>>>>> Hadoop 3.2 itself. In this release, the Hive execution module upgrade (from
>>>>> 1.2 to 2.3), the Hive Thrift server upgrade, and JDK 11 support are added
>>>>> to the Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the
>>>>> Hadoop 3.2 profile is riskier due to these changes.
>>>>>
>>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>>> desirable features, I am proposing to keep the Hadoop 2.x profile as the
>>>>> default.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <ste...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> What is the current default value? The 2.x releases are becoming
>>>>>> EOL: 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2
>>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>>> there will inevitably be surprises.
>>>>>>
>>>>>> One issue with using older versions is that any problem reported
>>>>>> (especially stack traces you can blame me for) will generally be met by
>>>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>>>> much test coverage things are getting.
>>>>>>
>>>>>> w.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>>>> client is there, and the big Guava update (HADOOP-16213) went in. People
>>>>>> will either love or hate that.
>>>>>>
>>>>>> There are no major changes in the s3a code between 3.2.0 and 3.2.1; I
>>>>>> have a large backport planned though, including changes to better handle
>>>>>> AWS caching of 404s generated from HEAD requests before an object was
>>>>>> actually created.
>>>>>>
>>>>>> It would be really good if the Spark distributions shipped with later
>>>>>> versions of the Hadoop artifacts.
>>>>>>
>>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <lix...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The stability and quality of the Hadoop 3.2 profile are unknown. The
>>>>>>> changes are massive, including the Hive execution module upgrade and a
>>>>>>> new version of the Hive Thrift server.
>>>>>>>
>>>>>>> To reduce the risk, I would like to keep the current default version
>>>>>>> unchanged. When it becomes stable, we can change the default profile to
>>>>>>> Hadoop-3.2.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm OK with that, but I don't have a strong opinion or much info about
>>>>>>>> the implications.
>>>>>>>> That said, my guess is we're close to the point where we don't need to
>>>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, All.
>>>>>>>> >
>>>>>>>> > There was a discussion on publishing artifacts built with Hadoop 3.
>>>>>>>> > But we are still publishing with Hadoop 2.7.3, and `3.0-preview`
>>>>>>>> > will be the same because we haven't changed anything yet.
>>>>>>>> >
>>>>>>>> > Technically, we need to change two places for publishing.
>>>>>>>> >
>>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>>> >
>>>>>>>> > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>>> >
>>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>>> >
>>>>>>>> > https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>>> >
>>>>>>>> > To minimize the change, we need to switch our default Hadoop profile.
>>>>>>>> >
>>>>>>>> > Currently, the default is the `hadoop-2.7` (2.7.4) profile, and
>>>>>>>> > `hadoop-3.2` (3.2.0) is optional.
>>>>>>>> > We had better use the `hadoop-3.2` profile by default and
>>>>>>>> > `hadoop-2.7` optionally.
>>>>>>>> >
>>>>>>>> > Note that this means we would use Hive 2.3.6 by default. Only the
>>>>>>>> > `hadoop-2.7` distribution would use Hive 1.2.1, like Apache Spark 2.4.x.
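>>>>>>>> >
>>>>>>>> > (Illustrative only, assuming the usual profile names in the Spark
>>>>>>>> > Maven build; the switch would just flip which profile is active when
>>>>>>>> > none is given, and developers could still opt back in explicitly:
>>>>>>>> >
>>>>>>>> >   # would pick up hadoop-3.2 by default after the switch
>>>>>>>> >   ./build/mvn -DskipTests clean package
>>>>>>>> >   # explicitly keep the old Hadoop line
>>>>>>>> >   ./build/mvn -Phadoop-2.7 -DskipTests clean package
>>>>>>>> > )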
>>>>>>>> >
>>>>>>>> > Bests,
>>>>>>>> > Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
