I think we just need to provide two options and let end users choose the
one they need: Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
Hadoop 3.2+ Variant available in PyPI) is a high-priority task for the Spark
3.1 release, in my view.
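
For illustration, here is a rough, unofficial sketch of how an end user could
check which Hadoop version their PySpark install actually bundles (the
local[1] master and the app name below are just placeholders for this
example; _jvm is an internal handle, not a stable API):

import glob
import os

import pyspark
from pyspark.sql import SparkSession

# Option 1: list the bundled Hadoop jars of a pip-installed PySpark
# without starting a JVM.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))

# Option 2: ask the bundled Hadoop libraries through the Py4J gateway.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("hadoop-version-check")
         .getOrCreate())
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
spark.stop()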

I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based on
this link,
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs, it
sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
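
For the PyPI side at least, download counts can be pulled programmatically; a
rough sketch, assuming the public pypistats.org JSON API and its current
response shape. Note it only reports total downloads and cannot split them by
Hadoop variant:

import json
from urllib.request import urlopen

# Pull recent PySpark download counts from the pypistats.org JSON API.
with urlopen("https://pypistats.org/api/packages/pyspark/recent") as resp:
    stats = json.load(resp)
# Total downloads over the last month (all variants combined).
print(stats["data"]["last_month"])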


On Tue, Jun 23, 2020 at 8:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> I fully understand your concern, but we cannot live with Hadoop 2.7.4
> forever, Xiao. Like Hadoop 2.6, we should let it go.
>
> So, are you saying that CRAN/PyPI should have all combinations of Apache
> Spark, including the Hive 1.2 distribution?
>
> What is your suggestion, as a PMC member, on the Hadoop 3.2 migration path?
> I'd love to remove the roadblocks for that.
>
> As a side note, Homebrew is not an official Apache Spark channel, but it's
> also a popular distribution channel in the community, and it's already using
> the Hadoop 3.2 distribution. Hadoop 2.7 is too old for year 2021 (Apache
> Spark 3.1), isn't it?
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Jun 23, 2020 at 7:55 PM Xiao Li <lix...@databricks.com> wrote:
>
>> Then it will be a little more complex after this PR, and it might confuse
>> the community.
>>
>> In PyPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
>> in the other distributions, we would be using Hadoop 3.2 as the default?
>>
>> How would we explain this to the community? For consistency, I would not
>> change the default.
>>
>> Xiao
>>
>>
>>
>> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Thanks. Uploading PySpark to PyPI is a simple manual step, and our release
>>> script can still build PySpark with Hadoop 2.7 if we want.
>>> So, the answer to the following question is `No`. I updated my PR according
>>> to your comment.
>>>
>>> > If we change the default, will it impact them? If YES,...
>>>
>>> From the comments on the PR, the following becomes irrelevant to the
>>> current PR.
>>>
>>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <lix...@databricks.com> wrote:
>>>
>>>>
>>>> Our monthly PyPI downloads of PySpark have reached 5.4 million. We should
>>>> avoid forcing current PySpark users to upgrade their Hadoop versions. If
>>>> we change the default, will it impact them? If YES, I think we should not
>>>> do it until it is ready and they have a workaround. So far, our PyPI
>>>> downloads still rely on our default version.
>>>>
>>>> Please correct me if my concern is not valid.
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I'm bumping this thread again with the title "Use Hadoop-3.2 as a
>>>>> default Hadoop profile in 3.1.0?"
>>>>> There is some recent discussion on the following PR. Please let us
>>>>> know your thoughts.
>>>>>
>>>>> https://github.com/apache/spark/pull/28897
>>>>>
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <lix...@databricks.com> wrote:
>>>>>
>>>>>> Hi, Steve,
>>>>>>
>>>>>> Thanks for your comments! My major quality concern is not about Hadoop
>>>>>> 3.2 itself. In this release, the Hive execution module upgrade [from 1.2
>>>>>> to 2.3], the Hive thrift-server upgrade, and JDK 11 support are added to
>>>>>> the Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the
>>>>>> Hadoop 3.2 profile is riskier due to these changes.
>>>>>>
>>>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>>>> desirable features, I am proposing to keep the Hadoop 2.x profile as the
>>>>>> default.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xiao.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <ste...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What is the current default value? The 2.x releases are becoming EOL:
>>>>>>> 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2
>>>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>>>> there will inevitably be surprises.
>>>>>>>
>>>>>>> One issue with using older versions is that any problem reported
>>>>>>> -especially stack traces you can blame me for- will generally be met by
>>>>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>>>>> much test coverage these things are getting.
>>>>>>>
>>>>>>> w.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>>>>> client is there, and the big Guava update (HADOOP-16213) went in.
>>>>>>> People will either love or hate that.
>>>>>>>
>>>>>>> No major changes in the s3a code between 3.2.0 and 3.2.1; I have a
>>>>>>> large backport planned though, including changes to better handle AWS
>>>>>>> caching of 404s generated from HEAD requests made before an object was
>>>>>>> actually created.
>>>>>>>
>>>>>>> It would be really good if the Spark distributions shipped with later
>>>>>>> versions of the Hadoop artifacts.
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <lix...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The stability and quality of the Hadoop 3.2 profile are unknown. The
>>>>>>>> changes are massive, including the Hive execution module and a new
>>>>>>>> version of the Hive thriftserver.
>>>>>>>>
>>>>>>>> To reduce the risk, I would like to keep the current default
>>>>>>>> version unchanged. When it becomes stable, we can change the default
>>>>>>>> profile to Hadoop-3.2.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sro...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm OK with that, but I don't have a strong opinion or info about the
>>>>>>>>> implications.
>>>>>>>>> That said, my guess is we're close to the point where we don't need to
>>>>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, All.
>>>>>>>>> >
>>>>>>>>> > There was a discussion on publishing artifacts built with Hadoop 3.
>>>>>>>>> > But we are still publishing with Hadoop 2.7.3, and `3.0-preview`
>>>>>>>>> > will be the same because we didn't change anything yet.
>>>>>>>>> >
>>>>>>>>> > Technically, we need to change two places for publishing.
>>>>>>>>> >
>>>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>>>> >
>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>>>> >
>>>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>>>> >
>>>>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>>>> >
>>>>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>>>>> > profile.
>>>>>>>>> >
>>>>>>>>> > Currently, the default is the `hadoop-2.7 (2.7.4)` profile, and
>>>>>>>>> > `hadoop-3.2 (3.2.0)` is optional.
>>>>>>>>> > We had better use the `hadoop-3.2` profile by default and
>>>>>>>>> > `hadoop-2.7` optionally.
>>>>>>>>> >
>>>>>>>>> > Note that this means we would use Hive 2.3.6 by default. Only the
>>>>>>>>> > `hadoop-2.7` distribution would use `Hive 1.2.1`, like Apache Spark
>>>>>>>>> > 2.4.x.
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>

