Hi, Dongjoon,

Please do not misinterpret my point. I already clearly said "I do not know
how to track the popularity of Hadoop 2 vs Hadoop 3."

Also, let me repeat my opinion: the top priority is to provide two options
for the PyPI distribution, Hadoop 3.2 and Hadoop 2.7, and let the end users
choose the one they need. In general, when we want to make any breaking
change, let us follow our protocol documented in
https://spark.apache.org/versioning-policy.html.
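
As a quick aside, here is a minimal sketch of how an end user can check
which Hadoop client their pip-installed PySpark bundles (the session
settings below are illustrative; any local SparkSession will do):

    # Print the Hadoop version bundled with a pip-installed PySpark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("check-hadoop-version")
             .getOrCreate())
    # VersionInfo ships with the bundled Hadoop client libraries.
    print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
    spark.stop()

Having both variants on PyPI would let users pick whichever version this
needs to report for their environment.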

If you just want to change the Jenkins setup, I am OK with it. If you want
to change the default distribution, we need more discussion in the
community to reach an agreement.

 Thanks,

Xiao


On Wed, Jun 24, 2020 at 10:07 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Thanks, Xiao, Sean, Nicholas.
>
> To Xiao,
>
> >  it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>
> If you say so,
> - Apache Hadoop 2.6.0 is the most popular one with 156 dependencies.
> - Apache Spark 2.2.0 is the most popular one with 264 dependencies.
>
> As we know, that doesn't make sense. Are we recommending Apache Spark 2.2.0
> over Apache Spark 3.0.0?
>
> There is a reason why Apache Spark dropped the Hadoop 2.6 profile. Hadoop
> 2.7.4 has many limitations in cloud environments, and Apache Hadoop 3.2 will
> unleash Apache Spark 3.1 in the cloud (as Nicholas also pointed out).
>
> Regarding Sean's comment: yes, we can focus on that later in a different
> thread.
>
> > The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 24, 2020 at 7:26 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> The team I'm on currently uses pip-installed PySpark for local
>> development, and we regularly access S3 directly from our
>> laptops/workstations.
>>
>> One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
>> being able to use a recent version of hadoop-aws that has mature support
>> for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
>> there are incompatibilities that prevent you from using Spark built against
>> Hadoop 2.7 with hadoop-aws version 2.8 or newer.
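>>
>> To make that concrete, here is a minimal sketch of the kind of access we do
>> from laptops (the hadoop-aws version must match the Hadoop version Spark was
>> built against, e.g. a 3.2.x hadoop-aws for a Hadoop 3.2 build; the
>> coordinates, bucket, and keys below are illustrative):
>>
>>     from pyspark.sql import SparkSession
>>
>>     spark = (SparkSession.builder
>>              .appName("s3a-local-dev")
>>              # Pull in the S3A filesystem implementation at runtime.
>>              .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
>>              # Credentials can also come from the default AWS provider chain.
>>              .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
>>              .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
>>              .getOrCreate())
>>
>>     df = spark.read.parquet("s3a://my-bucket/path/to/data/")
>>     df.show()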
>>
>> On Wed, Jun 24, 2020 at 10:15 AM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Will pyspark users care much about the Hadoop version? They won't if running
>>> locally. They will if connecting to a Hadoop cluster. Then again, in that
>>> context, they're probably using a distro anyway that harmonizes it.
>>> Hadoop 3's installed base can't be that large yet; it's been around far
>>> less time.
>>>
>>> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
>>> eventually, not now.
>>> But if the question now is build defaults, is it a big deal either way?
>>>
>>> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <lix...@databricks.com> wrote:
>>>
>>>> I think we just need to provide two options and let end users choose
>>>> the one they need: Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make
>>>> Pyspark Hadoop 3.2+ Variant available in PyPI) is a high-priority task for
>>>> the Spark 3.1 release to me.
>>>>
>>>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3.
>>>> Based on this link,
>>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs, it
>>>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>>>
>>>>
>>>>
