Thanks, Xiao, Sean, Nicholas.

To Xiao,

>  it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.

By that measure,
- Apache Hadoop 2.6.0 is the most popular release, with 156 dependent artifacts.
- Apache Spark 2.2.0 is the most popular release, with 264 dependent artifacts.

As we know, that doesn't make sense. Are we recommending Apache Spark 2.2.0
over Apache Spark 3.0.0?

There is a reason why Apache Spark dropped the Hadoop 2.6 profile. Hadoop 2.7.4
has many limitations in cloud environments. Apache Hadoop 3.2 will unleash
Apache Spark 3.1 in the cloud environment (Nicholas also pointed this out).
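
For illustration only (a rough sketch with a placeholder bucket path and a
connector version I am assuming here, not an official recipe): with the Hadoop
3.2 build, direct s3a access from pip-installed PySpark looks roughly like this.

from pyspark.sql import SparkSession

# Pull in the AWS connector matching the Hadoop version Spark was built
# against; for the Hadoop 3.2 build that would be hadoop-aws 3.2.x.
spark = (
    SparkSession.builder
    .appName("s3a-example")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    # Use the standard AWS credential chain (env vars, profile, instance role).
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

# "my-bucket" is a placeholder; any s3a:// path works the same way.
df = spark.read.parquet("s3a://my-bucket/some/prefix/")
df.show()

With the Hadoop 2.7 build, the same thing either breaks or forces you onto a
much older hadoop-aws, which is the incompatibility Nicholas describes below.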

As for Sean's comment: yes, we can focus on that later in a different thread.

> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.

Bests,
Dongjoon.


On Wed, Jun 24, 2020 at 7:26 AM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> The team I'm on currently uses pip-installed PySpark for local
> development, and we regularly access S3 directly from our
> laptops/workstations.
>
> One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
> being able to use a recent version of hadoop-aws that has mature support
> for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
> there are incompatibilities that prevent you from using Spark built against
> Hadoop 2.7 with hadoop-aws version 2.8 or newer.
>
> On Wed, Jun 24, 2020 at 10:15 AM Sean Owen <sro...@gmail.com> wrote:
>
>> Will pyspark users care much about Hadoop version? They won't if running
>> locally. They will if connecting to a Hadoop cluster. Then again, in that
>> context, they're probably using a distro anyway that harmonizes it.
>> Hadoop 3's installed base can't be that large yet; it's been around far
>> less time.
>>
>> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
>> eventually, not now.
>> But if the question now is build defaults, is it a big deal either way?
>>
>> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <lix...@databricks.com> wrote:
>>
>>> I think we just need to provide two options and let end users choose the
>>> one they need: Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
>>> Hadoop 3.2+ Variant available in PyPI) is a high-priority task for the
>>> Spark 3.1 release to me.
>>>
>>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
>>> on this link
>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>>
>>>
>>>
