So I thought our theory for the PyPI packages was that they were for local developers, who really shouldn't care about the Hadoop version. If you're running on a production cluster, you ideally pip install from the same release artifacts as your production cluster so the versions match.
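For example, a minimal sketch (the version number here is hypothetical; pin whatever release your cluster actually runs):

```
# Pin the PySpark release to the cluster's exact Spark version,
# rather than taking whatever "pip install pyspark" resolves to today.
pip install pyspark==3.0.0
```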
On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Shall we start a new thread to discuss the bundled Hadoop version in
> PySpark? I don't have a strong opinion on changing the default, as users
> can still download the Hadoop 2.7 version.
>
> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> To Xiao.
>> Why should Apache project releases be blocked by PyPI / CRAN? It's
>> completely optional, isn't it?
>>
>> > let me repeat my opinion: the top priority is to provide two
>> > options for PyPI distribution
>>
>> IIRC, Apache Spark 3.0.0 failed to upload to CRAN, and this is not the
>> first incident. Apache Spark already has a history of missed SparkR
>> uploads. We don't say Spark 3.0.0 failed because of CRAN uploading or
>> other non-Apache distribution channels. In short, non-Apache distribution
>> channels cannot be a `blocker` for Apache project releases. We only do
>> our best for the community.
>>
>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is
>> really irrelevant to this PR. If someone wants to do that and the PR is
>> ready, why don't we do it in the `Apache Spark 3.0.1` timeline? Why would
>> we wait until December? Is there a reason why we need to wait?
>>
>> To Sean.
>> Thanks!
>>
>> To Nicholas.
>> Do you think `pip install pyspark` is version-agnostic? In the Python
>> world, `pip install somepackage` fails frequently. In production, you
>> should use `pip install somepackage==specificversion`. I don't think any
>> production pipeline uses non-versioned Python package installation.
>>
>> The bottom line is that the PR doesn't change PyPI uploading; the as-is
>> PR keeps Hadoop 2.7 for PySpark due to Xiao's comments. I don't think
>> there is a blocker for that PR.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> To rephrase my earlier email, PyPI users would care about the bundled
>>> Hadoop version if they have a workflow that, in effect, looks something
>>> like this:
>>>
>>> ```
>>> pip install pyspark
>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>>> spark.read.parquet('s3a://...')
>>> ```
>>>
>>> I agree that Hadoop 3 would be a better default (again, the s3a support
>>> is just much better). But to Xiao's point, if you are expecting Spark to
>>> work with some package like hadoop-aws that assumes an older version of
>>> Hadoop bundled with Spark, then changing the default may break your
>>> workflow.
>>>
>>> In the case of hadoop-aws the fix is simple: just bump hadoop-aws:2.7.7
>>> to hadoop-aws:3.2.1 (see the sketch at the end of this thread). But
>>> perhaps there are other PyPI-based workflows that would be more
>>> difficult to repair. 🤷‍♂️
>>>
>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I'm also genuinely curious when PyPI users would care about the
>>>> bundled Hadoop jars - do we even need two versions? That itself is
>>>> extra complexity for end users.
>>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>>> care, and better long term.
>>>> OK, but let's at least move ahead with changing defaults.
>>>>
>>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
>>>> >
>>>> > Hi, Dongjoon,
>>>> >
>>>> > Please do not misinterpret my point. I already clearly said "I do
>>>> > not know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>>> >
>>>> > Also, let me repeat my opinion: the top priority is to provide two
>>>> > options for the PyPI distribution and let the end users choose the
>>>> > one they need: Hadoop 3.2 or Hadoop 2.7. In general, when we want to
>>>> > make any breaking change, let us follow our protocol as documented at
>>>> > https://spark.apache.org/versioning-policy.html.
>>>> >
>>>> > If you just want to change the Jenkins setup, I am OK with it. If you
>>>> > want to change the default distribution, we need more discussion in
>>>> > the community to reach an agreement.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Xiao
>>>> >

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
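For reference, the repaired workflow Nicholas describes would look roughly like this (a sketch only; it applies his suggested bump, and it assumes the PyPI package is the Hadoop 3.2 build so that the hadoop-aws artifact matches the bundled Hadoop jars):

```
pip install pyspark
# hadoop-aws bumped from 2.7.7 to 3.2.1 to match a Hadoop 3.2 bundle
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
spark.read.parquet('s3a://...')
```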