So I thought our theory for the PyPI packages was that they were for local developers, who really shouldn't care about the Hadoop version. If you're running on a production cluster, you ideally pip install from the same release artifacts as your production cluster so the versions match.
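For example, a minimal sketch (the version number here is hypothetical; pin whatever release your cluster actually runs):

```
# Pin the PySpark release to the cluster's exact Spark version,
# rather than taking whatever "pip install pyspark" resolves to today.
pip install pyspark==3.0.0
```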
On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Shall we start a new thread to discuss the bundled Hadoop version in
> PySpark? I don't have a strong opinion on changing the default, as users
> can still download the Hadoop 2.7 version.
>
> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> To Xiao.
>> Why should Apache project releases be blocked by PyPI / CRAN? It's
>> completely optional, isn't it?
>>
>> > let me repeat my opinion: the top priority is to provide two
>> > options for PyPI distribution
>>
>> IIRC, Apache Spark 3.0.0 failed to upload to CRAN, and this is not the
>> first incident. Apache Spark already has a history of missed SparkR
>> uploads. We don't say Spark 3.0.0 failed because of CRAN uploading or
>> other non-Apache distribution channels. In short, non-Apache distribution
>> channels cannot be a `blocker` for Apache project releases. We only do
>> our best for the community.
>>
>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is
>> really irrelevant to this PR. If someone wants to do that and the PR is
>> ready, why don't we do it in the `Apache Spark 3.0.1` timeline? Why would
>> we wait until December? Is there a reason why we need to wait?
>>
>> To Sean.
>> Thanks!
>>
>> To Nicholas.
>> Do you think `pip install pyspark` is version-agnostic? In the Python
>> world, `pip install somepackage` fails frequently. In production, you
>> should use `pip install somepackage==specificversion`. I don't think any
>> production pipeline uses non-versioned Python package installation.
>>
>> The bottom line is that the PR doesn't change PyPI uploading; the as-is
>> PR keeps Hadoop 2.7 for PySpark due to Xiao's comments. I don't think
>> there is a blocker for that PR.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> To rephrase my earlier email, PyPI users would care about the bundled
>>> Hadoop version if they have a workflow that, in effect, looks something
>>> like this:
>>>
>>> ```
>>> pip install pyspark
>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>>> spark.read.parquet('s3a://...')
>>> ```
>>>
>>> I agree that Hadoop 3 would be a better default (again, the s3a support
>>> is just much better). But to Xiao's point, if you are expecting Spark to
>>> work with some package like hadoop-aws that assumes an older version of
>>> Hadoop bundled with Spark, then changing the default may break your
>>> workflow.
>>>
>>> In the case of hadoop-aws the fix is simple: just bump hadoop-aws:2.7.7
>>> to hadoop-aws:3.2.1 (see the sketch at the end of this thread). But
>>> perhaps there are other PyPI-based workflows that would be more
>>> difficult to repair. 🤷‍♂️
>>>
>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I'm also genuinely curious when PyPI users would care about the
>>>> bundled Hadoop jars - do we even need two versions? That itself is
>>>> extra complexity for end users.
>>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>>> care, and better long term.
>>>> OK, but let's at least move ahead with changing defaults.
>>>>
>>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
>>>> >
>>>> > Hi, Dongjoon,
>>>> >
>>>> > Please do not misinterpret my point. I already clearly said "I do
>>>> > not know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>>> >
>>>> > Also, let me repeat my opinion: the top priority is to provide two
>>>> > options for the PyPI distribution and let the end users choose the
>>>> > one they need: Hadoop 3.2 or Hadoop 2.7. In general, when we want to
>>>> > make any breaking change, let us follow our protocol as documented at
>>>> > https://spark.apache.org/versioning-policy.html.
>>>> >
>>>> > If you just want to change the Jenkins setup, I am OK with it. If you
>>>> > want to change the default distribution, we need more discussion in
>>>> > the community to reach an agreement.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Xiao
>>>> >

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
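For reference, the repaired workflow Nicholas describes would look roughly like this (a sketch only; it applies his suggested bump, and it assumes the PyPI package is the Hadoop 3.2 build so that the hadoop-aws artifact matches the bundled Hadoop jars):

```
pip install pyspark
# hadoop-aws bumped from 2.7.7 to 3.2.1 to match a Hadoop 3.2 bundle
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
spark.read.parquet('s3a://...')
```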