On Sun, 12 Jul 2020 at 01:45, gpongracz wrote:
As someone who mainly operates in AWS, it would be very welcome to have the
option to use an updated version of Hadoop with PySpark sourced from PyPI.
Acknowledging the issues of backwards compatibility...
The most vexing issue is the lack of ability to use s3a STS, i.e.
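A minimal sketch of the s3a-with-STS setup being asked for here, assuming a hypothetical PyPI build whose bundled Hadoop client is 2.8+/3.x (Hadoop 2.7's s3a connector has no session-token support); the hadoop-aws version, credential values, and bucket path below are placeholders:
```
pip install pyspark   # assumes a build bundling Hadoop 3.x, which is the point of the request
pyspark \
  --packages org.apache.hadoop:hadoop-aws:3.2.0 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.access.key=PLACEHOLDER_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=PLACEHOLDER_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.session.token=PLACEHOLDER_SESSION_TOKEN
spark.read.parquet('s3a://...')
```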
I don't have a strong opinion on changing the default either, but I would
slightly prefer to have the option to switch the Hadoop version first, just
to stay safer.
To be clear, we're now mostly discussing the timing of when to set
Hadoop 3.0.0 as the default, and which change has to come first.
Hello,
On Wed, Jun 24, 2020 at 2:13 PM Holden Karau wrote:
So I thought our theory for the PyPI packages was that it was for local
developers, who really shouldn't care about the Hadoop version. If you're
running on a production cluster, you ideally pip install from the same
release artifacts as your production cluster to match.
On Wed, Jun 24, 2020 at 12:11
Shall we start a new thread to discuss the bundled Hadoop version in
PySpark? I don't have a strong opinion on changing the default, as users
can still download the Hadoop 2.7 version.
On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun wrote:
To Xiao.
Why should Apache project releases be blocked by PyPI / CRAN? It's
completely optional, isn't it?
> let me repeat my opinion: the top priority is to provide two options
> for PyPi distribution
IIRC, Apache Spark 3.0.0 failed to upload to CRAN, and this is not the first
incident. Apache
To rephrase my earlier email, PyPI users would care about the bundled
Hadoop version if they have a workflow that, in effect, looks something
like this:
```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```
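For comparison, here is a hedged sketch of the same workflow against a hypothetical Hadoop 3.2-based PyPI build; the point is that the hadoop-aws version passed to --packages has to line up with the Hadoop jars bundled in the wheel (3.2.0 below is illustrative, not a confirmed artifact choice):
```
pip install pyspark                                     # hypothetically bundling Hadoop 3.2
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.0   # must match the bundled hadoop-common
spark.read.parquet('s3a://...')
```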
I agree that Hadoop 3 would
I'm also genuinely curious when PyPI users would care about the
bundled Hadoop jars. Do we even need two versions? That itself is
extra complexity for end users.
I do think Hadoop 3 is the better choice for the user who doesn't
care, and better long term.
OK but let's at least move ahead with
Hi, Dongjoon,
Please do not misinterpret my point. I already clearly said "I do not know
how to track the popularity of Hadoop 2 vs Hadoop 3."
Also, let me repeat my opinion: the top priority is to provide two options
for PyPi distribution and let the end users choose the ones they need.
Hadoop
Thanks, Xiao, Sean, Nicholas.
To Xiao,
> it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
If you say so, then by the same dependency-count measure:
- Apache Hadoop 2.6.0 is the most popular one, with 156 dependencies.
- Apache Spark 2.2.0 is the most popular one, with 264 dependencies.
As we know, that doesn't make sense. Are we
The team I'm on currently uses pip-installed PySpark for local development,
and we regularly access S3 directly from our laptops/workstations.
One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
being able to use a recent version of hadoop-aws that has mature support
for s3a.
Will PySpark users care much about the Hadoop version? They won't if running
locally. They will if connecting to a Hadoop cluster. Then again, in that
context, they're probably using a distro anyway that harmonizes it.
Hadoop 3's installed base can't be that large yet; it's been around far
less time.
I think we just need to provide two options and let end users choose the
ones they need: Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make PySpark
Hadoop 3.2+ variant available in PyPI) is a high-priority task for the Spark
3.1 release to me.
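One possible shape for letting users make that choice at install time; this is purely illustrative, since neither the mechanism nor the variable name below was settled in this thread:
```
# Hypothetical install-time switch between the two bundled-Hadoop variants
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark   # stay on the Hadoop 2.7 build
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark   # opt in to the Hadoop 3.2 build
```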
I do not know how to track the popularity of Hadoop 2 vs
I fully understand your concern, but we cannot live with Hadoop 2.7.4
forever, Xiao. Like Hadoop 2.6, we should let it go.
So, are you saying that CRAN/PyPI should have all combinations of Apache
Spark, including a Hive 1.2 distribution?
What is your suggestion, as a PMC member, on the Hadoop 3.2 migration path?
So, we also release Spark binary distros with Hadoop 2.7, 3.2, and no
Hadoop -- all of the options. Picking one profile or the other to release
with pypi etc isn't more or less consistent with those releases, as all
exist.
Is this change only about the source code default, with no effect on
Then it will be a little complex after this PR; it might make the
community more confused.
In PyPI and CRAN we would be using Hadoop 2.7 as the default profile, while
in the other distributions we would be using Hadoop 3.2 as the default?
How do we explain this to the community? I would not change the
Thanks. Uploading PySpark to PyPI is a simple manual step, and our release
script can still build PySpark with Hadoop 2.7 if we want.
So, `No` for the following question. I updated my PR according to your
comment.
> If we change the default, will it impact them? If YES,...
From the
Our monthly pypi downloads of PySpark have reached 5.4 million. We should
avoid forcing the current PySpark users to upgrade their Hadoop versions.
If we change the default, will it impact them? If YES, I think we should
not do it until it is ready and they have a workaround. So far, our pypi
Hi, All.
I'm bumping this thread again with the title "Use Hadoop-3.2 as a default
Hadoop profile in 3.1.0?"
There is some recent discussion on the following PR. Please let us know
your thoughts.
https://github.com/apache/spark/pull/28897
Bests,
Dongjoon.
On Fri, Nov 1, 2019 at 9:41 AM
On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian wrote:
> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All the other spark-* artifacts published to
>
the org.spark-project hive 1.2 will need a solution.
It is old and rather buggy; and it's been *years*.
I think we should decouple the Hive change from everything else if people
are concerned?
From: Steve Loughran
Sent: Sunday, November 17, 2019 9:22:09 AM
To: Cheng Lian
Cc: Sean Owen; Wenchen Fan; Dongjoon Hyun; dev; Yuming Wang
Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
Can I take this moment to remind everyone that the version of Hive which
Spark has historically bundled (the org.spark-project one) is an orphan
project put together to deal with Hive's shading issues and a source of
unhappiness in the Hive project. Whatever gets shipped should do its best
to
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
seemed risky, and therefore we only introduced Hive 2.3 under the
hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
here...
I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
than introduce yet another build combination. Does Hadoop 2 + Hive 2
work and is there demand for it?
On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan wrote:
Do we have a limitation on the number of pre-built distributions? It seems
this time we need:
1. hadoop 2.7 + hive 1.2
2. hadoop 2.7 + hive 2.3
3. hadoop 3 + hive 2.3
AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't
need to add the JDK version to the combination (see the build sketch below).
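A rough sketch of how those three combinations could map onto build commands when cutting pip-installable distributions. The hadoop-2.7/hadoop-3.2 profiles exist in the build; the hive-1.2/hive-2.3 profile names are assumed from the proposal in this thread, so double-check the flags against the actual build:
```
# Illustrative only; the Hive profile names follow the proposal above and may differ.
./dev/make-distribution.sh --name hadoop2.7-hive1.2 --pip --tgz -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 -Phive-1.2
./dev/make-distribution.sh --name hadoop2.7-hive2.3 --pip --tgz -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 -Phive-2.3
./dev/make-distribution.sh --name hadoop3.2-hive2.3 --pip --tgz -Pyarn -Phive -Phive-thriftserver -Phadoop-3.2 -Phive-2.3
```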
Thank you for the suggestion.
Having a `hive-2.3` profile sounds good to me because it's orthogonal to
Hadoop 3.
IIRC, originally it was proposed that way, but we put it under
`hadoop-3.2` to avoid adding new profiles at that time.
And I'm wondering if you are considering additional pre-built
Cc Yuming, Steve, and Dongjoon
On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian wrote:
Similar to Xiao, my major concern about making Hadoop 3.2 the default
Hadoop version is quality control. The current hadoop-3.2 profile covers
too many major component upgrades, i.e.:
- Hadoop 3.2
- Hive 2.3
- JDK 11
We have already found and fixed some feature and performance
I get that CDH and HDP backport a lot and in that way left 2.7 behind, but
they kept the public APIs stable at the 2.7 level, because that's kind of
the point. Aren't those the Hadoop APIs Spark uses?
On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran wrote:
I'd move Spark's branch-2 line to 2.9.x, as:
(a) Spark's version of httpclient hits a bug in the AWS SDK used in
hadoop-2.8 unless you revert that patch:
https://issues.apache.org/jira/browse/SPARK-22919
(b) there's only one future version of 2.8.x planned, which is expected once
myself or someone
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran wrote:
> It would be really good if the spark distributions shipped with later
> versions of the hadoop artifacts.
>
I second this. If we need to keep a Hadoop 2.x profile around, why not make
it Hadoop 2.8 or something newer?
The changes for JDK 11 support are not increasing the risk of the Hadoop 3.2
profile.
Hive 1.2.1 execution JARs are much more stable than Hive 2.3.6 execution
JARs. The changes to the thrift-server are massive. We need more evidence to
prove the quality and stability before we switch the default to
Yes, I am not against Hadoop 3 becoming the default; I was just questioning
the statement that we are close to dropping support for Hadoop 2.
We build our own Spark releases that we deploy on the clusters of our
clients. These clusters are HDP 2.x, CDH 5, EMR, Dataproc, etc.
I am aware that
Hi, Koert.
Could you be more specific about your Hadoop version requirement?
Although we will have a Hadoop 2.7 profile, Hadoop 2.6 and older support is
already officially dropped in Apache Spark 3.0.0. We cannot give you an
answer for Hadoop 2.6 and older version clusters because we are not
I don't see how we can be close to the point where we don't need to support
Hadoop 2.x. This does not agree with the reality from my perspective, which
is that all our clients are on Hadoop 2.x; not a single one is on Hadoop
3.x currently. This includes deployments of Cloudera distros, Hortonworks
Hi, Xiao.
How can JDK 11 support make the `hadoop-3.2` profile risky? We build and
publish with JDK 8.
> In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
> only.
Since we build and publish with JDK 8 and the
+1 for Hadoop 3.2. It seems a lot of the cloud integration effort Steve made
is only available in 3.2. We see lots of users asking for better S3A support
in Spark.
On Fri, Nov 1, 2019 at 9:46 AM Xiao Li wrote:
Hi, Steve,
Thanks for your comments! My major quality concern is not against Hadoop
3.2. In this release, the Hive execution module upgrade [from 1.2 to 2.3],
the Hive thrift-server upgrade, and JDK 11 support are added to the Hadoop
3.2 profile only. Compared with the Hadoop 2.x profile, the Hadoop 3.2 profile
What is the current default value? The 2.x releases are becoming EOL:
2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
inevitably be surprises.
One issue with using older versions is that any
Thank you for the feedback, Sean and Xiao.
Bests,
Dongjoon.
On Mon, Oct 28, 2019 at 12:52 PM Xiao Li wrote:
The stability and quality of the Hadoop 3.2 profile are unknown. The changes
are massive, including Hive execution and a new version of Hive
thriftserver.
To reduce the risk, I would like to keep the current default version
unchanged. When it becomes stable, we can change the default profile to
I'm OK with that, but don't have a strong opinion or info about the
implications.
That said, my guess is we're close to the point where we don't need to
support Hadoop 2.x anyway, so, yeah.
On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun wrote:
Hi, All.
There was a discussion on publishing artifacts built with Hadoop 3.
But we are still publishing with Hadoop 2.7.3, and `3.0-preview` will be
the same because we didn't change anything yet.
Technically, we need to change two places for publishing.
1. Jenkins Snapshot Publishing