hadoop-aws versions (was Re: [VOTE] Spark 2.3.1 (RC4))

Steve Loughran Tue, 26 Jun 2018 07:59:52 -0700

Following up after a reference to this thread in
https://issues.apache.org/jira/browse/HADOOP-15559


The AWS SDK is a very fast-moving project, with a release cycle of ~2 weeks, 
but it's in the state Fred Brooks described: "the number of bugs is constant, 
they just move around". Bumping up an AWS release is always fun 
(https://issues.apache.org/jira/browse/HADOOP-14596), and usually results in 
one or more issues being raised with the AWS SDK project, us doing a 
workaround, and then a later release fixing that while adding something new.


On 2 Jun 2018, at 02:51, Nicholas Chammas 
<nicholas.cham...@gmail.com> wrote:


pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me either 
(even building with -Phadoop-2.7). I guess I’ve been relying on an unsupported 
pattern and will need to figure something else out going forward in order to 
use s3a://.


Ideally the ASF releases should be built with the -Phadoop-cloud option, just 
to get the relevant spark-hadoop-cloud module into the ASF repo. At that point 
you could just depend on it and get not just the things you need, but none of 
the things you don't, which is always the second half of the 
working-with-transitive-dependencies problem.

For Hadoop 2.8.x & Spark 2.3, you'll need to:

* keep all the hadoop-* JARs at the same version
* use the AWS SDK version that this hadoop-aws release was built against: 
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.8.2
* revert the httpclient updates of SPARK-22919
* exclude any declared jackson dependencies of the hadoop-aws/AWS SDK modules 
(HADOOP-13692)
* make sure joda-time >= 2.8.1 is on the classpath, else you can't authenticate 
with AWS on a JVM >= 8u51
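
Two of the checks above (consistent hadoop-* versions, joda-time >= 2.8.1) can 
be roughly automated. What follows is only a minimal sketch, not part of Spark 
or Hadoop: the function names parse_jar and check_classpath are hypothetical, 
and it assumes the JARs follow the usual Maven "<artifact>-<version>.jar" 
naming and that every artifact starting with "hadoop-" is expected to share 
one version. It does not look at the AWS SDK, httpclient or jackson.

```python
import re

# Splits a JAR file name such as "hadoop-aws-2.8.2.jar" into artifact and version.
# Assumes Maven-style "<artifact>-<version>.jar" names; anything else is skipped.
JAR_NAME = re.compile(r"^(.+?)-(\d+(?:\.\d+)*)(?:[.\-].*)?\.jar$")


def parse_jar(name):
    """Return (artifact, version_tuple), or None if the name does not match."""
    match = JAR_NAME.match(name)
    if match is None:
        return None
    artifact, version = match.group(1), match.group(2)
    return artifact, tuple(int(part) for part in version.split("."))


def check_classpath(jar_names, min_joda=(2, 8, 1)):
    """Return a list of human-readable problems found in the JAR names."""
    problems = []
    hadoop_versions = {}  # version tuple -> list of hadoop-* artifacts
    joda_version = None

    for name in jar_names:
        parsed = parse_jar(name)
        if parsed is None:
            continue
        artifact, version = parsed
        if artifact.startswith("hadoop-"):
            hadoop_versions.setdefault(version, []).append(artifact)
        elif artifact == "joda-time":
            joda_version = version

    # All hadoop-* artifacts are expected to share a single version.
    if len(hadoop_versions) > 1:
        found = ", ".join(
            ".".join(str(n) for n in version) for version in sorted(hadoop_versions)
        )
        problems.append("mixed hadoop-* versions: " + found)

    # joda-time must be present and at least min_joda.
    if joda_version is None:
        problems.append("joda-time not found")
    elif joda_version < min_joda:
        problems.append("joda-time is older than required")

    return problems


if __name__ == "__main__":
    # Hypothetical listing of a Spark jars directory.
    example = ["hadoop-common-2.7.3.jar", "hadoop-aws-2.8.2.jar", "joda-time-2.7.jar"]
    for problem in check_classpath(example):
        print(problem)
```

Feeding it the file names from the jars directory of a Spark distribution 
(for example via os.listdir) is enough for a first sanity check; it is no 
substitute for inspecting the resolved dependency tree.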

This is why Hadoop 2.9+ has moved to a (very fat) shaded AWS SDK JAR; you only 
need to get the hadoop-* and aws-sdk-bundle JARs in sync, at least provided the 
shaded JAR doesn't actually declare dependencies of its own 
(https://issues.apache.org/jira/browse/HADOOP-15264). We feel that pain too, 
you see.

Anyway, sorry to hear of your suffering.

Nicholas, ping me directly if you are trying to debug things here.

-steve



On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin 
<van...@cloudera.com> wrote:
I have personally never tried to include hadoop-aws that way. But at
the very least, I'd try to use the same version of Hadoop as the Spark
build (2.7.3 IIRC). I don't really expect a different version to work,
and if it did in the past it definitely was not by design.

On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> Building with -Phadoop-2.7 didn’t help, and if I remember correctly,
> building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release, so
> it appears something has changed since then.
>
> I wasn’t familiar with -Phadoop-cloud, but I can try that.
>
> My goal here is simply to confirm that this release of Spark works with
> hadoop-aws like past releases did, particularly for Flintrock users who use
> Spark with S3A.
>
> We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop builds with
> every Spark release. If the -hadoop2.7 release build won’t work with
> hadoop-aws anymore, are there plans to provide a new build type that will?
>
> Apologies if the question is poorly formed. I’m batting a bit outside my
> league here. Again, my goal is simply to confirm that I/my users still have
> a way to use s3a://. In the past, that way was simply to call pyspark
> --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very similar. If
> that will no longer work, I’m trying to confirm that the change of behavior
> is intentional or acceptable (as a review for the Spark project) and figure
> out what I need to change (as due diligence for Flintrock’s users).
>
> Nick
>
>
> On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin 
> <van...@cloudera.com> wrote:
>>
>> Using the hadoop-aws package is probably going to be a little more
>> complicated than that. The best bet is to use a custom build of Spark
>> that includes it (use -Phadoop-cloud). Otherwise you're probably
>> looking at some nasty dependency issues, especially if you end up
>> mixing different versions of Hadoop.
>>
>> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4
>> > using Flintrock. However, trying to load the hadoop-aws package gave me
>> > some errors.
>> >
>> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
>> >
>> > <snipped>
>> >
>> > :: problems summary ::
>> > :::: WARNINGS
>> >                 [NOT FOUND  ] com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
>> >         ==== local-m2-cache: tried
>> >           file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
>> >                 [NOT FOUND  ] com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
>> >         ==== local-m2-cache: tried
>> >           file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
>> >                 [NOT FOUND  ] org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
>> >         ==== local-m2-cache: tried
>> >           file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar
>> >                 [NOT FOUND  ] com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)
>> >         ==== local-m2-cache: tried
>> >           file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar
>> >
>> > I’d guess I’m probably using the wrong version of hadoop-aws, but I called
>> > make-distribution.sh with -Phadoop-2.8 so I’m not sure what else to try.
>> >
>> > Any quick pointers?
>> >
>> > Nick
>> >
>> >
>> > On Fri, Jun 1, 2018 at 6:29 PM Marcelo Vanzin 
>> > <van...@cloudera.com> wrote:
>> >>
>> >> Starting with my own +1 (binding).
>> >>
>> >> On Fri, Jun 1, 2018 at 3:28 PM, Marcelo Vanzin 
>> >> <van...@cloudera.com> wrote:
>> >> > Please vote on releasing the following candidate as Apache Spark
>> >> > version 2.3.1.
>> >> >
>> >> > Given that I expect at least a few people to be busy with Spark
>> >> > Summit next week, I'm taking the liberty of setting an extended
>> >> > voting period. The vote will be open until Friday, June 8th, at
>> >> > 19:00 UTC (that's 12:00 PDT).
>> >> >
>> >> > It passes with a majority of +1 votes, which must include at least 3
>> >> > +1 votes from the PMC.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 2.3.1
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
>> >> > https://github.com/apache/spark/tree/v2.3.1-rc4
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found
>> >> > at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>> >> >
>> >> > Signatures used for Spark RCs can be found in this file:
>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >
>> >> > The staging repository for this release can be found at:
>> >> >
>> >> > https://repository.apache.org/content/repositories/orgapachespark-1272/
>> >> >
>> >> > The documentation corresponding to this release can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>> >> >
>> >> > The list of bug fixes going into 2.3.1 can be found at the following
>> >> > URL:
>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12342432
>> >> >
>> >> > FAQ
>> >> >
>> >> > =========================
>> >> > How can I help test this release?
>> >> > =========================
>> >> >
>> >> > If you are a Spark user, you can help us test this release by taking
>> >> > an existing Spark workload and running it on this release candidate,
>> >> > then reporting any regressions.
>> >> >
>> >> > If you're working in PySpark, you can set up a virtual env, install
>> >> > the current RC, and see if anything important breaks. In Java/Scala,
>> >> > you can add the staging repository to your project's resolvers and
>> >> > test with the RC (make sure to clean up the artifact cache
>> >> > before/after so you don't end up building with an out-of-date RC
>> >> > going forward).
>> >> >
>> >> > ===========================================
>> >> > What should happen to JIRA tickets still targeting 2.3.1?
>> >> > ===========================================
>> >> >
>> >> > The current list of open tickets targeted at 2.3.1 can be found at:
>> >> > https://s.apache.org/Q3Uo
>> >> >
>> >> > Committers should look at those and triage. Extremely important bug
>> >> > fixes, documentation, and API tweaks that impact compatibility should
>> >> > be worked on immediately. Everything else please retarget to an
>> >> > appropriate release.
>> >> >
>> >> > ==================
>> >> > But my bug isn't fixed?
>> >> > ==================
>> >> >
>> >> > In order to make timely releases, we will typically not hold the
>> >> > release unless the bug in question is a regression from the previous
>> >> > release. That being said, if there is something which is a regression
>> >> > that has not been correctly targeted please ping me or a committer to
>> >> > help target the issue.
>> >> >
>> >> >
>> >> > --
>> >> > Marcelo
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>>
>>
>>
>> --
>> Marcelo



--
Marcelo
