Re: Hadoop 3 support
> On 16 Oct 2018, at 22:06, t4 wrote:
>
> has anyone got spark jars working with hadoop3.1 that they can share? i am
> looking to be able to use the latest hadoop-aws fixes from v3.1

We do, but with:

* a patched Hive JAR
* building Spark with the -Phive,yarn,hadoop-3.1,hadoop-cloud,kinesis profiles to pull in the object store stuff *while leaving out the things which cause conflict*
* some extra stuff to wire up the zero-rename committer

W.r.t. hadoop-aws, the hadoop-2.9 artifacts have the shaded AWS JAR: 50 MB of .class files to avoid Jackson dependency pain, and an early version of S3Guard. For the new commit stuff you will need to go to Hadoop 3.1.

-steve
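For anyone trying to do the committer wiring themselves: the glue is roughly the spark-defaults.conf settings below. This is a sketch based on the spark-hadoop-cloud module as it stood around Hadoop 3.1; the committer name ("directory" vs "partitioned" vs "magic") and the exact class names depend on the build you end up with, so treat it as a starting point rather than a recipe.

```properties
# Pick an S3A committer instead of the rename-based FileOutputCommitter
spark.hadoop.fs.s3a.committer.name        directory

# Route Spark's commit protocol through the PathOutputCommitter support
# shipped in the hadoop-cloud module
spark.sql.sources.commitProtocolClass     org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class  org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```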
Re: Hadoop 3 support
See the discussion at https://github.com/apache/spark/pull/21588

On Wed, 17 Oct 2018 at 05:06, t4 wrote:

> has anyone got spark jars working with hadoop3.1 that they can share? i am
> looking to be able to use the latest hadoop-aws fixes from v3.1
Re: Hadoop 3 support
has anyone got spark jars working with hadoop3.1 that they can share? i am looking to be able to use the latest hadoop-aws fixes from v3.1
Re: Hadoop 3 support
What would be the strategy with hive? Cherry pick patches? Update to a more "modern" version (like 2.3)? I know of a few critical schema evolution fixes that we could port to hive 1.2.1-spark.

_____________________________
From: Steve Loughran <ste...@hortonworks.com>
Sent: Tuesday, April 3, 2018 1:33 PM
Subject: Re: Hadoop 3 support
To: Apache Spark Dev <dev@spark.apache.org>

> On 3 Apr 2018, at 01:30, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> Yes, the main blocking issue is the hive version used in Spark
> (1.2.1.spark) doesn't support running on Hadoop 3. Hive checks the Hadoop
> version at runtime [1]. Besides this I think some pom changes should be
> enough to support Hadoop 3. If we want to use the Hadoop 3 shaded client
> jar, then the pom requires lots of changes, but this is not necessary.
>
> [1] https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144
>
> 2018-04-03 4:57 GMT+08:00 Marcelo Vanzin <van...@cloudera.com>:
>> Saisai filed SPARK-23534, but the main blocking issue is really
>> SPARK-18673.
>>
>> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin <r...@databricks.com> wrote:
>>> Does anybody know what needs to be done in order for Spark to support
>>> Hadoop 3?

To be ruthless, I'd view Hadoop 3.1 as the first one to play with... 3.0.x was more of a wide-version check. Hadoop 3.1 RC0 is out this week, making it the ideal (last!) time to find showstoppers.

1. I've got a PR which adds a profile to build Spark against Hadoop 3, with some fixes for the ZK import along with a better hadoop-cloud profile: https://github.com/apache/spark/pull/20923

Apply that patch and both mvn and sbt can build with the RC0 from the ASF staging repo:

  build/sbt -Phadoop-3,hadoop-cloud,yarn -Psnapshots-and-staging

2. Everything Marcelo says about hive.
You can build Hadoop locally with a -Dhadoop.version=2.11 and the Hive 1.2.1-spark version check goes through. You can't safely bring up HDFS like that, but you can run Spark standalone against things.

Some strategies:

Short term: build a new hive-1.2.x-spark which fixes up the version check and merges in those critical patches that Cloudera, Hortonworks, Databricks, and anyone else has got in for their production systems. I don't think we have that many. That leaves a "how to release" story, as the ASF will want it to come out under ASF auspices, and, given the liability disclaimers, so should everyone. The Hive team could be "invited" to publish it as their own if people ask nicely.

Long term:
- do something about that subclassing to get the thrift endpoint to work.
  That can include fixing Hive's service to be subclass friendly.
- move to Hive 2. That's a major piece of work.
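As an aside, the build invocation from point 1 above, spelled out for both sbt and mvn. The profile names come from the unmerged PR and may change before it lands, so check the PR before copying this:

```shell
# sbt build against the Hadoop 3.1 RC0 from the ASF staging repo
build/sbt -Phadoop-3,hadoop-cloud,yarn -Psnapshots-and-staging package

# roughly equivalent maven invocation
build/mvn -Phadoop-3,hadoop-cloud,yarn -Psnapshots-and-staging \
    -DskipTests clean package
```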
Re: Hadoop 3 support
> On 3 Apr 2018, at 01:30, Saisai Shao wrote:
>
> Yes, the main blocking issue is the hive version used in Spark
> (1.2.1.spark) doesn't support running on Hadoop 3. Hive checks the Hadoop
> version at runtime [1]. Besides this I think some pom changes should be
> enough to support Hadoop 3. If we want to use the Hadoop 3 shaded client
> jar, then the pom requires lots of changes, but this is not necessary.
>
> [1] https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144

I don't think the hadoop-shaded JAR is complete enough for Spark yet... it was very much driven by HBase's needs. But there's only one way to get Hadoop to fix that: try the move, find the problems, complain noisily. Then Hadoop 3.2 and/or a 3.1.x for x>=1 can have the broader shading.

Assume my name is next to the "Shade hadoop-cloud-storage" problem, though given that aws-java-sdk-bundle is 50 MB already, I don't plan to shade that at all. The AWS shading already isolates everything from Amazon's choice of Jackson, which was one of the sore points.

-Steve
Re: Hadoop 3 support
Yes, the main blocking issue is the hive version used in Spark (1.2.1.spark) doesn't support running on Hadoop 3. Hive checks the Hadoop version at runtime [1]. Besides this I think some pom changes should be enough to support Hadoop 3. If we want to use the Hadoop 3 shaded client jar, then the pom requires lots of changes, but this is not necessary.

[1] https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144

2018-04-03 4:57 GMT+08:00 Marcelo Vanzin:

> Saisai filed SPARK-23534, but the main blocking issue is really
> SPARK-18673.
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> > Does anybody know what needs to be done in order for Spark to support
> > Hadoop 3?
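To make the runtime check concrete: the ShimLoader code at [1] maps the Hadoop major version it finds on the classpath to a shim layer, and throws for any major version it doesn't recognize. The toy shell reproduction below is paraphrased from memory of that Java code, not a line-for-line port, but it shows why Hive 1.2.1 dies on startup under Hadoop 3 while any 2.x version string passes:

```shell
# Mimics Hive 1.2's ShimLoader.getMajorVersion(): switch on the Hadoop
# major version, throw on anything unknown (i.e. 3.x).
major_version() {
  case "$1" in
    1.*) echo "0.20S" ;;   # Hadoop 1.x -> the secure 0.20 shims
    2.*) echo "0.23"  ;;   # Hadoop 2.x -> the 0.23 shims
    *)   echo "Unrecognized Hadoop major version number: $1" >&2
         return 1 ;;
  esac
}

major_version "2.7.3"      # -> 0.23, so any Hadoop 2.x is accepted
major_version "3.1.0" \
  || echo "Hadoop 3 is rejected at startup"
```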
Re: Hadoop 3 support
I haven't looked at it in detail... Somebody's been trying to do that in https://github.com/apache/spark/pull/20659, but that's kind of a huge change.

The parts where I'd be concerned are:

- using Hive's original hive-exec package brings in a bunch of shaded
  dependencies, which may break Spark in weird ways. HIVE-16391 was supposed
  to fix that but nothing has really been done as part of that bug.
- the hive-exec "core" package avoids the shaded dependencies but used to
  have issues of its own. Maybe it's better now; I haven't looked.
- what about the current thrift server, which is basically a fork of the
  Hive 1.2 source code?
- when using Hadoop 3 + an old metastore client that doesn't know about
  Hadoop 3, things may break.

The latter one has two possible fixes: say that Hadoop 3 builds of Spark don't support old metastores; or add code so that Spark loads a separate copy of the Hadoop libraries in that case (search for "sharesHadoopClasses" in IsolatedClientLoader for where to start with that).

If trying to update Hive it would be good to avoid having to fork it, like it's done currently. But I'm not sure that will be possible given the current hive-exec packaging.

On Mon, Apr 2, 2018 at 2:58 PM, Reynold Xin wrote:

> Is it difficult to upgrade the Hive execution version to the latest
> version? The metastore used to be an issue but now that part has been
> separated from the execution part.
>
> On Mon, Apr 2, 2018 at 1:57 PM, Marcelo Vanzin wrote:
>> Saisai filed SPARK-23534, but the main blocking issue is really
>> SPARK-18673.
>>
>> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
>>> Does anybody know what needs to be done in order for Spark to support
>>> Hadoop 3?

--
Marcelo
Re: Hadoop 3 support
Is it difficult to upgrade the Hive execution version to the latest version? The metastore used to be an issue but now that part has been separated from the execution part.

On Mon, Apr 2, 2018 at 1:57 PM, Marcelo Vanzin wrote:

> Saisai filed SPARK-23534, but the main blocking issue is really
> SPARK-18673.
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> > Does anybody know what needs to be done in order for Spark to support
> > Hadoop 3?
>
> --
> Marcelo
Re: Hadoop 3 support
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.

On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:

> Does anybody know what needs to be done in order for Spark to support
> Hadoop 3?

--
Marcelo
Re: Hadoop 3 support
That's just a nice to have improvement, right? I'm more curious what is the minimal amount of work required to support 3.0, without all the bells and whistles. (Of course we can also do the bells and whistles, but those would come after we can actually get 3.0 running.)

On Mon, Apr 2, 2018 at 1:50 PM, Mridul Muralidharan wrote:

> Specifically to run spark with hadoop 3 docker support, I have filed a few
> jiras tracked under [1].
>
> Regards,
> Mridul
>
> [1] https://issues.apache.org/jira/browse/SPARK-23717
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> > Does anybody know what needs to be done in order for Spark to support
> > Hadoop 3?
Re: Hadoop 3 support
Specifically to run spark with hadoop 3 docker support, I have filed a few jiras tracked under [1].

Regards,
Mridul

[1] https://issues.apache.org/jira/browse/SPARK-23717

On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:

> Does anybody know what needs to be done in order for Spark to support
> Hadoop 3?
Hadoop 3 support
Does anybody know what needs to be done in order for Spark to support Hadoop 3?