Re: Hadoop 3 support

2018-10-23 Thread Steve Loughran



> On 16 Oct 2018, at 22:06, t4 wrote:
> 
> has anyone got spark jars working with hadoop3.1 that they can share? i am
> looking to be able to use the latest  hadoop-aws fixes from v3.1

we do, but only with

* a patched hive JAR
* building spark with the -Phive,yarn,hadoop-3.1,hadoop-cloud,kinesis profiles to 
pull in the object store stuff *while leaving out the things which cause 
conflict*
* some extra stuff to wire up the 0-rename-committer

w.r.t. hadoop-aws, the hadoop-2.9 artifacts have the shaded AWS JAR (50 MB of 
.class to avoid jackson dependency pain) and an early version of S3Guard. For 
the new commit stuff you will need to go to hadoop 3.1
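
For the committer wiring, a rough sketch of the kind of settings involved. It assumes the option and class names given in the later Spark cloud-integration docs and the Hadoop 3.1 S3A committer docs; whether they match the patched build described here is an assumption. The helper `to_submit_args` is hypothetical and only renders the options as spark-submit flags.

```python
# Assumed settings for an S3A "zero rename" committer. The names come from the
# Hadoop S3A committer and Spark cloud-integration docs, not from this thread.
COMMITTER_CONF = {
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_submit_args(conf):
    """Render a conf dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(COMMITTER_CONF)))
```

These classes live in the hadoop-cloud module, so they are only on the classpath when Spark is built with that profile.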

-steve





-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Hadoop 3 support

2018-10-17 Thread Hyukjin Kwon
See the discussion at https://github.com/apache/spark/pull/21588

On Wed, 17 Oct 2018 at 5:06 AM, t4 wrote:

> has anyone got spark jars working with hadoop3.1 that they can share? i am
> looking to be able to use the latest  hadoop-aws fixes from v3.1


Re: Hadoop 3 support

2018-10-16 Thread t4
has anyone got spark jars working with hadoop3.1 that they can share? i am
looking to be able to use the latest  hadoop-aws fixes from v3.1



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/




Re: Hadoop 3 support

2018-04-04 Thread Felix Cheung
What would be the strategy with hive? Cherry pick patches? Update to more 
“modern” versions (like 2.3?)

I know of a few critical schema evolution fixes that we could port to hive 
1.2.1-spark


_
From: Steve Loughran <ste...@hortonworks.com>
Sent: Tuesday, April 3, 2018 1:33 PM
Subject: Re: Hadoop 3 support
To: Apache Spark Dev <dev@spark.apache.org>




On 3 Apr 2018, at 01:30, Saisai Shao <sai.sai.s...@gmail.com> wrote:

Yes, the main blocking issue is that the hive version used in Spark (1.2.1.spark) 
doesn't support running on Hadoop 3. Hive checks the Hadoop version at 
runtime [1]. Besides this, I think some pom changes should be enough to support 
Hadoop 3.

If we want to use the Hadoop 3 shaded client jar, then the pom requires lots of 
changes, but this is not necessary.


[1] 
https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144

2018-04-03 4:57 GMT+08:00 Marcelo Vanzin <van...@cloudera.com>:
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.


On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin <r...@databricks.com> wrote:
> Does anybody know what needs to be done in order for Spark to support Hadoop
> 3?
>


To be ruthless, I'd view Hadoop 3.1 as the first one to play with...3.0.x was 
more of a wide-version check. Hadoop 3.1RC0 is out this week, making it the 
ideal (last!) time to find showstoppers.

1. I've got a PR which adds a profile to build spark against hadoop 3, with 
some fixes for zk import along with a better hadoop-cloud profile

https://github.com/apache/spark/pull/20923


Apply that patch and both mvn and sbt can build with the RC0 from the ASF 
staging repo:

build/sbt -Phadoop-3,hadoop-cloud,yarn -Psnapshots-and-staging



2. Everything Marcelo says about hive.

You can build hadoop locally with -Dhadoop.version=2.11 and the hive 
1.2.1-spark version check goes through. You can't safely bring up HDFS like 
that, but you can run spark standalone against things
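
For anyone wondering why a 2.11 version string gets through: a minimal sketch, assuming the ShimLoader check linked as [1] above only switches on the major component of the version string. This is a Python paraphrase for illustration, not Hive's code; `pick_shim`, `accepted` and the shim names in the table are made up here.

```python
# Paraphrase of a major-version switch like the one in Hive's ShimLoader.
# The shim names and error text are assumptions, not copied from Hive.
SHIMS_BY_MAJOR = {1: "0.20S", 2: "0.23"}


def pick_shim(hadoop_version):
    """Return the shim name for a Hadoop version string, or raise if unknown."""
    parts = hadoop_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"Illegal Hadoop version: {hadoop_version}")
    major = int(parts[0])  # only the first component is inspected
    if major not in SHIMS_BY_MAJOR:
        raise ValueError(
            f"Unrecognized Hadoop major version number: {hadoop_version}")
    return SHIMS_BY_MAJOR[major]


def accepted(hadoop_version):
    """True if the version check passes for this version string."""
    try:
        pick_shim(hadoop_version)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for version in ("2.7.3", "2.11", "3.1.0"):
        print(version, accepted(version))
```

Under that assumption, relabelling a Hadoop 3 build as 2.11 passes because the check sees a major version of 2, while an honest 3.1.0 is rejected.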

Some strategies

Short term: build a new hive-1.2.x-spark which fixes up the version check and 
merges in those critical patches that cloudera, hortonworks, databricks, + 
anyone else have got in for their production systems. I don't think we have that 
many.

That leaves a "how to release" story, as the ASF will want it to come out under 
the ASF auspices, and, given the liability disclaimers, so should everyone. The 
Hive team could be "invited" to publish it as their own if people ask nicely.

Long term
 -do something about that subclassing to get the thrift endpoint to work. That 
can include fixing hive's service to be subclass friendly.
 -move to hive 2

That's a major piece of work.




Re: Hadoop 3 support

2018-04-03 Thread Steve Loughran


On 3 Apr 2018, at 01:30, Saisai Shao wrote:

Yes, the main blocking issue is that the hive version used in Spark (1.2.1.spark) 
doesn't support running on Hadoop 3. Hive checks the Hadoop version at 
runtime [1]. Besides this, I think some pom changes should be enough to support 
Hadoop 3.

If we want to use the Hadoop 3 shaded client jar, then the pom requires lots of 
changes, but this is not necessary.


[1] 
https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144


I don't think the hadoop-shaded JAR is complete enough for spark yet...it was 
very much driven by HBase's needs. But there's only one way to get Hadoop to 
fix that: try the move, find the problems, complain noisily. Then Hadoop 3.2 
and/or a 3.1.x for x>=1 can have the broader shading

Assume my name is next to the "Shade hadoop-cloud-storage" problem, though 
given that aws-java-sdk-bundle is 50 MB already, I don't plan to shade 
that at all. The AWS shading already isolates everything from amazon's choice 
of Jackson, which was one of the sore points.

-Steve


Re: Hadoop 3 support

2018-04-02 Thread Saisai Shao
Yes, the main blocking issue is that the hive version used in Spark
(1.2.1.spark) doesn't support running on Hadoop 3. Hive checks the Hadoop
version at runtime [1]. Besides this, I think some pom changes should be
enough to support Hadoop 3.

If we want to use the Hadoop 3 shaded client jar, then the pom requires lots of
changes, but this is not necessary.


[1]
https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144

2018-04-03 4:57 GMT+08:00 Marcelo Vanzin:

> Saisai filed SPARK-23534, but the main blocking issue is really
> SPARK-18673.
>
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> > Does anybody know what needs to be done in order for Spark to support
> > Hadoop 3?
> >
>
>
>
> --
> Marcelo
>


Re: Hadoop 3 support

2018-04-02 Thread Marcelo Vanzin
I haven't looked at it in detail...

Somebody's been trying to do that in
https://github.com/apache/spark/pull/20659, but that's kind of a huge
change.

The parts where I'd be concerned are:
- using Hive's original hive-exec package brings in a bunch of shaded
dependencies, which may break Spark in weird ways. HIVE-16391 was
supposed to fix that but nothing has really been done as part of that
bug.
- the hive-exec "core" package avoids the shaded dependencies but used
to have issues of its own. Maybe it's better now, haven't looked.
- what about the current thrift server which is basically a fork of
the Hive 1.2 source code?
- when using Hadoop 3 + an old metastore client that doesn't know
about Hadoop 3, things may break.

The last one has two possible fixes: say that Hadoop 3 builds of
Spark don't support old metastores; or add code so that Spark loads a
separate copy of Hadoop libraries in that case (search for
"sharesHadoopClasses" in IsolatedClientLoader for where to start with
that).
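
The two options for the old-metastore case can be sketched as a small decision function. This is only an illustration: `plan_metastore_client`, its parameters and the returned labels are hypothetical and are not Spark APIs, and whether a given metastore client understands Hadoop 3 is passed in as a flag rather than asserted here.

```python
def plan_metastore_client(hadoop_major, client_knows_hadoop3,
                          allow_isolated_hadoop):
    """Decide how a metastore client could be loaded (illustrative only).

    hadoop_major: major version of the Hadoop that Spark was built against.
    client_knows_hadoop3: whether the chosen metastore client accepts Hadoop 3.
    allow_isolated_hadoop: whether loading a separate Hadoop copy is permitted.
    """
    if hadoop_major < 3 or client_knows_hadoop3:
        # No conflict: the client can use the same Hadoop classes as Spark.
        return "share-hadoop-classes"
    if allow_isolated_hadoop:
        # Second fix: give the old client its own copy of the Hadoop libraries.
        return "isolated-hadoop-copy"
    # First fix: declare old metastores unsupported on Hadoop 3 builds.
    return "unsupported"


if __name__ == "__main__":
    print(plan_metastore_client(3, False, True))
```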

If trying to update Hive it would be good to avoid having to fork it,
like it's done currently. But not sure that will be possible given the
current hive-exec packaging.

On Mon, Apr 2, 2018 at 2:58 PM, Reynold Xin wrote:
> Is it difficult to upgrade Hive execution version to the latest version? The
> metastore used to be an issue but now that part has been separated from the
> execution part.
>
>
> On Mon, Apr 2, 2018 at 1:57 PM, Marcelo Vanzin wrote:
>>
>> Saisai filed SPARK-23534, but the main blocking issue is really
>> SPARK-18673.
>>
>>
>> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
>> > Does anybody know what needs to be done in order for Spark to support
>> > Hadoop
>> > 3?
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo




Re: Hadoop 3 support

2018-04-02 Thread Reynold Xin
Is it difficult to upgrade Hive execution version to the latest version?
The metastore used to be an issue but now that part has been separated from
the execution part.


On Mon, Apr 2, 2018 at 1:57 PM, Marcelo Vanzin wrote:

> Saisai filed SPARK-23534, but the main blocking issue is really
> SPARK-18673.
>
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> > Does anybody know what needs to be done in order for Spark to support
> > Hadoop 3?
> >
>
>
>
> --
> Marcelo
>


Re: Hadoop 3 support

2018-04-02 Thread Marcelo Vanzin
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.


On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> Does anybody know what needs to be done in order for Spark to support Hadoop
> 3?
>



-- 
Marcelo




Re: Hadoop 3 support

2018-04-02 Thread Reynold Xin
That's just a nice-to-have improvement, right? I'm more curious what is the
minimal amount of work required to support 3.0, without all the bells and
whistles. (Of course we can also do the bells and whistles, but those would
come after we can actually get 3.0 running).


On Mon, Apr 2, 2018 at 1:50 PM, Mridul Muralidharan wrote:

> Specifically to run spark with hadoop 3 docker support, I have filed a
> few JIRAs tracked under [1].
>
> Regards,
> Mridul
>
> [1] https://issues.apache.org/jira/browse/SPARK-23717
>
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> > Does anybody know what needs to be done in order for Spark to support
> > Hadoop 3?
> >
>


Re: Hadoop 3 support

2018-04-02 Thread Mridul Muralidharan
Specifically to run spark with hadoop 3 docker support, I have filed a
few JIRAs tracked under [1].

Regards,
Mridul

[1] https://issues.apache.org/jira/browse/SPARK-23717


On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> Does anybody know what needs to be done in order for Spark to support Hadoop
> 3?
>




Hadoop 3 support

2018-04-02 Thread Reynold Xin
Does anybody know what needs to be done in order for Spark to support
Hadoop 3?