Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-21 Thread Sean Owen
On Fri, Nov 20, 2015 at 10:39 PM, Reynold Xin wrote:
> I don't think we should look at it from only a maintenance point of view --
> because in that case the answer is clearly supporting as few versions as
> possible (or just rm -rf spark source code and call it a day). It is a
> tradeoff between the number of users impacted and the maintenance burden.

The upside to supporting only newer versions is less maintenance (no
small thing given how sprawling the build is), but also more ability
to use newer functionality. The downside is of course not letting
older Hadoop users use the latest Spark.


> 1. Can Hadoop 2.6 client read Hadoop 2.4 / 2.3?

If the question is really about HDFS, then I think the answer is
"yes". The big compatibility problem has historically been protobuf,
but all of Hadoop 2.2+ is on protobuf 2.5.


> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?

Same client/server question? This is where I'm not as clear. I think
the answer is 'yes' to the extent you're using functionality that
existed in the older YARN. Of course, using a newer API against old
clusters doesn't work.


> 4. (for Hadoop vendors) When did/will support for Hadoop 2.4 and below stop?
> To what extent do you care about running Spark on older Hadoop clusters.

CDH 5.3 = Hadoop 2.6, FWIW, which came out about a year ago. Support
continues for a long time in the sense that CDH 5 will be supported
for years. However, Spark 2 would never be shipped / supported in CDH
5, so it's not an issue for Spark 2; Spark 2 will be "supported"
probably only against Hadoop 3, or at least something later in the
2.x line than 2.6.

The question here is really about whether Spark should specially
support, say, Spark 2 + CDH 5.0 or something. My experience so far is
that Spark has not really supported the older vendor versions it
claims to, and I'd rather not pretend it does. So this doesn't strike
me as a great reason either.

This is roughly why supporting, say, 2.6 as a safely recent version
seems like an OK place to draw the line 6-8 months from now.




Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-21 Thread Steve Loughran

> On 20 Nov 2015, at 21:39, Reynold Xin wrote:
> 
> OK I'm not exactly asking for a vote here :)
> 
> I don't think we should look at it from only a maintenance point of view --
> because in that case the answer is clearly supporting as few versions as 
> possible (or just rm -rf spark source code and call it a day). It is a 
> tradeoff between the number of users impacted and the maintenance burden.
> 
> So a few questions for those more familiar with Hadoop:
> 
> 1. Can Hadoop 2.6 client read Hadoop 2.4 / 2.3? 
> 

Yes, at the HDFS level.

There are some special cases where HDFS stops a 2.2-2.5 client talking to
Hadoop 2.6:


- HDFS at-rest encryption needs a client that can decode it (2.6.x+)
- HDFS erasure coding will need a later version (2.8?)

If you turn SASL on in your datanodes, your DNs don't need to come up on a
port < 1024, but Hadoop < 2.6 clients stop being able to work with HDFS at
that point.



> 2. If the answer to 1 is yes, are there known, major issues with backward 
> compatibility?
> 

Hadoop native libs, every time. Guava, Jackson and protobuf can be managed
with shading, but libhadoop.{so,dll} is a real problem. A hadoop-2.6 JAR will
use native methods in libhadoop which, if not loaded, will break the app.
This is a pain, as nobody includes that native lib with their Java binaries;
who could even predict which one they'd have to include? As a consequence,
I'd really advise against trying to run an app built with the 2.6 JARs inside
a YARN cluster < 2.6. You can certainly talk to HDFS and the YARN services,
but there's a risk a codepath will hit a native method that isn't there.
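
You can at least detect that situation up front. A minimal sketch in Scala:
NativeCodeLoader is a real hadoop-common API, but the fallback policy shown
is just an illustration, not what Spark itself does.

import org.apache.hadoop.util.NativeCodeLoader

// Probe for libhadoop before exercising codepaths that need it.
// isNativeCodeLoaded reports whether the JVM managed to load the
// native library when hadoop-common's classes were initialized.
object NativeLibProbe {
  def main(args: Array[String]): Unit = {
    if (NativeCodeLoader.isNativeCodeLoaded) {
      println("libhadoop loaded: native-backed features should work")
    } else {
      // Pure-Java fallbacks exist for some codecs; anything native-only
      // will fail later with UnsatisfiedLinkError.
      println("libhadoop NOT loaded: avoid native-only features")
    }
  }
}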


It's trouble the other way too: even though we try not to break existing
code by moving/renaming native methods, it can happen.

The last time someone did this in a big way, I was the first to find it, in
HADOOP-11064; the changes were reverted/altered, but there was no official
declaration that compatibility at the JNI layer will be maintained.
Apparently you can't guarantee it over JVM versions either.

We really need a lib versioning story, which is what HADOOP-11127 covers.

> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?
> 

I'd say no, with the classpath and Hadoop native libs being the failure points.

There's also feature completeness; Hadoop 2.6 was the first version with all
the YARN-896 work for long-lived services.


> 4. (for Hadoop vendors) When did/will support for Hadoop 2.4 and below stop? 
> To what extent do you care about running Spark on older Hadoop clusters.
> 
> 

I don't know, and I probably don't want to make any forward-looking
statements anyway. But I don't even know how well supported 2.4 is today;
2.6 is the one that still gets bug fixes out from the ASF. I can see it
lasting a while.


What essentially happens is that we provide bug fixes to the existing releases, 
but for anything new: upgrade.

Assuming that policy continues (disclaimer: personal opinions, etc), then any 
Spark 2.0 release would be rebuilt against all the JARs which the rest of that 
version of HDP would use, and that's the only version we'd recommend using.






Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Steve Loughran

On 19 Nov 2015, at 22:14, Reynold Xin wrote:

I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I think 
everybody is for that.

https://issues.apache.org/jira/browse/SPARK-11807

Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is to 
say, keep only Hadoop 2.6 and greater.

What are the community's thoughts on that?


+1

It's the common version underneath pretty much everything shipping: EMR,
CDH & HDP. And there are no significant API changes between it and 2.7.
[There are a couple of extra fields in job submissions in 2.7, which you can
get at with reflection, for the AM failure reset window and rolling log
capture patterns.] It's also getting some ongoing maintenance (2.6.3 being
planned for December).
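
For illustration, a hedged sketch of that reflection pattern in Scala. The
setter name below is the AM failure "reset window" field on
ApplicationSubmissionContext; treat exactly which Hadoop release first ships
it as something to verify per version, not a given.

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

object YarnSubmitCompat {
  // Invoke a YARN setter only if this Hadoop version has it; on older
  // versions the method is absent and the setting is simply skipped.
  def setFailureValidityIntervalIfSupported(
      ctx: ApplicationSubmissionContext,
      intervalMs: Long): Unit = {
    try {
      val m = ctx.getClass.getMethod(
        "setAttemptFailuresValidityInterval", classOf[Long])
      m.invoke(ctx, java.lang.Long.valueOf(intervalMs))
    } catch {
      case _: NoSuchMethodException =>
        // Older YARN API: nothing to set, the feature is unavailable.
    }
  }
}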

It's not perfect; if I were to list the trouble spots, to me they are: s3a
isn't ready for use, and there's better logging and tracing in later
versions. But those aren't at the API level.


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Steve Loughran

On 20 Nov 2015, at 14:28, ches...@alpinenow.com wrote:

Assuming we have 1.6 and 1.7 releases, then Spark 2.0 is about 9 months away.

Customers will need to upgrade their Hadoop clusters to Apache 2.6 or later
to leverage the new Spark 2.0 within a year. I think this is possible, as the
latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0 already.
Companies will have enough time to upgrade their clusters.

+1 for me as well

Chester


Now, if you are looking that far ahead, the other big issue is "when to
retire Java 7 support?"

That's a tough decision for all projects. Hadoop 3.x will be Java 8 only,
but nobody has committed the patch to the trunk codebase to force a Java 8
build, and most of *today's* Hadoop clusters are Java 7. But as you can't
even download a Java 7 JDK for the desktop from Oracle any more, 2016 is the
time to look at the language support and decide what the baseline version
should be.

Commentary from Twitter here; as they point out, it's not just the server
farm that matters, it's all the apps that talk to it:


http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3ccab7mwte+kefcxsr6n46-ztcs19ed7cwc9vobtr1jqewdkye...@mail.gmail.com%3E

-Steve


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Sandy Ryza
To answer your fourth question from Cloudera's perspective, we would never
support a customer running Spark 2.0 on a Hadoop version < 2.6.

-Sandy

On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin wrote:

> OK I'm not exactly asking for a vote here :)
> [...]


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Reynold Xin
OK I'm not exactly asking for a vote here :)

I don't think we should look at it from only a maintenance point of view --
because in that case the answer is clearly supporting as few versions as
possible (or just rm -rf spark source code and call it a day). It is a
tradeoff between the number of users impacted and the maintenance burden.

So a few questions for those more familiar with Hadoop:

1. Can Hadoop 2.6 client read Hadoop 2.4 / 2.3?

2. If the answer to 1 is yes, are there known, major issues with backward
compatibility?

3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?

4. (for Hadoop vendors) When did/will support for Hadoop 2.4 and below
stop? To what extent do you care about running Spark on older Hadoop
clusters.



On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran wrote:

> [...]


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Saisai Shao
+1.

Hadoop 2.6 would be a good choice, with many features added (like support
for long-running services and label-based scheduling). Currently there's a
lot of reflection code to support multiple versions of YARN, so upgrading to
a newer minimum version will really ease the pain :).
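
To make that concrete, here is a sketch of the other common version gate:
probing for a class that only newer YARN releases ship. LogAggregationContext
is used purely as an example of such an API; a fixed 2.6 baseline would let
checks like this be deleted.

object YarnFeatureProbe {
  // Gate a feature on whether this YARN version ships a given class.
  def yarnHasRollingLogAggregation: Boolean =
    try {
      Class.forName("org.apache.hadoop.yarn.api.records.LogAggregationContext")
      true
    } catch {
      case _: ClassNotFoundException => false
    }
}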

Thanks
Saisai

On Fri, Nov 20, 2015 at 3:58 PM, Jean-Baptiste Onofré wrote:

> +1
>
> Regards
> JB
> [...]
>


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Chester Chen
For #1-3, the answer is likely no.

  Recently we upgraded to Spark 1.5.1, with CDH 5.3, CDH 5.4, HDP 2.2 and
others.

  We were using the CDH 5.3 client to talk to CDH 5.4, to see if we could
support many different Hadoop cluster versions without changing the build.
This was OK in yarn-cluster mode with Spark 1.3.1, but we could not get Spark
1.5.1 started. Once we upgraded the client to CDH 5.4, everything worked.

  There are API changes between Apache 2.4 and 2.6; I'm not sure you can mix
and match them.

Chester


On Fri, Nov 20, 2015 at 1:59 PM, Sandy Ryza wrote:

> To answer your fourth question from Cloudera's perspective, we would never
> support a customer running Spark 2.0 on a Hadoop version < 2.6.
>
> -Sandy
>
> [...]


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-19 Thread Ted Yu
Should a new job be set up under Spark-Master-Maven-with-YARN for Hadoop
2.6.x?

Cheers

On Thu, Nov 19, 2015 at 5:16 PM, 张志强(旺轩) wrote:

> I agree
> +1
>
> --
> From: Reynold Xin
> Date: 2015-11-20 06:14:44
> To: dev@spark.apache.org; Sean Owen; Thomas Graves
> Subject: Dropping support for earlier Hadoop versions in Spark 2.0?
>
>
> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
> think everybody is for that.
>
> https://issues.apache.org/jira/browse/SPARK-11807
>
> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is
> to say, keep only Hadoop 2.6 and greater.
>
> What are the community's thoughts on that?
>
>
>


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-19 Thread Henri Dubois-Ferriere
+1

On 19 November 2015 at 14:14, Reynold Xin wrote:

> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
> think everybody is for that.
>
> https://issues.apache.org/jira/browse/SPARK-11807
>
> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is
> to say, keep only Hadoop 2.6 and greater.
>
> What are the community's thoughts on that?
>
>


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-19 Thread Jean-Baptiste Onofré

+1

Regards
JB

On 11/19/2015 11:14 PM, Reynold Xin wrote:

I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
think everybody is for that.

https://issues.apache.org/jira/browse/SPARK-11807

Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That
is to say, keep only Hadoop 2.6 and greater.

What are the community's thoughts on that?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
