Just for completeness' sake, Spark is not version-neutral with respect to Hadoop;
in particular, in YARN mode there is a minimum Hadoop version requirement
(though a fairly generous one, I believe).

I agree with Steve: it is a long-standing pain that we are bundling a
positively ancient version of Hive.
Having said that, we should decouple the Hive artifact question from
the Hadoop version question - though they might currently be related.

Regards,
Mridul

On Tue, Nov 19, 2019 at 2:40 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
> Hey Steve,
>
> In terms of Maven artifacts, I don't think the default Hadoop version matters
> except for the spark-hadoop-cloud module, which is only meaningful under the
> hadoop-3.2 profile. All the other spark-* artifacts published to Maven
> Central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark 
> distribution. Right now, we only publish PySpark artifacts prebuilt with 
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to 3.2 
> is feasible for PySpark users. Or maybe we should publish PySpark prebuilt 
> with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as the Hive 2.3 and Hadoop 3.2 upgrades can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference between having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just to minimize the
> release management work, in case we decide to publish the other spark-* Maven
> artifacts from a Hadoop 2.7 build, we can still special-case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss the Hive issue, because this thread was
>> originally about the `hadoop` version.
>>
>> And now we can have a `hive-2.3` profile for both the `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix the two.
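>>
>> For example (just a sketch - the `hive-2.3` profile name is the proposed one
>> and could still change before it lands; `hadoop-2.7`/`hadoop-3.2` are the
>> existing profile names from earlier in this thread), the combinations would
>> be built as:
>>
>>   ./build/mvn -DskipTests -Phadoop-2.7 -Phive-2.3 clean package
>>   ./build/mvn -DskipTests -Phadoop-3.2 -Phive-2.3 clean package
>>
>> so the Hive choice stays orthogonal to the Hadoop choice.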
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <felixcheun...@hotmail.com> 
>> wrote:
>>>
>>> 1000% with Steve: the org.spark-project hive 1.2 will need a solution. It
>>> is old and rather buggy, and it's been *years*.
>>>
>>> I think we should decouple the Hive change from everything else if people are
>>> concerned?
>>>
>>> ________________________________
>>> From: Steve Loughran <ste...@cloudera.com.INVALID>
>>> Sent: Sunday, November 17, 2019 9:22:09 AM
>>> To: Cheng Lian <lian.cs....@gmail.com>
>>> Cc: Sean Owen <sro...@gmail.com>; Wenchen Fan <cloud0...@gmail.com>; 
>>> Dongjoon Hyun <dongjoon.h...@gmail.com>; dev <dev@spark.apache.org>; Yuming 
>>> Wang <wgy...@gmail.com>
>>> Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of Hive which
>>> Spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues, and a source of
>>> unhappiness in the Hive project. Whatever gets shipped should do its best
>>> to avoid including that file.
>>>
>>> Postponing the switch to Hadoop 3.x until after Spark 3.0 is probably the
>>> safest move from a risk-minimisation perspective. If something has broken,
>>> you can start with the assumption that the problem is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the Hadoop/Hive dependencies, those teams will
>>> inevitably ignore the filed bug reports, for the same reason the Spark team
>>> will probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned, people should
>>> move to branch-3 if they want to run on a modern version of Java.
>>>
>>> It would be really, really good if the published Spark Maven artefacts (a)
>>> included the spark-hadoop-cloud JAR and (b) were dependent upon Hadoop 3.x.
>>> That way people building their own projects will get up-to-date
>>> dependencies and won't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: There is a discussion on hadoop-dev about making Hadoop 2.10 the official
>>> "last ever" branch-2 release and then declaring its predecessors EOL; 2.10
>>> will be the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I 
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which 
>>> seemed risky, and therefore we only introduced Hive 2.3 under the 
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong 
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
>>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling the Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrades together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until Spark 3.1+, rather
>>> than introducing yet another build combination. Does Hadoop 2 + Hive 2
>>> work, and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>> >
>>> > Do we have a limitation on the number of pre-built distributions? It seems
>>> > this time we need:
>>> > 1. hadoop 2.7 + hive 1.2
>>> > 2. hadoop 2.7 + hive 2.3
>>> > 3. hadoop 3 + hive 2.3
>>> >
>>> > AFAIK we always build with JDK 8 (but make it JDK 11-compatible), so we
>>> > don't need to add the JDK version to the combination.
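>>> >
>>> > If we do ship all three of the combinations above, packaging would presumably
>>> > just be three runs of the distribution script with different profiles,
>>> > roughly like this (only a sketch - the hive-2.3 profile name is the proposed
>>> > one, and the exact set of extra profiles such as -Phive-thriftserver/-Pyarn
>>> > may differ in the actual release scripts):
>>> >
>>> >   ./dev/make-distribution.sh --name hadoop2.7-hive1.2 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
>>> >   ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver -Pyarn
>>> >   ./dev/make-distribution.sh --name hadoop3.2-hive2.3 --tgz -Phadoop-3.2 -Phive-2.3 -Phive-thriftserver -Pyarn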
>>> >
>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <dongjoon.h...@gmail.com> 
>>> > wrote:
>>> >>
>>> >> Thank you for the suggestion.
>>> >>
>>> >> Having a `hive-2.3` profile sounds good to me because it's orthogonal to
>>> >> Hadoop 3.
>>> >> IIRC, originally, it was proposed in that way, but we put it under 
>>> >> `hadoop-3.2` to avoid adding new profiles at that time.
>>> >>
>>> >> And I'm wondering if you are considering additional pre-built
>>> >> distributions and Jenkins jobs.
>>> >>
>>> >> Bests,
>>> >> Dongjoon.
>>> >>
