Hey Nicholas,

Thanks for pointing this out. I just realized that I misread the spark-hadoop-cloud POM. Previously, in Spark 2.4, two profiles, "hadoop-2.7" and "hadoop-3.1", were referenced in the spark-hadoop-cloud POM (here <https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L174> and here <https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L213>). But in the current master (3.0.0-SNAPSHOT), only the "hadoop-3.2" profile is mentioned, and I came to the wrong conclusion that spark-hadoop-cloud in Spark 3.0.0 is only available with the "hadoop-3.2" profile. Apologies for the misleading information.
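For reference, a quick way to double-check which profiles actually pull the module in is to build it explicitly under a given profile. A minimal sketch (the profile and module names below are assumed from current master):

    # Sketch: build only the hadoop-cloud module (plus its upstream deps) with the Hadoop 3.2 profile
    ./build/mvn -Phadoop-3.2 -Phadoop-cloud -pl hadoop-cloud -am -DskipTests package

Running the same command with -Phadoop-2.7 instead should show whether the module still resolves against the Hadoop 2.x line.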
Cheng

On Tue, Nov 19, 2019 at 8:57 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile.
>
> What do you mean by "only meaningful under the hadoop-3.2 profile"?
>
> On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hey Steve,
>>
>> In terms of Maven artifacts, I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile. All the other spark-* artifacts published to Maven Central are Hadoop-version-neutral.
>>
>> Another issue with switching the default Hadoop version to 3.2 is the PySpark distribution. Right now, we only publish PySpark artifacts prebuilt with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>>
>> Again, as long as the Hive 2.3 and Hadoop 3.2 upgrades can be decoupled via the proposed hive-2.3 profile, I personally don't have a preference between Hadoop 2.7 and 3.2 as the default Hadoop version. But just to minimize the release management work, in case we decide to publish the other spark-* Maven artifacts from a Hadoop 2.7 build, we can still special-case spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>>
>> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> I also agree with Steve and Felix.
>>>
>>> Let's have another thread to discuss the Hive issue, because this thread was originally about the `hadoop` version.
>>>
>>> And now we can have a `hive-2.3` profile for both the `hadoop-2.7` and `hadoop-3.0` versions. We don't need to mix both.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>
>>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It is old and rather buggy, and it's been *years*.
>>>>
>>>> I think we should decouple the hive change from everything else if people are concerned?
>>>>
>>>> ------------------------------
>>>> *From:* Steve Loughran <ste...@cloudera.com.INVALID>
>>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>>> *To:* Cheng Lian <lian.cs....@gmail.com>
>>>> *Cc:* Sean Owen <sro...@gmail.com>; Wenchen Fan <cloud0...@gmail.com>; Dongjoon Hyun <dongjoon.h...@gmail.com>; dev <dev@spark.apache.org>; Yuming Wang <wgy...@gmail.com>
>>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>>
>>>> Can I take this moment to remind everyone that the version of Hive which Spark has historically bundled (the org.spark-project one) is an orphan project put together to deal with Hive's shading issues and a source of unhappiness in the Hive project. Whatever gets shipped should do its best to avoid including that artifact.
>>>>
>>>> Postponing a switch to Hadoop 3.x until after Spark 3.0 is probably the safest move from a risk minimisation perspective. If something has broken, you can start with the assumption that it is in the o.a.s packages without having to debug o.a.hadoop and o.a.hive first. There is a cost: if there are problems with the Hadoop / Hive dependencies, those teams will inevitably ignore filed bug reports, for the same reason the Spark team will probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that in mind. It's not been tested, it has dependencies on artifacts we know are incompatible, and as far as the Hadoop project is concerned, people should move to branch 3 if they want to run on a modern version of Java.
>>>>
>>>> It would be really, really good if the published Spark Maven artefacts (a) included the spark-hadoop-cloud JAR and (b) were dependent upon Hadoop 3.x. That way, people doing things with their own projects will get up-to-date dependencies and won't get WONTFIX responses themselves.
>>>>
>>>> -Steve
>>>>
>>>> PS: There is a discussion on hadoop-dev about making Hadoop 2.10 the official "last ever" branch-2 release and then declaring its predecessors EOL; 2.10 will be the transition release.
>>>>
>>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>>>
>>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong here...
>>>>
>>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not about demand, but risk control: coupling the Hive 2.3, Hadoop 3.2, and JDK 11 upgrades together looks too risky.
>>>>
>>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather than introducing yet another build combination. Does Hadoop 2 + Hive 2 work, and is there demand for it?
>>>>
>>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>> >
>>>> > Do we have a limitation on the number of pre-built distributions? Seems this time we need
>>>> > 1. hadoop 2.7 + hive 1.2
>>>> > 2. hadoop 2.7 + hive 2.3
>>>> > 3. hadoop 3 + hive 2.3
>>>> >
>>>> > AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't need to add the JDK version to the combination.
>>>> >
>>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>> >>
>>>> >> Thank you for the suggestion.
>>>> >>
>>>> >> Having a `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3. IIRC, it was originally proposed that way, but we put it under `hadoop-3.2` to avoid adding new profiles at that time.
>>>> >>
>>>> >> And I'm wondering if you are considering additional pre-built distributions and Jenkins jobs.
>>>> >>
>>>> >> Bests,
>>>> >> Dongjoon.
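For concreteness, the three combinations Wenchen lists above would roughly map onto builds along these lines. This is only a sketch: the hive-2.3 profile is the one proposed in this thread, and the distribution names are made up for illustration:

    # 1. Hadoop 2.7 + Hive 1.2 (the current default combination)
    ./dev/make-distribution.sh --name hadoop2.7 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver
    # 2. Hadoop 2.7 + Hive 2.3 (assumes the proposed hive-2.3 profile lands under that name)
    ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz -Phadoop-2.7 -Phive-2.3 -Phive -Phive-thriftserver
    # 3. Hadoop 3.2 + Hive 2.3
    ./dev/make-distribution.sh --name hadoop3.2 --tgz -Phadoop-3.2 -Phive-2.3 -Phive -Phive-thriftserver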