Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Cheng, could you elaborate on your criteria, `Hive 2.3 code paths are
proven to be stable`?
For me, it's difficult to imagine that we can reach any stable state when
we don't use those code paths ourselves at all.

> The Hive 1.2 code paths can only be removed once the Hive 2.3 code
paths are proven to be stable.

Sean, our published POM points to and advertises the illegitimate Hive
1.2 fork as a compile dependency.
Yes, it can be overridden. So why does Apache Spark need to publish it like
that?
If someone wants to use that illegitimate Hive 1.2 fork, let them override
it themselves. We are unable to delete those illegitimate Hive 1.2 fork
artifacts; they will remain orphans.

> The published POM will be agnostic to Hadoop / Hive; well,
> it will link against a particular version but can be overridden.

-
https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.12/3.0.0-preview
   ->
https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
   ->
https://mvnrepository.com/artifact/org.spark-project.hive/hive-metastore/1.2.1.spark2
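
As an illustration, a downstream sbt build could override those coordinates
roughly like the sketch below. This is only a sketch: the exclusions reuse the
forked coordinates listed above, and the Apache Hive 2.3.6 substitution is an
assumption for illustration, not a recipe endorsed in this thread.

    // build.sbt -- minimal sketch, not an officially documented recipe.
    // Exclude the forked Hive artifacts that spark-hive currently advertises,
    // and depend on Apache Hive instead (version chosen only for illustration).
    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-hive" % "3.0.0-preview")
        .exclude("org.spark-project.hive", "hive-exec")
        .exclude("org.spark-project.hive", "hive-metastore"),
      "org.apache.hive" % "hive-exec" % "2.3.6"
    )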

Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 5:26 PM Hyukjin Kwon  wrote:

> > Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
> This seems to be under investigation in Yuming's PR (
> https://github.com/apache/spark/pull/26533), if I am not mistaken.
>
> Oh, yes, what I meant by (default) was the default profiles we will use in
> Spark.
>
>
> On Wed, Nov 20, 2019 at 10:14 AM, Sean Owen wrote:
>
>> Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
>> sure if 2.7 did, but honestly I've lost track.
>> Anyway, it doesn't matter much as the JDK doesn't cause another build
>> permutation. All are built targeting Java 8.
>>
>> I also don't know if we have to declare a binary release a default.
>> The published POM will be agnostic to Hadoop / Hive; well, it will
>> link against a particular version but can be overridden. That's what
>> you're getting at?
>>
>>
>> On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon  wrote:
>> >
>> > So, are we able to conclude our plans as below?
>> >
>> > 1. In Spark 3,  we release as below:
>> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
>> >
>> > 2. In Spark 3.1, we target:
>> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
>> >
> > 3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)"
> combo right away after cutting branch-3, to see whether Hive 2.3 is considered
> stable in general.
>> > I roughly suspect it would be a couple of months after Spark 3.0
>> release (?).
>> >
>> > BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1
>> (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
>> >
>>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Nicholas Chammas
> I don't think the default Hadoop version matters except for the
spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
profile.

What do you mean by "only meaningful under the hadoop-3.2 profile"?

On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian  wrote:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy, and it’s been *years*.
>>>
>>> I think we should decouple hive change from everything else if people
>>> are concerned?
>>>
>>> --
>>> *From:* Steve Loughran 
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian 
>>> *Cc:* Sean Owen ; Wenchen Fan ;
>>> Dongjoon Hyun ; dev ;
>>> Yuming Wang 
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which
>>> spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues and a source of
>>> unhappiness in the Hive project. Whatever gets shipped should do its best
>>> to avoid including that artifact.
>>>
>>> Postponing a switch to hadoop 3.x until after spark 3.0 is probably the safest
>>> move from a risk minimisation perspective. If something has broken, then you
>>> can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the hadoop / hive dependencies, those teams will
>>> inevitably ignore filed bug reports, for the same reason the spark team will
>>> probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned: people should
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>> 3.x. That way people doing things with their own projects will get
>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>> the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian 
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
>>> >
>>> > Do we 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Hyukjin Kwon
We don't have an official Spark release with Hadoop 3 yet (except the preview),
if I am not mistaken.
I think it's more natural to wait one minor release before switching this
...
How about we target Hadoop 3 as the default in Spark 3.1?


On Wed, Nov 20, 2019 at 7:40 AM, Cheng Lian wrote:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy, and it’s been *years*.
>>>
>>> I think we should decouple hive change from everything else if people
>>> are concerned?
>>>
>>> --
>>> *From:* Steve Loughran 
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian 
>>> *Cc:* Sean Owen ; Wenchen Fan ;
>>> Dongjoon Hyun ; dev ;
>>> Yuming Wang 
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which
>>> spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues and a source of
>>> unhappiness in the Hive project. Whatever gets shipped should do its best
>>> to avoid including that artifact.
>>>
>>> Postponing a switch to hadoop 3.x until after spark 3.0 is probably the safest
>>> move from a risk minimisation perspective. If something has broken, then you
>>> can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the hadoop / hive dependencies, those teams will
>>> inevitably ignore filed bug reports, for the same reason the spark team will
>>> probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned: people should
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>> 3.x. That way people doing things with their own projects will get
>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>> the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian 
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
>>> >
>>> > Do 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Hyukjin Kwon
> Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
This seems to be under investigation in Yuming's PR (
https://github.com/apache/spark/pull/26533), if I am not mistaken.

Oh, yes, what I meant by (default) was the default profiles we will use in
Spark.


On Wed, Nov 20, 2019 at 10:14 AM, Sean Owen wrote:

> Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
> sure if 2.7 did, but honestly I've lost track.
> Anyway, it doesn't matter much as the JDK doesn't cause another build
> permutation. All are built targeting Java 8.
>
> I also don't know if we have to declare a binary release a default.
> The published POM will be agnostic to Hadoop / Hive; well, it will
> link against a particular version but can be overridden. That's what
> you're getting at?
>
>
> On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon  wrote:
> >
> > So, are we able to conclude our plans as below?
> >
> > 1. In Spark 3,  we release as below:
> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
> >
> > 2. In Spark 3.1, we target:
> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
> >
> > 3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)"
> combo right away after cutting branch-3, to see whether Hive 2.3 is considered
> stable in general.
> > I roughly suspect it would be a couple of months after Spark 3.0
> release (?).
> >
> > BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1
> (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
> >
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Sean Owen
Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
sure if 2.7 did, but honestly I've lost track.
Anyway, it doesn't matter much as the JDK doesn't cause another build
permutation. All are built targeting Java 8.

I also don't know if we have to declare a binary release a default.
The published POM will be agnostic to Hadoop / Hive; well, it will
link against a particular version but can be overridden. That's what
you're getting at?


On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon  wrote:
>
> So, are we able to conclude our plans as below?
>
> 1. In Spark 3,  we release as below:
>   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
>
> 2. In Spark 3.1, we target:
>   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
>
> > 3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo
> right away after cutting branch-3, to see whether Hive 2.3 is considered stable
> in general.
> I roughly suspect it would be a couple of months after Spark 3.0 release 
> (?).
>
> BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1 (fork) + 
> JDK8 (default)" combination is deprecated anyway in Spark 3.
>




Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Hyukjin Kwon
So, are we able to conclude our plans as below?

1. In Spark 3,  we release as below:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
  - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)

2. In Spark 3.1, we target:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)

3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo
right away after cutting branch-3, to see whether Hive 2.3 is considered
stable in general.
I roughly suspect it would be a couple of months after Spark 3.0
release (?).

BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1 (fork) +
JDK8 (default)" combination is deprecated anyway in Spark 3.



On Wed, Nov 20, 2019 at 9:52 AM, Cheng Lian wrote:

> Thanks for taking care of this, Dongjoon!
>
> We can target SPARK-20202 to 3.1.0, but I don't think we should do it
> immediately after cutting the branch-3.0. The Hive 1.2 code paths can only
> be removed once the Hive 2.3 code paths are proven to be stable. If it
> turned out to be buggy in Spark 3.1, we may want to further postpone
> SPARK-20202 to 3.2.0 by then.
>
> On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun 
> wrote:
>
>> Yes. It does. I meant SPARK-20202.
>>
>> Thanks. I understand that it can be considered like Scala version issue.
>> So, that's the reason why I put this as a `policy` issue from the
>> beginning.
>>
>> > First of all, I want to put this as a policy issue instead of a
>> technical issue.
>>
>> In the policy perspective, we should remove this immediately if we have a
>> solution to fix this.
>> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
>> the current discussion status.
>>
>> https://issues.apache.org/jira/browse/SPARK-20202
>>
>> And, if there is no other issues, I'll create a PR to remove it from
>> `master` branch when we cut `branch-3.0`.
>>
>> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do
>> you think about this, Sean?
>> The preparation is already started in another email thread and I believe
>> that is a keystone to prove `Hive 2.3` version stability
>> (which Cheng/Hyukjin/you asked).
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian  wrote:
>>
>>> It's kinda like Scala version upgrade. Historically, we only remove the
>>> support of an older Scala version when the newer version is proven to be
>>> stable after one or more Spark minor versions.
>>>
>>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian 
>>> wrote:
>>>
 Hmm, what exactly did you mean by "remove the usage of forked `hive` in
 Apache Spark 3.0 completely officially"? I thought you wanted to remove the
 forked Hive 1.2 dependencies completely, no? As long as we still keep the
 Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
 particular preference between using Hive 1.2 or 2.3 as the default Hive
 version. After all, for end-users and providers who need a particular
 version combination, they can always build Spark with proper profiles
 themselves.

 And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
 it's due to the folder name.

 On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
 wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
> renaming the directories until the 3.0.0 deadline to minimize the diff.
>
> We can replace it immediately if we want right now.
>
>
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8
>> world.
>> If we consider them, it could be the followings.
>>
>> +------------+-----------------+-------------------+
>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6 |
>> +------------+-----------------+-------------------+
>> | Legitimate |        X        |         O         |
>> | JDK11      |        X        |         O         |
>> | Hadoop3    |        X        |         O         |
>> | Hadoop2    |        O        |         O         |
>> | Functions  |    Baseline     |       More        |
>> | Bug fixes  |    Baseline     |       More        |
>> +------------+-----------------+-------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it ourselves
>> (including Jenkins/GitHubAction/AppVeyor).
>>
>> For me, AS-IS 3.0 is not enough for that. According to your advice,
>> to give more visibility to the whole community,
>>
>> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built
>> distribution
>> 2. We need to switch our default Hive usage to 2.3 in `master` for
>> 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Thanks for taking care of this, Dongjoon!

We can target SPARK-20202 to 3.1.0, but I don't think we should do it
immediately after cutting the branch-3.0. The Hive 1.2 code paths can only
be removed once the Hive 2.3 code paths are proven to be stable. If it
turned out to be buggy in Spark 3.1, we may want to further postpone
SPARK-20202 to 3.2.0 by then.

On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun 
wrote:

> Yes. It does. I meant SPARK-20202.
>
> Thanks. I understand that it can be considered like Scala version issue.
> So, that's the reason why I put this as a `policy` issue from the
> beginning.
>
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
>
> In the policy perspective, we should remove this immediately if we have a
> solution to fix this.
> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
> the current discussion status.
>
> https://issues.apache.org/jira/browse/SPARK-20202
>
> And, if there is no other issues, I'll create a PR to remove it from
> `master` branch when we cut `branch-3.0`.
>
> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do
> you think about this, Sean?
> The preparation is already started in another email thread and I believe
> that is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian  wrote:
>
>> It's kinda like Scala version upgrade. Historically, we only remove the
>> support of an older Scala version when the newer version is proven to be
>> stable after one or more Spark minor versions.
>>
>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:
>>
>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>>> version. After all, for end-users and providers who need a particular
>>> version combination, they can always build Spark with proper profiles
>>> themselves.
>>>
>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>>> it's due to the folder name.
>>>
>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
>>> wrote:
>>>
 BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.

 For directory name, we use '1.2.1' and '2.3.5' because we just delayed
 renaming the directories until the 3.0.0 deadline to minimize the diff.

 We can replace it immediately if we want right now.



 On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
 wrote:

> Hi, Cheng.
>
> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
> If we consider them, it could be the followings.
>
> +------------+-----------------+-------------------+
> |            | Hive 1.2.1 fork | Apache Hive 2.3.6 |
> +------------+-----------------+-------------------+
> | Legitimate |        X        |         O         |
> | JDK11      |        X        |         O         |
> | Hadoop3    |        X        |         O         |
> | Hadoop2    |        O        |         O         |
> | Functions  |    Baseline     |       More        |
> | Bug fixes  |    Baseline     |       More        |
> +------------+-----------------+-------------------+
>
> To stabilize Spark's Hive 2.3 usage, we should use it ourselves
> (including Jenkins/GitHubAction/AppVeyor).
>
> For me, AS-IS 3.0 is not enough for that. According to your advice,
> to give more visibility to the whole community,
>
> 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built
> distribution
> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
> after `branch-3.0` branch cut.
>
> I know that we have been reluctant to (1) and (2) due to their burden.
> But, it's time to prepare. Without them, we are going to fall short
> again and again.
>
> Bests,
> Dongjoon.
>
>
>
>
> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian 
> wrote:
>
>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 
>> 1.2
>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>> 
>> and here
>> 
>> .)
>>
>> Again, I'm happy to get rid 

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread John Zhuge
Great work, Bo! Would love to hear the details.


On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue  wrote:

> I'm interested in remote shuffle services as well. I'd love to hear about
> what you're using in production!
>
> rb
>
> On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:
>
>> Hi Ben,
>>
>> Thanks for writing this up! This is Bo from Uber. I am on Felix's team in
>> Seattle, working on disaggregated shuffle (we call it remote shuffle
>> service, RSS, internally). We have had RSS in production for a while, and
>> learned a lot during the work (tried quite a few techniques to improve the
>> remote shuffle performance). We could share our learning with the
>> community, and also would like to hear feedback/suggestions on how to
>> further improve remote shuffle performance. We could chat more details if
>> you or other people are interested.
>>
>> Best,
>> Bo
>>
>> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
>> wrote:
>>
>>> I would like to start a conversation about extending the Spark shuffle
>>> manager surface to support fully disaggregated shuffle implementations.
>>> This is closely related to the work in SPARK-25299
>>> , which is focused
>>> on refactoring the shuffle manager API (and in particular,
>>> SortShuffleManager) to use a pluggable storage backend. The motivation for
>>> that SPIP is further enabling Spark on Kubernetes.
>>>
>>>
>>> The motivation for this proposal is enabling full externalized
>>> (disaggregated) shuffle service implementations. (Facebook’s Cosco
>>> shuffle
>>> 
>>> is one example of such a disaggregated shuffle service.) These changes
>>> allow the bulk of the shuffle to run in a remote service such that minimal
>>> state resides in executors and local disk spill is minimized. The net
>>> effect is increased job stability and performance improvements in certain
>>> scenarios. These changes should work well with or are complementary to
>>> SPARK-25299. Some or all points may be merged into that issue as
>>> appropriate.
>>>
>>>
>>> Below is a description of each component of this proposal. These changes
>>> can ideally be introduced incrementally. I would like to gather feedback
>>> and gauge interest from others in the community to collaborate on this.
>>> There are likely more points that would  be useful to disaggregated shuffle
>>> services. We can outline a more concrete plan after gathering enough input.
>>> A working session could help us kick off this joint effort; maybe something
>>> in the mid-January to mid-February timeframe (depending on interest and
>>> availability). I’m happy to host at our Sunnyvale, CA offices.
>>>
>>>
>>> Proposal
>>>
>>> Scheduling and re-executing tasks
>>>
>>> Allow coordination between the service and the Spark DAG scheduler as to
>>> whether a given block/partition needs to be recomputed when a task fails or
>>> when shuffle block data cannot be read. Having such coordination is
>>> important, e.g., for suppressing recomputation after aborted executors or
>>> for forcing late recomputation if the service internally acts as a cache.
>>> One catchall solution is to have the shuffle manager provide an indication
>>> of whether shuffle data is external to executors (or nodes). Another
>>> option: allow the shuffle manager (likely on the driver) to be queried for
>>> the existence of shuffle data for a given executor ID (or perhaps map task,
>>> reduce task, etc). Note that this is at the level of data the scheduler is
>>> aware of (i.e., map/reduce partitions) rather than block IDs, which are
>>> internal details for some shuffle managers.
>>> ShuffleManager API
>>>
>>> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that the
>>> service knows that data is still active. This is one way to enable
>>> time-/job-scoped data because a disaggregated shuffle service cannot rely
>>> on robust communication with Spark and in general has a distinct lifecycle
>>> from the Spark deployment(s) it talks to. This would likely take the form
>>> of a callback on ShuffleManager itself, but there are other approaches.
>>>
>>>
>>> Add lifecycle hooks to shuffle readers and writers (e.g., to
>>> close/recycle connections/streams/file handles as well as provide commit
>>> semantics). SPARK-25299 adds commit semantics to the internal data storage
>>> layer, but this is applicable to all shuffle managers at a higher level and
>>> should apply equally to the ShuffleWriter.
>>>
>>>
>>> Do not require ShuffleManagers to expose ShuffleBlockResolvers where
>>> they are not needed. Ideally, this would be an implementation detail of the
>>> shuffle manager itself. If there is substantial overlap between the
>>> SortShuffleManager and other implementations, then the storage details can
>>> be abstracted at the appropriate level. (SPARK-25299 does not currently
>>> change this.)
>>>
>>>
>>> Do not require MapStatus to include 

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread Ryan Blue
I'm interested in remote shuffle services as well. I'd love to hear about
what you're using in production!

rb

On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:

> Hi Ben,
>
> Thanks for writing this up! This is Bo from Uber. I am on Felix's team in
> Seattle, working on disaggregated shuffle (we call it remote shuffle
> service, RSS, internally). We have had RSS in production for a while, and
> learned a lot during the work (tried quite a few techniques to improve the
> remote shuffle performance). We could share our learning with the
> community, and also would like to hear feedback/suggestions on how to
> further improve remote shuffle performance. We could chat more details if
> you or other people are interested.
>
> Best,
> Bo
>
> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
> wrote:
>
>> I would like to start a conversation about extending the Spark shuffle
>> manager surface to support fully disaggregated shuffle implementations.
>> This is closely related to the work in SPARK-25299
>> , which is focused on
>> refactoring the shuffle manager API (and in particular, SortShuffleManager)
>> to use a pluggable storage backend. The motivation for that SPIP is further
>> enabling Spark on Kubernetes.
>>
>>
>> The motivation for this proposal is enabling full externalized
>> (disaggregated) shuffle service implementations. (Facebook’s Cosco
>> shuffle
>> 
>> is one example of such a disaggregated shuffle service.) These changes
>> allow the bulk of the shuffle to run in a remote service such that minimal
>> state resides in executors and local disk spill is minimized. The net
>> effect is increased job stability and performance improvements in certain
>> scenarios. These changes should work well with or are complementary to
>> SPARK-25299. Some or all points may be merged into that issue as
>> appropriate.
>>
>>
>> Below is a description of each component of this proposal. These changes
>> can ideally be introduced incrementally. I would like to gather feedback
>> and gauge interest from others in the community to collaborate on this.
>> There are likely more points that would  be useful to disaggregated shuffle
>> services. We can outline a more concrete plan after gathering enough input.
>> A working session could help us kick off this joint effort; maybe something
>> in the mid-January to mid-February timeframe (depending on interest and
>> availability). I’m happy to host at our Sunnyvale, CA offices.
>>
>>
>> Proposal
>>
>> Scheduling and re-executing tasks
>>
>> Allow coordination between the service and the Spark DAG scheduler as to
>> whether a given block/partition needs to be recomputed when a task fails or
>> when shuffle block data cannot be read. Having such coordination is
>> important, e.g., for suppressing recomputation after aborted executors or
>> for forcing late recomputation if the service internally acts as a cache.
>> One catchall solution is to have the shuffle manager provide an indication
>> of whether shuffle data is external to executors (or nodes). Another
>> option: allow the shuffle manager (likely on the driver) to be queried for
>> the existence of shuffle data for a given executor ID (or perhaps map task,
>> reduce task, etc). Note that this is at the level of data the scheduler is
>> aware of (i.e., map/reduce partitions) rather than block IDs, which are
>> internal details for some shuffle managers.
>> ShuffleManager API
>>
>> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that the
>> service knows that data is still active. This is one way to enable
>> time-/job-scoped data because a disaggregated shuffle service cannot rely
>> on robust communication with Spark and in general has a distinct lifecycle
>> from the Spark deployment(s) it talks to. This would likely take the form
>> of a callback on ShuffleManager itself, but there are other approaches.
>>
>>
>> Add lifecycle hooks to shuffle readers and writers (e.g., to
>> close/recycle connections/streams/file handles as well as provide commit
>> semantics). SPARK-25299 adds commit semantics to the internal data storage
>> layer, but this is applicable to all shuffle managers at a higher level and
>> should apply equally to the ShuffleWriter.
>>
>>
>> Do not require ShuffleManagers to expose ShuffleBlockResolvers where they
>> are not needed. Ideally, this would be an implementation detail of the
>> shuffle manager itself. If there is substantial overlap between the
>> SortShuffleManager and other implementations, then the storage details can
>> be abstracted at the appropriate level. (SPARK-25299 does not currently
>> change this.)
>>
>>
>> Do not require MapStatus to include blockmanager IDs where they are not
>> relevant. This is captured by ShuffleBlockInfo
>> 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Sean Owen
Same idea? support this combo in 3.0 and then remove Hadoop 2 support
in 3.1 or something? or at least make them non-default, not
necessarily publish special builds?

On Tue, Nov 19, 2019 at 4:53 PM Dongjoon Hyun  wrote:
> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do you 
> think about this, Sean?
> The preparation is already started in another email thread and I believe that 
> is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>




Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Yes. It does. I meant SPARK-20202.

Thanks. I understand that it can be considered like the Scala version issue.
So, that's the reason why I framed this as a `policy` issue from the beginning.

> First of all, I want to put this as a policy issue instead of a technical
issue.

From a policy perspective, we should remove this immediately if we have a
solution to fix it.
For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to the
current discussion status.

https://issues.apache.org/jira/browse/SPARK-20202

And, if there are no other issues, I'll create a PR to remove it from
the `master` branch when we cut `branch-3.0`.

What do you think about the additional `hadoop-2.7 with Hive 2.3` pre-built
distribution, Sean?
The preparation has already started in another email thread, and I believe
that it is a keystone to proving `Hive 2.3` version stability
(which Cheng/Hyukjin/you asked for).

Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian  wrote:

> It's kinda like Scala version upgrade. Historically, we only remove the
> support of an older Scala version when the newer version is proven to be
> stable after one or more Spark minor versions.
>
> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:
>
>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>> version. After all, for end-users and providers who need a particular
>> version combination, they can always build Spark with proper profiles
>> themselves.
>>
>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
>> due to the folder name.
>>
>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>
>>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>>> renaming the directories until the 3.0.0 deadline to minimize the diff.
>>>
>>> We can replace it immediately if we want right now.
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, Cheng.

 This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
 If we consider them, it could be the followings.

 +------------+-----------------+-------------------+
 |            | Hive 1.2.1 fork | Apache Hive 2.3.6 |
 +------------+-----------------+-------------------+
 | Legitimate |        X        |         O         |
 | JDK11      |        X        |         O         |
 | Hadoop3    |        X        |         O         |
 | Hadoop2    |        O        |         O         |
 | Functions  |    Baseline     |       More        |
 | Bug fixes  |    Baseline     |       More        |
 +------------+-----------------+-------------------+

 To stabilize Spark's Hive 2.3 usage, we should use it ourselves
 (including Jenkins/GitHubAction/AppVeyor).

 For me, AS-IS 3.0 is not enough for that. According to your advice,
 to give more visibility to the whole community,

 1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built
 distribution
 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
 after `branch-3.0` branch cut.

 I know that we have been reluctant to (1) and (2) due to their burden.
 But, it's time to prepare. Without them, we are going to fall short
 again and again.

 Bests,
 Dongjoon.




 On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian 
 wrote:

> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
> minor release to stabilize Hive 2.3 code paths before retiring the Hive 
> 1.2
> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
> 
> and here
> 
> .)
>
> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
> Spark 3.0. For preview releases, I'm afraid that their visibility is not
> good enough for covering such major upgrades.
>
> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the feedback, Hyukjin and Sean.
>>
>> I proposed `preview-2` for that purpose, but I'm also +1 for doing that
>> at 3.1
>> if we can make a decision to eliminate the illegitimate Hive fork

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread bo yang
Hi Ben,

Thanks for writing this up! This is Bo from Uber. I am on Felix's team in
Seattle, working on disaggregated shuffle (we call it remote shuffle
service, RSS, internally). We have had RSS in production for a while, and
learned a lot during the work (tried quite a few techniques to improve the
remote shuffle performance). We could share our learning with the
community, and also would like to hear feedback/suggestions on how to
further improve remote shuffle performance. We could chat more details if
you or other people are interested.

Best,
Bo

On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
wrote:

> I would like to start a conversation about extending the Spark shuffle
> manager surface to support fully disaggregated shuffle implementations.
> This is closely related to the work in SPARK-25299
> , which is focused on
> refactoring the shuffle manager API (and in particular, SortShuffleManager)
> to use a pluggable storage backend. The motivation for that SPIP is further
> enabling Spark on Kubernetes.
>
>
> The motivation for this proposal is enabling full externalized
> (disaggregated) shuffle service implementations. (Facebook’s Cosco shuffle
> 
> is one example of such a disaggregated shuffle service.) These changes
> allow the bulk of the shuffle to run in a remote service such that minimal
> state resides in executors and local disk spill is minimized. The net
> effect is increased job stability and performance improvements in certain
> scenarios. These changes should work well with or are complementary to
> SPARK-25299. Some or all points may be merged into that issue as
> appropriate.
>
>
> Below is a description of each component of this proposal. These changes
> can ideally be introduced incrementally. I would like to gather feedback
> and gauge interest from others in the community to collaborate on this.
> There are likely more points that would  be useful to disaggregated shuffle
> services. We can outline a more concrete plan after gathering enough input.
> A working session could help us kick off this joint effort; maybe something
> in the mid-January to mid-February timeframe (depending on interest and
> availability). I’m happy to host at our Sunnyvale, CA offices.
>
>
> Proposal
>
> Scheduling and re-executing tasks
>
> Allow coordination between the service and the Spark DAG scheduler as to
> whether a given block/partition needs to be recomputed when a task fails or
> when shuffle block data cannot be read. Having such coordination is
> important, e.g., for suppressing recomputation after aborted executors or
> for forcing late recomputation if the service internally acts as a cache.
> One catchall solution is to have the shuffle manager provide an indication
> of whether shuffle data is external to executors (or nodes). Another
> option: allow the shuffle manager (likely on the driver) to be queried for
> the existence of shuffle data for a given executor ID (or perhaps map task,
> reduce task, etc). Note that this is at the level of data the scheduler is
> aware of (i.e., map/reduce partitions) rather than block IDs, which are
> internal details for some shuffle managers.
> ShuffleManager API
>
> Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that the
> service knows that data is still active. This is one way to enable
> time-/job-scoped data because a disaggregated shuffle service cannot rely
> on robust communication with Spark and in general has a distinct lifecycle
> from the Spark deployment(s) it talks to. This would likely take the form
> of a callback on ShuffleManager itself, but there are other approaches.
>
>
> Add lifecycle hooks to shuffle readers and writers (e.g., to close/recycle
> connections/streams/file handles as well as provide commit semantics).
> SPARK-25299 adds commit semantics to the internal data storage layer, but
> this is applicable to all shuffle managers at a higher level and should
> apply equally to the ShuffleWriter.
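
[For illustration only: the heartbeat, data-availability, and lifecycle ideas
above could take roughly the shape of the Scala sketch below. Every name in it
is hypothetical -- it is not the SPARK-25299 API or any existing Spark
interface, just one possible way to express the hooks being discussed.]

    // Illustrative sketch of driver-side hooks for a disaggregated shuffle
    // service. All names are hypothetical.
    trait ExternalShuffleLifecycleHooks {
      // Periodic keep-alive so an external service can expire shuffle data
      // that no longer belongs to a live application.
      def heartbeat(appId: String, shuffleId: Int): Unit

      // Lets the scheduler ask whether map output still exists outside the
      // executors, instead of assuming it was lost with the executor.
      def isMapOutputAvailable(shuffleId: Int, mapTaskId: Long): Boolean

      // Commit/cleanup hooks so readers and writers can release connections,
      // streams, and file handles deterministically.
      def onShuffleWriteCommitted(shuffleId: Int, mapTaskId: Long): Unit
      def onShuffleRemoved(shuffleId: Int): Unit
    }
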
>
>
> Do not require ShuffleManagers to expose ShuffleBlockResolvers where they
> are not needed. Ideally, this would be an implementation detail of the
> shuffle manager itself. If there is substantial overlap between the
> SortShuffleManager and other implementations, then the storage details can
> be abstracted at the appropriate level. (SPARK-25299 does not currently
> change this.)
>
>
> Do not require MapStatus to include blockmanager IDs where they are not
> relevant. This is captured by ShuffleBlockInfo
> 
> including an optional BlockManagerId in SPARK-25299. However, this change
> should be lifted to the MapStatus level so that it applies to all
> ShuffleManagers. Alternatively, use a more general data-location
> abstraction than BlockManagerId. This gives the shuffle 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Cheng Lian
Hey Steve,

In terms of Maven artifacts, I don't think the default Hadoop version
matters except for the spark-hadoop-cloud module, which is only meaningful
under the hadoop-3.2 profile. All the other spark-* artifacts published to
Maven central are Hadoop-version-neutral.

Another issue about switching the default Hadoop version to 3.2 is PySpark
distribution. Right now, we only publish PySpark artifacts prebuilt with
Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
3.2 is feasible for PySpark users. Or maybe we should publish PySpark
prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.

Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
proposed hive-2.3 profile, I personally don't have a preference over having
Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
the release management work, in case we decided to publish other spark-*
Maven artifacts from a Hadoop 2.7 build, we can still special case
spark-hadoop-cloud and publish it using a hadoop-3.2 build.

On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun 
wrote:

> I also agree with Steve and Felix.
>
> Let's have another thread to discuss Hive issue
>
> because this thread was originally for `hadoop` version.
>
> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
> `hadoop-3.0` versions.
>
> We don't need to mix both.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
> wrote:
>
>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It
>> is old and rather buggy, and it’s been *years*.
>>
>> I think we should decouple hive change from everything else if people are
>> concerned?
>>
>> --
>> *From:* Steve Loughran 
>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>> *To:* Cheng Lian 
>> *Cc:* Sean Owen ; Wenchen Fan ;
>> Dongjoon Hyun ; dev ;
>> Yuming Wang 
>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>
>> Can I take this moment to remind everyone that the version of hive which
>> spark has historically bundled (the org.spark-project one) is an orphan
>> project put together to deal with Hive's shading issues and a source of
>> unhappiness in the Hive project. Whatever gets shipped should do its best
>> to avoid including that artifact.
>>
>> Postponing a switch to hadoop 3.x until after spark 3.0 is probably the safest
>> move from a risk minimisation perspective. If something has broken, then you
>> can start with the assumption that it is in the o.a.s packages
>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>> there are problems with the hadoop / hive dependencies, those teams will
>> inevitably ignore filed bug reports, for the same reason the spark team will
>> probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>> in mind. It's not been tested, it has dependencies on artifacts we know are
>> incompatible, and as far as the Hadoop project is concerned: people should
>> move to branch 3 if they want to run on a modern version of Java
>>
>> It would be really really good if the published spark maven artefacts (a)
>> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
>> That way people doing things with their own projects will get up-to-date
>> dependencies and don't get WONTFIX responses themselves.
>>
>> -Steve
>>
>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>> the transition release.
>>
>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:
>>
>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>> seemed risky, and therefore we only introduced Hive 2.3 under the
>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>> here...
>>
>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>> upgrade together looks too risky.
>>
>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>
>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>> work and is there demand for it?
>>
>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
>> >
>> > Do we have a limitation on the number of pre-built distributions? Seems
>> this time we need
>> > 1. hadoop 2.7 + hive 1.2
>> > 2. hadoop 2.7 + hive 2.3
>> > 3. hadoop 3 + hive 2.3
>> >
>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
>> don't need to add JDK version to the combination.
>> >
>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
>> wrote:
>> >>
>> >> Thank you 

Re: Migration `Spark QA Compile` Jenkins jobs to GitHub Action

2019-11-19 Thread Dongjoon Hyun
Thank you, Sean, Shane, and Xiao!

Bests,
Dongjoon.

On Tue, Nov 19, 2019 at 2:15 PM Shane Knapp  wrote:

> i had a few minutes and everything has been deleted!
>
> On Tue, Nov 19, 2019 at 2:02 PM Shane Knapp  wrote:
> >
> > thanks sean!
> >
> > i am all for moving these jobs to github actions, and will be doing
> > this 'soon' as i'm @ kubecon this week.
> >
> > btw the R ecosystem definitely needs some attention, however, but
> > that's an issue for another time.  :)
> >
> > On Tue, Nov 19, 2019 at 1:49 PM Sean Owen  wrote:
> > >
> > > I would favor moving whatever we can to Github. It's difficult to
> > > modify the Jenkins instances without Shane's valiant help, and over
> > > time makes more sense to modernize and integrate it into the project.
> > >
> > > On Tue, Nov 19, 2019 at 3:35 PM Dongjoon Hyun 
> wrote:
> > > >
> > > > Hi, All.
> > > >
> > > > Apache Spark community used the following dashboard as post-hook
> verifications.
> > > >
> > > >
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> > > >
> > > > There are six registered jobs.
> > > >
> > > > 1. spark-branch-2.4-compile-maven-hadoop-2.6
> > > > 2. spark-branch-2.4-compile-maven-hadoop-2.7
> > > > 3. spark-branch-2.4-lint
> > > > 4. spark-master-compile-maven-hadoop-2.7
> > > > 5. spark-master-compile-maven-hadoop-3.2
> > > > 6. spark-master-lint
> > > >
> > > > Now, we added `GitHub Action` jobs. You can see the green check at
> every commit.
> > > >
> > > > https://github.com/apache/spark/commits/master
> > > > https://github.com/apache/spark/commits/branch-2.4
> > > >
> > > > If you click the green check, you can see the detail.
> > > > The followings are the example runs at the last commits on both
> branches.
> > > >
> > > > https://github.com/apache/spark/runs/310411948 (master)
> > > > https://github.com/apache/spark/runs/309522646 (branch-2.4)
> > > >
> > > > The new `GitHub Action` jobs have more combinations than the old Jenkins jobs.
> > > >
> > > > - branch-2.4-scala-2.11-hadoop-2.6 (compile/package/install)
> > > > - branch-2.4-scala-2.12-hadoop-2.6 (compile/package/install)
> > > > - branch-2.4-scala-2.11-hadoop-2.7 (compile/package/install)
> > > > - branch-2.4-scala-2.12-hadoop-2.7 (compile/package/install)
> > > > - branch-2.4-linters (Scala/Java/Python/R)
> > > > - master-scala-2.12-hadoop-2.7 (compile/package/install)
> > > > - master-scala-2.12-hadoop-3.2 (compile/package/install)
> > > > - master-scala-2.12-hadoop-3.2-jdk11 (compile/package/install)
> > > > - master-linters (Scala/Java/Python/R)
> > > >
> > > > In addition, this is a part of Apache Spark code base and everyone
> can make contributions on this.
> > > >
> > > > Finally, as the last piece of this work, we are going to remove the
> legacy Jenkins jobs via the following JIRA issue.
> > > >
> > > > https://issues.apache.org/jira/browse/SPARK-29935
> > > >
> > > > Please let me know if you have any concerns on this.
> > > > (We can keep the legacy jobs, but two of them are already broken.)
> > > >
> > > > Bests,
> > > > Dongjoon.
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> >
> >
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Migration `Spark QA Compile` Jenkins jobs to GitHub Action

2019-11-19 Thread Shane Knapp
i had a few minutes and everything has been deleted!

On Tue, Nov 19, 2019 at 2:02 PM Shane Knapp  wrote:
>
> thanks sean!
>
> i am all for moving these jobs to github actions, and will be doing
> this 'soon' as i'm @ kubecon this week.
>
> btw the R ecosystem definitely needs some attention, however, but
> that's an issue for another time.  :)
>
> On Tue, Nov 19, 2019 at 1:49 PM Sean Owen  wrote:
> >
> > I would favor moving whatever we can to Github. It's difficult to
> > modify the Jenkins instances without Shane's valiant help, and over
> > time makes more sense to modernize and integrate it into the project.
> >
> > On Tue, Nov 19, 2019 at 3:35 PM Dongjoon Hyun  
> > wrote:
> > >
> > > Hi, All.
> > >
> > > Apache Spark community used the following dashboard as post-hook 
> > > verifications.
> > >
> > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> > >
> > > There are six registered jobs.
> > >
> > > 1. spark-branch-2.4-compile-maven-hadoop-2.6
> > > 2. spark-branch-2.4-compile-maven-hadoop-2.7
> > > 3. spark-branch-2.4-lint
> > > 4. spark-master-compile-maven-hadoop-2.7
> > > 5. spark-master-compile-maven-hadoop-3.2
> > > 6. spark-master-lint
> > >
> > > Now, we added `GitHub Action` jobs. You can see the green check at every 
> > > commit.
> > >
> > > https://github.com/apache/spark/commits/master
> > > https://github.com/apache/spark/commits/branch-2.4
> > >
> > > If you click the green check, you can see the detail.
> > > The followings are the example runs at the last commits on both branches.
> > >
> > > https://github.com/apache/spark/runs/310411948 (master)
> > > https://github.com/apache/spark/runs/309522646 (branch-2.4)
> > >
> > > The new `GitHub Action` jobs have more combinations than the old Jenkins jobs.
> > >
> > > - branch-2.4-scala-2.11-hadoop-2.6 (compile/package/install)
> > > - branch-2.4-scala-2.12-hadoop-2.6 (compile/package/install)
> > > - branch-2.4-scala-2.11-hadoop-2.7 (compile/package/install)
> > > - branch-2.4-scala-2.12-hadoop-2.7 (compile/package/install)
> > > - branch-2.4-linters (Scala/Java/Python/R)
> > > - master-scala-2.12-hadoop-2.7 (compile/package/install)
> > > - master-scala-2.12-hadoop-3.2 (compile/package/install)
> > > - master-scala-2.12-hadoop-3.2-jdk11 (compile/package/install)
> > > - master-linters (Scala/Java/Python/R)
> > >
> > > In addition, this is a part of Apache Spark code base and everyone can 
> > > make contributions on this.
> > >
> > > Finally, as the last piece of this work, we are going to remove the 
> > > legacy Jenkins jobs via the following JIRA issue.
> > >
> > > https://issues.apache.org/jira/browse/SPARK-29935
> > >
> > > Please let me know if you have any concerns on this.
> > > (We can keep the legacy jobs, but two of them are already broken.)
> > >
> > > Bests,
> > > Dongjoon.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu




Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
It's kind of like a Scala version upgrade. Historically, we only remove
support for an older Scala version once the newer version has proven to be
stable after one or more Spark minor versions.

On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian  wrote:

> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
> forked Hive 1.2 dependencies completely, no? As long as we still keep the
> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
> particular preference between using Hive 1.2 or 2.3 as the default Hive
> version. After all, for end-users and providers who need a particular
> version combination, they can always build Spark with proper profiles
> themselves.
>
> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
> due to the folder name.
>
> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
> wrote:
>
>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>
>> For the directory names, we use '1.2.1' and '2.3.5' because we just delayed
>> renaming the directories until the 3.0.0 deadline to minimize the diff.
>>
>> We can replace it immediately if we want right now.
>>
>>
>>
>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Cheng.
>>>
>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>> If we do consider them, the comparison is as follows.
>>>
>>> +------------+-----------------+--------------------+
>>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6  |
>>> +------------+-----------------+--------------------+
>>> | Legitimate |        X        |         O          |
>>> | JDK11      |        X        |         O          |
>>> | Hadoop3    |        X        |         O          |
>>> | Hadoop2    |        O        |         O          |
>>> | Functions  |    Baseline     |        More        |
>>> | Bug fixes  |    Baseline     |        More        |
>>> +------------+-----------------+--------------------+
>>>
>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>> (including Jenkins/GitHubAction/AppVeyor).
>>>
>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>> to give more visibility to the whole community,
>>>
>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>> distribution
>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>> after `branch-3.0` branch cut.
>>>
>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>> But, it's time to prepare. Without them, we are going to be insufficient
>>> again and again.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian 
>>> wrote:
>>>
 Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
 minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
 buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
 referring both Hive 2.3.6 and 2.3.5 at the moment, see here
 
 and here
 
 .)

 Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
 Spark 3.0. For preview releases, I'm afraid that their visibility is not
 good enough for covering such major upgrades.

 On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
 wrote:

> Thank you for feedback, Hyujkjin and Sean.
>
> I proposed `preview-2` for that purpose but I'm also +1 for do that at
> 3.1
> if we can make a decision to eliminate the illegitimate Hive fork
> reference
> immediately after `branch-3.0` cut.
>
> Sean, I'm referencing Cheng Lian's email for the status of
> `hadoop-2.7`.
>
> -
> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>
> The way I see this is that it's not a user problem. Apache Spark
> community didn't try to drop the illegitimate Hive fork yet.
> We need to drop it by ourselves because we created it and it's our bad.
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>
>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>> works with hive-2.3? it isn't tied to hadoop-3.2?
>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>> 2.x, for end users using Hive via Spark?
>> I don't have a strong opinion, other than sharing the view that we
>> have to dump the Hive 1.x fork at the first opportunity.
>> Question is simply how much risk that entails. 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Hmm, what exactly did you mean by "remove the usage of forked `hive` in
Apache Spark 3.0 completely officially"? I thought you wanted to remove the
forked Hive 1.2 dependencies completely, no? As long as we still keep the
Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
particular preference between using Hive 1.2 or 2.3 as the default Hive
version. After all, for end-users and providers who need a particular
version combination, they can always build Spark with proper profiles
themselves.
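
For reference, such a custom build would look roughly like the sketch below,
assuming the profile names discussed in this thread (`hadoop-2.7`, `hadoop-3.2`,
`hive-1.2`, `hive-2.3`); the exact flags can differ per branch, so treat this
as illustrative rather than authoritative:

  # Sketch: build against Hadoop 2.7 with the Hive 2.3 code paths.
  ./build/mvn -Phadoop-2.7 -Phive -Phive-2.3 -Phive-thriftserver \
    -DskipTests clean package

  # Sketch: keep the forked Hive 1.2 behavior while that profile still exists.
  ./build/mvn -Phadoop-2.7 -Phive -Phive-1.2 -Phive-thriftserver \
    -DskipTests clean package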

And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
due to the folder name.

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun 
wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For the directory names, we use '1.2.1' and '2.3.5' because we just delayed
> renaming the directories until the 3.0.0 deadline to minimize the diff.
>
> We can replace it immediately if we want right now.
>
>
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>> If we do consider them, the comparison is as follows.
>>
>> +------------+-----------------+--------------------+
>> |            | Hive 1.2.1 fork | Apache Hive 2.3.6  |
>> +------------+-----------------+--------------------+
>> | Legitimate |        X        |         O          |
>> | JDK11      |        X        |         O          |
>> | Hadoop3    |        X        |         O          |
>> | Hadoop2    |        O        |         O          |
>> | Functions  |    Baseline     |        More        |
>> | Bug fixes  |    Baseline     |        More        |
>> +------------+-----------------+--------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>> (including Jenkins/GitHubAction/AppVeyor).
>>
>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>> to give more visibility to the whole community,
>>
>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>> distribution
>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>> after `branch-3.0` branch cut.
>>
>> I know that we have been reluctant to (1) and (2) due to its burden.
>> But, it's time to prepare. Without them, we are going to be insufficient
>> again and again.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian  wrote:
>>
>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
>>> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>> 
>>> and here
>>> 
>>> .)
>>>
>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>> good enough for covering such major upgrades.
>>>
>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you for feedback, Hyujkjin and Sean.

 I proposed `preview-2` for that purpose but I'm also +1 for do that at
 3.1
 if we can make a decision to eliminate the illegitimate Hive fork
 reference
 immediately after `branch-3.0` cut.

 Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.

 -
 https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E

 The way I see this is that it's not a user problem. Apache Spark
 community didn't try to drop the illegitimate Hive fork yet.
 We need to drop it by ourselves because we created it and it's our bad.

 Bests,
 Dongjoon.



 On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:

> Just to clarify, as even I have lost the details over time: hadoop-2.7
> works with hive-2.3? it isn't tied to hadoop-3.2?
> Roughly how much risk is there in using the Hive 1.x fork over Hive
> 2.x, for end users using Hive via Spark?
> I don't have a strong opinion, other than sharing the view that we
> have to dump the Hive 1.x fork at the first opportunity.
> Question is simply how much risk that entails. Keeping in mind that
> Spark 3.0 is already something that people understand works
> differently. We can accept some behavior changes.
>
> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
> >
> > Hi, All.
> >
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
> > Also, 

Re: Migration `Spark QA Compile` Jenkins jobs to GitHub Action

2019-11-19 Thread Shane Knapp
thanks sean!

i am all for moving these jobs to github actions, and will be doing
this 'soon' as i'm @ kubecon this week.

btw the R ecosystem definitely needs some attention, but
that's an issue for another time.  :)

On Tue, Nov 19, 2019 at 1:49 PM Sean Owen  wrote:
>
> I would favor moving whatever we can to Github. It's difficult to
> modify the Jenkins instances without Shane's valiant help, and over
> time makes more sense to modernize and integrate it into the project.
>
> On Tue, Nov 19, 2019 at 3:35 PM Dongjoon Hyun  wrote:
> >
> > Hi, All.
> >
> > Apache Spark community used the following dashboard as post-hook 
> > verifications.
> >
> > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> >
> > There are six registered jobs.
> >
> > 1. spark-branch-2.4-compile-maven-hadoop-2.6
> > 2. spark-branch-2.4-compile-maven-hadoop-2.7
> > 3. spark-branch-2.4-lint
> > 4. spark-master-compile-maven-hadoop-2.7
> > 5. spark-master-compile-maven-hadoop-3.2
> > 6. spark-master-lint
> >
> > Now, we added `GitHub Action` jobs. You can see the green check at every 
> > commit.
> >
> > https://github.com/apache/spark/commits/master
> > https://github.com/apache/spark/commits/branch-2.4
> >
> > If you click the green check, you can see the detail.
> > The followings are the example runs at the last commits on both branches.
> >
> > https://github.com/apache/spark/runs/310411948 (master)
> > https://github.com/apache/spark/runs/309522646 (branch-2.4)
> >
> > The new `GitHub Action` jobs cover more combinations than the old Jenkins jobs.
> >
> > - branch-2.4-scala-2.11-hadoop-2.6 (compile/package/install)
> > - branch-2.4-scala-2.12-hadoop-2.6 (compile/package/install)
> > - branch-2.4-scala-2.11-hadoop-2.7 (compile/package/install)
> > - branch-2.4-scala-2.12-hadoop-2.7 (compile/package/install)
> > - branch-2.4-linters (Scala/Java/Python/R)
> > - master-scala-2.12-hadoop-2.7 (compile/package/install)
> > - master-scala-2.12-hadoop-3.2 (compile/package/install)
> > - master-scala-2.12-hadoop-3.2-jdk11 (compile/package/install)
> > - master-linters (Scala/Java/Python/R)
> >
> > In addition, this is a part of Apache Spark code base and everyone can make 
> > contributions on this.
> >
> > Finally, as the last piece of this work, we are going to remove the legacy 
> > Jenkins jobs via the following JIRA issue.
> >
> > https://issues.apache.org/jira/browse/SPARK-29935
> >
> > Please let me know if you have any concerns on this.
> > (We can keep the legacy jobs, but two of them are already broken.)
> >
> > Bests,
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Migration `Spark QA Compile` Jenkins jobs to GitHub Action

2019-11-19 Thread Sean Owen
I would favor moving whatever we can to GitHub. It's difficult to
modify the Jenkins instances without Shane's valiant help, and over
time it makes more sense to modernize and integrate it into the project.

On Tue, Nov 19, 2019 at 3:35 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Apache Spark community used the following dashboard as post-hook 
> verifications.
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
>
> There are six registered jobs.
>
> 1. spark-branch-2.4-compile-maven-hadoop-2.6
> 2. spark-branch-2.4-compile-maven-hadoop-2.7
> 3. spark-branch-2.4-lint
> 4. spark-master-compile-maven-hadoop-2.7
> 5. spark-master-compile-maven-hadoop-3.2
> 6. spark-master-lint
>
> Now, we added `GitHub Action` jobs. You can see the green check at every 
> commit.
>
> https://github.com/apache/spark/commits/master
> https://github.com/apache/spark/commits/branch-2.4
>
> If you click the green check, you can see the detail.
> The followings are the example runs at the last commits on both branches.
>
> https://github.com/apache/spark/runs/310411948 (master)
> https://github.com/apache/spark/runs/309522646 (branch-2.4)
>
> The new `GitHub Action` jobs cover more combinations than the old Jenkins jobs.
>
> - branch-2.4-scala-2.11-hadoop-2.6 (compile/package/install)
> - branch-2.4-scala-2.12-hadoop-2.6 (compile/package/install)
> - branch-2.4-scala-2.11-hadoop-2.7 (compile/package/install)
> - branch-2.4-scala-2.12-hadoop-2.7 (compile/package/install)
> - branch-2.4-linters (Scala/Java/Python/R)
> - master-scala-2.12-hadoop-2.7 (compile/package/install)
> - master-scala-2.12-hadoop-3.2 (compile/package/install)
> - master-scala-2.12-hadoop-3.2-jdk11 (compile/package/install)
> - master-linters (Scala/Java/Python/R)
>
> In addition, this is a part of Apache Spark code base and everyone can make 
> contributions on this.
>
> Finally, as the last piece of this work, we are going to remove the legacy 
> Jenkins jobs via the following JIRA issue.
>
> https://issues.apache.org/jira/browse/SPARK-29935
>
> Please let me know if you have any concerns on this.
> (We can keep the legacy jobs, but two of them are already broken.)
>
> Bests,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Migration `Spark QA Compile` Jenkins jobs to GitHub Action

2019-11-19 Thread Dongjoon Hyun
Hi, All.

The Apache Spark community has used the following dashboard for post-hook
verification.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/

There are six registered jobs.

1. spark-branch-2.4-compile-maven-hadoop-2.6
2. spark-branch-2.4-compile-maven-hadoop-2.7
3. spark-branch-2.4-lint
4. spark-master-compile-maven-hadoop-2.7
5. spark-master-compile-maven-hadoop-3.2
6. spark-master-lint

Now, we added `GitHub Action` jobs. You can see the green check at every
commit.

https://github.com/apache/spark/commits/master
https://github.com/apache/spark/commits/branch-2.4

If you click the green check, you can see the details.
The following are example runs from the latest commits on both branches.

https://github.com/apache/spark/runs/310411948 (master)
https://github.com/apache/spark/runs/309522646 (branch-2.4)

The new `GitHub Action` jobs cover more combinations than the old Jenkins jobs.

- branch-2.4-scala-2.11-hadoop-2.6 (compile/package/install)
- branch-2.4-scala-2.12-hadoop-2.6 (compile/package/install)
- branch-2.4-scala-2.11-hadoop-2.7 (compile/package/install)
- branch-2.4-scala-2.12-hadoop-2.7 (compile/package/install)
- branch-2.4-linters (Scala/Java/Python/R)
- master-scala-2.12-hadoop-2.7 (compile/package/install)
- master-scala-2.12-hadoop-3.2 (compile/package/install)
- master-scala-2.12-hadoop-3.2-jdk11 (compile/package/install)
- master-linters (Scala/Java/Python/R)

In addition, this is part of the Apache Spark code base, and everyone can
contribute to it.

Finally, as the last piece of this work, we are going to remove the legacy
Jenkins jobs via the following JIRA issue.

https://issues.apache.org/jira/browse/SPARK-29935

Please let me know if you have any concerns on this.
(We can keep the legacy jobs, but two of them are already broken.)
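
To make the combinations above concrete, each matrix entry boils down to a
compile/package/install run; the actual steps live in the workflow files under
`.github/workflows`, so the commands below are only an illustrative sketch and
the exact profile names are assumptions:

  # Sketch: roughly what the branch-2.4-scala-2.11-hadoop-2.6 entry does.
  ./dev/change-scala-version.sh 2.11
  ./build/mvn -Pscala-2.11 -Phadoop-2.6 -Phive -DskipTests clean install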

Bests,
Dongjoon.


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.

For the directory names, we use '1.2.1' and '2.3.5' because we just delayed
renaming the directories until the 3.0.0 deadline to minimize the diff.

We can replace it immediately if we want right now.



On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun 
wrote:

> Hi, Cheng.
>
> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
> If we do consider them, the comparison is as follows.
>
> +------------+-----------------+--------------------+
> |            | Hive 1.2.1 fork | Apache Hive 2.3.6  |
> +------------+-----------------+--------------------+
> | Legitimate |        X        |         O          |
> | JDK11      |        X        |         O          |
> | Hadoop3    |        X        |         O          |
> | Hadoop2    |        O        |         O          |
> | Functions  |    Baseline     |        More        |
> | Bug fixes  |    Baseline     |        More        |
> +------------+-----------------+--------------------+
>
> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
> (including Jenkins/GitHubAction/AppVeyor).
>
> For me, AS-IS 3.0 is not enough for that. According to your advices,
> to give more visibility to the whole community,
>
> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
> distribution
> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
> after `branch-3.0` branch cut.
>
> I know that we have been reluctant to (1) and (2) due to its burden.
> But, it's time to prepare. Without them, we are going to be insufficient
> again and again.
>
> Bests,
> Dongjoon.
>
>
>
>
> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian  wrote:
>
>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
>> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>> 
>> and here
>> 
>> .)
>>
>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>> good enough for covering such major upgrades.
>>
>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for feedback, Hyujkjin and Sean.
>>>
>>> I proposed `preview-2` for that purpose but I'm also +1 for do that at
>>> 3.1
>>> if we can make a decision to eliminate the illegitimate Hive fork
>>> reference
>>> immediately after `branch-3.0` cut.
>>>
>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>>
>>> -
>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>
>>> The way I see this is that it's not a user problem. Apache Spark
>>> community didn't try to drop the illegitimate Hive fork yet.
>>> We need to drop it by ourselves because we created it and it's our bad.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>>>
 Just to clarify, as even I have lost the details over time: hadoop-2.7
 works with hive-2.3? it isn't tied to hadoop-3.2?
 Roughly how much risk is there in using the Hive 1.x fork over Hive
 2.x, for end users using Hive via Spark?
 I don't have a strong opinion, other than sharing the view that we
 have to dump the Hive 1.x fork at the first opportunity.
 Question is simply how much risk that entails. Keeping in mind that
 Spark 3.0 is already something that people understand works
 differently. We can accept some behavior changes.

 On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
 wrote:
 >
 > Hi, All.
 >
 > First of all, I want to put this as a policy issue instead of a
 technical issue.
 > Also, this is orthogonal from `hadoop` version discussion.
 >
 > Apache Spark community kept (not maintained) the forked Apache Hive
 > 1.2.1 because there has been no other options before. As we see at
 > SPARK-20202, it's not a desirable situation among the Apache projects.
 >
 > https://issues.apache.org/jira/browse/SPARK-20202
 >
 > Also, please note that we `kept`, not `maintained`, because we know
 it's not good.
 > There are several attempt to update that forked repository
 > for several reasons (Hadoop 3 support is one of the example),
 > but those attempts are also turned down.
 >
 > From Apache Spark 3.0, it seems that we have a new feasible option
 > `hive-2.3` profile. What about moving forward in this direction
 further?
 >

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Hi, Cheng.

This is independent of JDK11 and Hadoop 3; I'm talking about the JDK8 world.
If we do consider them, the comparison is as follows.

+------------+-----------------+--------------------+
|            | Hive 1.2.1 fork | Apache Hive 2.3.6  |
+------------+-----------------+--------------------+
| Legitimate |        X        |         O          |
| JDK11      |        X        |         O          |
| Hadoop3    |        X        |         O          |
| Hadoop2    |        O        |         O          |
| Functions  |    Baseline     |        More        |
| Bug fixes  |    Baseline     |        More        |
+------------+-----------------+--------------------+

To stabilize Spark's Hive 2.3 usage, we should use it ourselves
(including Jenkins/GitHub Actions/AppVeyor).

For me, AS-IS 3.0 is not enough for that. Following your advice,
and to give more visibility to the whole community:

1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built
distribution.
2. We need to switch our default Hive usage to 2.3 in `master` for 3.1,
after the `branch-3.0` branch cut.

I know that we have been reluctant to do (1) and (2) because of the burden.
But it's time to prepare. Without them, we will keep falling short
again and again.
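
As a rough illustration of what (1) could look like, assuming the existing
`dev/make-distribution.sh` script and the profile names discussed above (the
exact options may differ):

  # Sketch: produce a hadoop-2.7 + Hive 2.3 pre-built distribution tarball.
  ./dev/make-distribution.sh --name hadoop2.7-hive2.3 --tgz \
    -Phadoop-2.7 -Phive -Phive-2.3 -Phive-thriftserver -Pyarn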

Bests,
Dongjoon.




On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian  wrote:

> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
> 
> and here
> 
> .)
>
> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
> and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
> 3.0. For preview releases, I'm afraid that their visibility is not good
> enough for covering such major upgrades.
>
> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for feedback, Hyujkjin and Sean.
>>
>> I proposed `preview-2` for that purpose but I'm also +1 for do that at 3.1
>> if we can make a decision to eliminate the illegitimate Hive fork
>> reference
>> immediately after `branch-3.0` cut.
>>
>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>
>> -
>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>
>> The way I see this is that it's not a user problem. Apache Spark
>> community didn't try to drop the illegitimate Hive fork yet.
>> We need to drop it by ourselves because we created it and it's our bad.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>>
>>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>> 2.x, for end users using Hive via Spark?
>>> I don't have a strong opinion, other than sharing the view that we
>>> have to dump the Hive 1.x fork at the first opportunity.
>>> Question is simply how much risk that entails. Keeping in mind that
>>> Spark 3.0 is already something that people understand works
>>> differently. We can accept some behavior changes.
>>>
>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > First of all, I want to put this as a policy issue instead of a
>>> technical issue.
>>> > Also, this is orthogonal from `hadoop` version discussion.
>>> >
>>> > Apache Spark community kept (not maintained) the forked Apache Hive
>>> > 1.2.1 because there has been no other options before. As we see at
>>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-20202
>>> >
>>> > Also, please note that we `kept`, not `maintained`, because we know
>>> it's not good.
>>> > There are several attempt to update that forked repository
>>> > for several reasons (Hadoop 3 support is one of the example),
>>> > but those attempts are also turned down.
>>> >
>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>> > `hive-2.3` profile. What about moving forward in this direction
>>> further?
>>> >
>>> > For example, can we remove the usage of forked `hive` in Apache Spark
>>> 3.0
>>> > completely officially? If someone still needs to use the forked
>>> `hive`, we can
>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>> profile in the community.
>>> >
>>> > I want to say this is a goal we should achieve someday.
>>> > If we don't do anything, nothing happen. At least we need to prepare
>>> this.
>>> > Without any preparation, Spark 3.1+ will be 

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
referring to both Hive 2.3.6 and 2.3.5 at the moment, see here

and here

.)

Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
3.0. For preview releases, I'm afraid that their visibility is not good
enough for covering such major upgrades.

On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun 
wrote:

> Thank you for feedback, Hyujkjin and Sean.
>
> I proposed `preview-2` for that purpose but I'm also +1 for do that at 3.1
> if we can make a decision to eliminate the illegitimate Hive fork reference
> immediately after `branch-3.0` cut.
>
> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>
> -
> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>
> The way I see this is that it's not a user problem. Apache Spark community
> didn't try to drop the illegitimate Hive fork yet.
> We need to drop it by ourselves because we created it and it's our bad.
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:
>
>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>> works with hive-2.3? it isn't tied to hadoop-3.2?
>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>> 2.x, for end users using Hive via Spark?
>> I don't have a strong opinion, other than sharing the view that we
>> have to dump the Hive 1.x fork at the first opportunity.
>> Question is simply how much risk that entails. Keeping in mind that
>> Spark 3.0 is already something that people understand works
>> differently. We can accept some behavior changes.
>>
>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > First of all, I want to put this as a policy issue instead of a
>> technical issue.
>> > Also, this is orthogonal from `hadoop` version discussion.
>> >
>> > Apache Spark community kept (not maintained) the forked Apache Hive
>> > 1.2.1 because there has been no other options before. As we see at
>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-20202
>> >
>> > Also, please note that we `kept`, not `maintained`, because we know
>> it's not good.
>> > There are several attempt to update that forked repository
>> > for several reasons (Hadoop 3 support is one of the example),
>> > but those attempts are also turned down.
>> >
>> > From Apache Spark 3.0, it seems that we have a new feasible option
>> > `hive-2.3` profile. What about moving forward in this direction further?
>> >
>> > For example, can we remove the usage of forked `hive` in Apache Spark
>> 3.0
>> > completely officially? If someone still needs to use the forked `hive`,
>> we can
>> > have a profile `hive-1.2`. Of course, it should not be a default
>> profile in the community.
>> >
>> > I want to say this is a goal we should achieve someday.
>> > If we don't do anything, nothing happen. At least we need to prepare
>> this.
>> > Without any preparation, Spark 3.1+ will be the same.
>> >
>> > Shall we focus on what are our problems with Hive 2.3.6?
>> > If the only reason is that we didn't use it before, we can release
>> another
>> > `3.0.0-preview` for that.
>> >
>> > Bests,
>> > Dongjoon.
>>
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Thank you for the feedback, Hyukjin and Sean.

I proposed `preview-2` for that purpose, but I'm also +1 for doing that at 3.1
if we can make a decision to eliminate the illegitimate Hive fork reference
immediately after the `branch-3.0` cut.

Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.

-
https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E

The way I see this is that it's not a user problem. Apache Spark community
didn't try to drop the illegitimate Hive fork yet.
We need to drop it by ourselves because we created it and it's our bad.

Bests,
Dongjoon.



On Tue, Nov 19, 2019 at 5:06 AM Sean Owen  wrote:

> Just to clarify, as even I have lost the details over time: hadoop-2.7
> works with hive-2.3? it isn't tied to hadoop-3.2?
> Roughly how much risk is there in using the Hive 1.x fork over Hive
> 2.x, for end users using Hive via Spark?
> I don't have a strong opinion, other than sharing the view that we
> have to dump the Hive 1.x fork at the first opportunity.
> Question is simply how much risk that entails. Keeping in mind that
> Spark 3.0 is already something that people understand works
> differently. We can accept some behavior changes.
>
> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
> > Also, this is orthogonal from `hadoop` version discussion.
> >
> > Apache Spark community kept (not maintained) the forked Apache Hive
> > 1.2.1 because there has been no other options before. As we see at
> > SPARK-20202, it's not a desirable situation among the Apache projects.
> >
> > https://issues.apache.org/jira/browse/SPARK-20202
> >
> > Also, please note that we `kept`, not `maintained`, because we know it's
> not good.
> > There are several attempt to update that forked repository
> > for several reasons (Hadoop 3 support is one of the example),
> > but those attempts are also turned down.
> >
> > From Apache Spark 3.0, it seems that we have a new feasible option
> > `hive-2.3` profile. What about moving forward in this direction further?
> >
> > For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> > completely officially? If someone still needs to use the forked `hive`,
> we can
> > have a profile `hive-1.2`. Of course, it should not be a default profile
> in the community.
> >
> > I want to say this is a goal we should achieve someday.
> > If we don't do anything, nothing happen. At least we need to prepare
> this.
> > Without any preparation, Spark 3.1+ will be the same.
> >
> > Shall we focus on what are our problems with Hive 2.3.6?
> > If the only reason is that we didn't use it before, we can release
> another
> > `3.0.0-preview` for that.
> >
> > Bests,
> > Dongjoon.
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Sean Owen
Just to clarify, as even I have lost the details over time: hadoop-2.7
works with hive-2.3? it isn't tied to hadoop-3.2?
Roughly how much risk is there in using the Hive 1.x fork over Hive
2.x, for end users using Hive via Spark?
I don't have a strong opinion, other than sharing the view that we
have to dump the Hive 1.x fork at the first opportunity.
Question is simply how much risk that entails. Keeping in mind that
Spark 3.0 is already something that people understand works
differently. We can accept some behavior changes.
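
One mitigating factor for end users, at least on the metastore side, is that
Spark can already talk to an older Hive metastore regardless of the built-in
execution Hive, via configuration. A rough sketch (the config values here are
illustrative, not a statement about defaults):

  # Sketch: target a Hive 1.2.1 metastore even when Spark bundles Hive 2.3.
  ./bin/spark-shell \
    --conf spark.sql.hive.metastore.version=1.2.1 \
    --conf spark.sql.hive.metastore.jars=maven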

On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> First of all, I want to put this as a policy issue instead of a technical 
> issue.
> Also, this is orthogonal from `hadoop` version discussion.
>
> Apache Spark community kept (not maintained) the forked Apache Hive
> 1.2.1 because there has been no other options before. As we see at
> SPARK-20202, it's not a desirable situation among the Apache projects.
>
> https://issues.apache.org/jira/browse/SPARK-20202
>
> Also, please note that we `kept`, not `maintained`, because we know it's not 
> good.
> There were several attempts to update that forked repository
> for several reasons (Hadoop 3 support is one example),
> but those attempts were also turned down.
>
> From Apache Spark 3.0, it seems that we have a new feasible option
> `hive-2.3` profile. What about moving forward in this direction further?
>
> For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> completely officially? If someone still needs to use the forked `hive`, we can
> have a profile `hive-1.2`. Of course, it should not be a default profile in 
> the community.
>
> I want to say this is a goal we should achieve someday.
> If we don't do anything, nothing will happen. At least we need to prepare for this.
> Without any preparation, Spark 3.1+ will be the same.
>
> Shall we focus on what our problems with Hive 2.3.6 are?
> If the only reason is that we didn't use it before, we can release another
> `3.0.0-preview` for that.
>
> Bests,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Ask for ARM CI for spark

2019-11-19 Thread bo zhaobo
Hi @Sean Owen  ,

Thanks for your reply and patience.
First, we apologize for the poor wording in the previous emails. We
just want users to be able to see the current support status somewhere
in the Spark community. I really appreciate that you and the Spark
community are making Spark better on ARM and are open to us.
Second, you're correct that adding this kind of CI/CD information to the
release notes is not appropriate, so we won't do that.
Third, I think your suggestion fits the current situation. We will follow
it and send an email to user@ and dev@ describing our
testing results, including the test coverage and the known issues we found.
We also hope that doing this will help attract more Spark
users/developers to ARM.
Fourth, we are still concerned that users cannot easily tell which
architectures Spark is tested on with the generic test suite. So users will
keep asking the same question, as huangtianhua mentioned in the Spark
community, even when they plan to build/test on a specific architecture. We
are not sure whether the community has a good way to resolve this. Just a
suggestion from us: how about describing the current testing status (all test
statuses) of Amplab somewhere in Spark? Then users could see which setups are
already pretested with the generic test cases in the upstream community,
and feel confident using Spark wherever they want based on
that information.

In the end, we trust the community and will follow its suggestions.
Please feel free to give us feedback on our work, and you are welcome to work
together with us on Spark on ARM. If any issue hits on ARM, please also @us to
discuss. ;-)

Thank you @Sean Owen  and team.

BR

ZhaoBo


Sean Owen  wrote on Mon, Nov 18, 2019 at 10:06 AM:

> Same response as before:
>
> - It is in the list of resolved JIRAs, of course
> - It (largely) worked previously
> - I think you're also saying you don't have 100% tests passing anyway,
> though probably just small issues
> - It does not seem to merit a special announcement from the PMC among
> the 2000+ changes in Spark 3
> - You are welcome to announce (on the project's user@ list if you
> like) whatever you want. Obviously, this is already well advertised on
> dev@
>
> I think you are asking for what borders on endorsement, and no that
> doesn't sound appropriate. Please just announce whatever you like as
> suggested.
>
> Sean
>
> On Sun, Nov 17, 2019 at 8:01 PM Tianhua huang 
> wrote:
> >
> > @Sean Owen,
> > I'm afraid I don't agree with you this time. I still remember that no one
> could tell me whether Spark supports ARM, or how well Spark supports ARM, when
> I first asked on Dev@; you were very kind and told me to build and
> test on ARM locally, and I think you were not very sure about this
> at that moment, right? Then my team and I worked with the community; we
> found/fixed several issues, integrated ARM jobs into AMPLAB Jenkins, and the
> daily jobs have been running stably for a few weeks... After these efforts,
> why not announce this officially in the Spark release notes? I believe then
> everyone will know Spark is fully tested on ARM on the community CI and that
> Spark basically supports ARM; it's amazing and this will be very helpful. So what
> do you think? Or what are you worried about?
>