Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Cheng Lian
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
seemed risky, and therefore we only introduced Hive 2.3 under the
hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
here...

Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
about demand but about risk control: coupling the Hive 2.3, Hadoop 3.2, and
JDK 11 upgrades together looks too risky.

On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:

> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
> than introduce yet another build combination. Does Hadoop 2 + Hive 2
> work and is there demand for it?
>
> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
> >
> > Do we have a limitation on the number of pre-built distributions? Seems
> this time we need
> > 1. hadoop 2.7 + hive 1.2
> > 2. hadoop 2.7 + hive 2.3
> > 3. hadoop 3 + hive 2.3
> >
> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
> don't need to add JDK version to the combination.
> >
> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
> wrote:
> >>
> >> Thank you for the suggestion.
> >>
> >> Having a `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> >> IIRC, it was originally proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
> >>
> >> And, I'm wondering if you are considering additional pre-built
> distributions and Jenkins jobs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-16 Thread Hyukjin Kwon
DisplayName looks good in general, but here I would first like to find an
existing pattern to document in the guidelines, given the actual existing
practice we are all used to. I'm trying to be very conservative since these
guidelines affect everybody.

I think it might be better to discuss it separately if we want to change
what we have been used to.

Also, using arbitrary names might not actually be free, due to bugs like
https://github.com/apache/spark/pull/25630 . It will need some more effort
to investigate as well.

On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
wrote:

>  Junit5: Display names.
>
> Goes all the way to the XML.
>
>
> https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
>
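For reference, a minimal sketch of what JUnit 5's DisplayName looks like from a Scala test class (assuming junit-jupiter 5.x is on the test classpath; the class, method, and JIRA ID below are hypothetical):

```scala
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.{DisplayName, Test}

// Hypothetical example: the display name, not the method name, is what
// shows up in the report and in the generated XML.
class SeqSumTest {

  @Test
  @DisplayName("SPARK-12345: sum of an empty sequence is zero")
  def sumOfEmptySequence(): Unit = {
    assertEquals(0, Seq.empty[Int].sum)
  }
}
```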
> On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Should we also add a guideline for non-Scala tests? Other languages
>> (Java, Python, R) don't support using a string as a test name.
>>
>> Best Regards,
>> Ryan
>>
>>
>> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon  wrote:
>>
>>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>>
>>> On Wed, Nov 13, 2019 at 10:43 AM Hyukjin Kwon  wrote:
>>>
 > In general a test should be self-descriptive, and I don't think we
 should be adding JIRA ticket references wholesale. Any action that the
 reader has to take to understand why a test was introduced is one too many.
 However, in some cases the thing we are trying to test is very subtle, and
 in that case a reference to a JIRA ticket might be useful. I do still feel
 that this should be a backstop and that properly documenting your tests is
 a much better way of dealing with this.

 Yeah, the test should be self-descriptive. I don't think adding a JIRA
 prefix harms this point. Probably I should add this sentence to the
 guidelines as well.
 Adding a JIRA prefix just adds one extra hint for tracking down details. I
 think it's fine to stick to this practice and make it simpler and clearer
 to follow.

 > 1. what if multiple JIRA IDs relate to the same test? Do we just take
 the very first JIRA ID?
 Ideally, one JIRA should describe one issue, and one PR should fix one
 JIRA with a dedicated test.
 Yeah, I think I would take the very first JIRA ID.

 > 2. are we going to have a full scan of all existing tests and attach
 a JIRA ID to each?
 Yea, let's not do this.

 > It's a nice-to-have, not super essential, just because ...
 It's been asked multiple times, and each committer seems to have a
 different understanding of this.
 It's not a biggie, but I wanted to make it clear and conclude this.

 > I'd add this only when a test specifically targets a certain issue.
 Yes, so about this one I am not sure. From what I heard, people add the
 JIRA ID in the cases below:

 - Whenever the JIRA type is a bug
 - When a PR adds a couple of tests
 - Only when a test specifically targets a certain issue.
 - ...

 Which one do we prefer, and which is simpler to follow?

 Or I can combine them as below (I'm going to reword this when I actually
 document it):
 1. In general, we should add a JIRA ID as the prefix of a test name when
 a PR targets a specific issue.
 In practice, this usually happens when the JIRA type is a bug or a PR
 adds a couple of tests.
 2. Use the "SPARK-: test name" format.

 If we have no objection to ^, let me go with this.
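To make the proposed format concrete, here is a minimal sketch in plain ScalaTest (assuming ScalaTest 3.1+'s AnyFunSuite; the suite, JIRA ID, and assertion are hypothetical, and Spark's own suites would extend SparkFunSuite instead):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical suite illustrating the "SPARK-XXXXX: test name" prefix.
class ExampleRegressionSuite extends AnyFunSuite {

  // The JIRA ID of the issue the PR fixes prefixes the test name, so the
  // fix can be traced back from a test failure or the git history.
  test("SPARK-12345: sum of an empty sequence is zero") {
    assert(Seq.empty[Int].sum == 0)
  }
}
```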

 On Wed, Nov 13, 2019 at 8:14 AM Sean Owen  wrote:

> Let's suggest "SPARK-12345:" but not go back and change a bunch of
> test cases.
> I'd add this only when a test specifically targets a certain issue.
> It's a nice-to-have, not super essential, just because in the rare
> case you need to understand why a test asserts something, you can go
> back and find what added it in the git history without much trouble.
>
> On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
> wrote:
> >
> > Hi all,
> >
> > Maybe it's not a big deal, but it has brought some confusion from time to
> time into Spark dev and the community. I think it's time to discuss when,
> and in which format, to add a JIRA ID as a prefix for the test case name in
> Scala test cases.
> >
> > Currently we have many test case names with prefixes as below:
> >
> > test("SPARK-X blah blah")
> > test("SPARK-X: blah blah")
> > test("SPARK-X - blah blah")
> > test("[SPARK-X] blah blah")
> > …
> >
> > It is a good practice to have the JIRA ID in general because, for
> instance,
> > it takes less effort to track commit histories (even when the files
> > are moved entirely), or to track related information about failed tests.
> > Considering Spark is getting big, I think it's good to document this.
> >
> > I would like to suggest this and document it in our guideline:
> >
> > 1. Add a 

Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview

2019-11-16 Thread Nicholas Chammas
> Data Source API with Catalog Supports

Where can we read more about this? The linked Nabble thread doesn't mention
the word "Catalog".

On Thu, Nov 7, 2019 at 5:53 PM Xingbo Jiang  wrote:

> Hi all,
>
> To enable wide-scale community testing of the upcoming Spark 3.0 release,
> the Apache Spark community has posted a preview release of Spark 3.0. This
> preview is *not a stable release in terms of either API or functionality*,
> but it is meant to give the community early access to try the code that
> will become Spark 3.0. If you would like to test the release, please
> download it, and send feedback using either the mailing lists or JIRA.
>
> There are a lot of exciting new features added to Spark 3.0, including
> Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware
> Scheduling, Data Source API with Catalog Supports, Vectorization in SparkR,
> support of Hadoop 3/JDK 11/Scala 2.12, and many more. For a full list of
> major features and changes in Spark 3.0.0-preview, please check the thread (
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-feature-list-and-major-changes-td28050.html
> ).
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0-preview, head over to the download page:
> https://archive.apache.org/dist/spark/spark-3.0.0-preview
>
> Thanks,
>
> Xingbo
>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Sean Owen
I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
than introduce yet another build combination. Does Hadoop 2 + Hive 2
work and is there demand for it?

On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
>
> Do we have a limitation on the number of pre-built distributions? Seems this 
> time we need
> 1. hadoop 2.7 + hive 1.2
> 2. hadoop 2.7 + hive 2.3
> 3. hadoop 3 + hive 2.3
>
> AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't 
> need to add JDK version to the combination.
>
> On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun  wrote:
>>
>> Thank you for the suggestion.
>>
>> Having a `hive-2.3` profile sounds good to me because it's orthogonal to
>> Hadoop 3.
>> IIRC, it was originally proposed in that way, but we put it under
>> `hadoop-3.2` to avoid adding new profiles at that time.
>>
>> And, I'm wondering if you are considering additional pre-built distributions
>> and Jenkins jobs.
>>
>> Bests,
>> Dongjoon.
>>




Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Wenchen Fan
Do we have a limitation on the number of pre-built distributions? Seems
this time we need
1. hadoop 2.7 + hive 1.2
2. hadoop 2.7 + hive 2.3
3. hadoop 3 + hive 2.3

AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't
need to add JDK version to the combination.

On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
wrote:

> Thank you for the suggestion.
>
> Having a `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> IIRC, it was originally proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
>
> And, I'm wondering if you are considering additional pre-built
> distributions and Jenkins jobs.
>
> Bests,
> Dongjoon.
>
>
>
> On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian  wrote:
>
>> Cc Yuming, Steve, and Dongjoon
>>
>> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian 
>> wrote:
>>
>>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>>> Hadoop version is quality control. The current hadoop-3.2 profile
>>> covers too many major component upgrades, i.e.:
>>>
>>>- Hadoop 3.2
>>>- Hive 2.3
>>>- JDK 11
>>>
>>> We have already found and fixed some feature and performance regressions
>>> related to these upgrades. Empirically, I wouldn't be surprised at all if
>>> more regressions are lurking somewhere. On the other hand, we do want the
>>> community's help to evaluate and stabilize these new changes.
>>> Following that, I’d like to propose:
>>>
>>>1.
>>>
>>>Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>>Hadoop/Hive/JDK version combinations.
>>>
>>>This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>>profile, so that users may try out some less risky Hadoop/Hive/JDK
>>>combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>>face potential regressions introduced by the Hadoop 3.2 upgrade.
>>>
>>>Yuming Wang has already sent out PR #26533
>>> to exercise the Hadoop
>>>2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>>hive-2.3 profile yet), and the result looks promising: the Kafka
>>>streaming and Arrow related test failures should be irrelevant to the 
>>> topic
>>>discussed here.
>>>
>>>After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a
>>>lot of difference between having Hadoop 2.7 or Hadoop 3.2 as the default
>>>Hadoop version. For users who are still using Hadoop 2.x in production,
>>>they will have to use a hadoop-provided prebuilt package or build
>>>Spark 3.0 against their own 2.x version anyway. It does make a difference
>>>for cloud users who don’t use Hadoop at all, though. And this probably 
>>> also
>>>helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>>will exercise it regularly.
>>>2.
>>>
>>>Defer Hadoop 2.x upgrade to Spark 3.1+
>>>
>>>I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>>2.10. Steve has already stated the benefits very well. My worry here is
>>>still quality control: Spark 3.0 has already had tons of changes and 
>>> major
>>>component version upgrades that are subject to all kinds of known and
>>>hidden regressions. Having Hadoop 2.7 there provides us a safety net, 
>>> since
>>>it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 
>>> 2.7
>>>to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in 
>>> the
>>>next 1 or 2 Spark 3.x releases.
>>>
>>> Cheng
>>>
>>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:
>>>
 I get that CDH and HDP backport a lot and in that way left 2.7 behind,
 but they kept the public APIs stable at the 2.7 level, because that's kind
 of the point. Aren't those the Hadoop APIs Spark uses?

 On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
  wrote:

>
>
> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>  wrote:
>>
>>> It would be really good if the spark distributions shipped with
>>> later versions of the hadoop artifacts.
>>>
>>
>> I second this. If we need to keep a Hadoop 2.x profile around, why
>> not make it Hadoop 2.8 or something newer?
>>
>
> go for 2.9
>
>>
>> Koert Kuipers  wrote:
>>
>>> given that the latest HDP 2.x is still on Hadoop 2.7, bumping the Hadoop 2
>>> profile to the latest would probably be an issue for us.
>>
>>
>> When was the last time HDP 2.x bumped their minor version of Hadoop?
>> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>
>
> The internal builds of CDH and HDP are not those of ASF 2.7.x. A
> really large proportion of the later branch-2 patches are backported. 2.7
> was left behind a long time ago.
>
>
>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Dongjoon Hyun
Thank you for the suggestion.

Having a `hive-2.3` profile sounds good to me because it's orthogonal to
Hadoop 3.
IIRC, it was originally proposed in that way, but we put it under
`hadoop-3.2` to avoid adding new profiles at that time.

And, I'm wondering if you are considering additional pre-built distributions
and Jenkins jobs.

Bests,
Dongjoon.



On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian  wrote:

> Cc Yuming, Steve, and Dongjoon
>
> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian  wrote:
>
>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>> Hadoop version is quality control. The current hadoop-3.2 profile covers
>> too many major component upgrades, i.e.:
>>
>>- Hadoop 3.2
>>- Hive 2.3
>>- JDK 11
>>
>> We have already found and fixed some feature and performance regressions
>> related to these upgrades. Empirically, I wouldn't be surprised at all if
>> more regressions are lurking somewhere. On the other hand, we do want the
>> community's help to evaluate and stabilize these new changes.
>> Following that, I’d like to propose:
>>
>>1.
>>
>>Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>Hadoop/Hive/JDK version combinations.
>>
>>This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>profile, so that users may try out some less risky Hadoop/Hive/JDK
>>combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>face potential regressions introduced by the Hadoop 3.2 upgrade.
>>
>>Yuming Wang has already sent out PR #26533
>> to exercise the Hadoop
>>2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>hive-2.3 profile yet), and the result looks promising: the Kafka
>>streaming and Arrow related test failures should be irrelevant to the 
>> topic
>>discussed here.
>>
>>After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a
>>lot of difference between having Hadoop 2.7 or Hadoop 3.2 as the default
>>Hadoop version. For users who are still using Hadoop 2.x in production,
>>they will have to use a hadoop-provided prebuilt package or build
>>Spark 3.0 against their own 2.x version anyway. It does make a difference
>>for cloud users who don’t use Hadoop at all, though. And this probably 
>> also
>>helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>will exercise it regularly.
>>2.
>>
>>Defer Hadoop 2.x upgrade to Spark 3.1+
>>
>>I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>2.10. Steve has already stated the benefits very well. My worry here is
>>still quality control: Spark 3.0 has already had tons of changes and major
>>component version upgrades that are subject to all kinds of known and
>>hidden regressions. Having Hadoop 2.7 there provides us a safety net, 
>> since
>>it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 
>> 2.7
>>to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>>next 1 or 2 Spark 3.x releases.
>>
>> Cheng
>>
>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:
>>
>>> I get that CDH and HDP backport a lot and in that way left 2.7 behind,
>>> but they kept the public APIs stable at the 2.7 level, because that's kind
>>> of the point. Aren't those the Hadoop APIs Spark uses?
>>>
>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>>  wrote:
>>>


 On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>  wrote:
>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>
> I second this. If we need to keep a Hadoop 2.x profile around, why not
> make it Hadoop 2.8 or something newer?
>

 go for 2.9

>
> Koert Kuipers  wrote:
>
>> given that the latest HDP 2.x is still on Hadoop 2.7, bumping the Hadoop 2
>> profile to the latest would probably be an issue for us.
>
>
> When was the last time HDP 2.x bumped their minor version of Hadoop?
> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>

 The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
 large proportion of the later branch-2 patches are backported. 2.7 was left
 behind a long time ago.




>>>