Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Felix Cheung
this is about test description and not test file name right?

if yes I don’t see a problem.


From: Hyukjin Kwon 
Sent: Thursday, November 14, 2019 6:03:02 PM
To: Shixiong(Ryan) Zhu 
Cc: dev ; Felix Cheung ; 
Shivaram Venkataraman 
Subject: Re: Adding JIRA ID as the prefix for the test case name

Yeah, sounds good to have it.

In the case of R, it seems not quite common to write down the JIRA ID [1], but 
it looks like some tests do have the prefix in their names.
In the case of Python and Java, it seems we write a JIRA ID from time to time 
in the comment right under the test method [2][3].

Given this pattern, I would like to suggest using the same format, but:

1. For Python and Java, write a single comment that starts with JIRA ID and 
short description, e.g. (SPARK-X: test blah blah)
2. For R, use JIRA ID as a prefix for its test name.

[1] git grep -r "SPARK-" -- '*test*.R'
[2] git grep -r "SPARK-" -- '*Suite.java'
[3] git grep -r "SPARK-" -- '*test*.py'
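As a quick illustration of the suggested format, here is a hedged sketch in Python (SPARK-12345 is a placeholder ID, and has_jira_prefix is just an illustrative helper, not something in the Spark code base):

```python
import re

# The suggested convention: "SPARK-<id>: test name".
# The same string would appear inside test("...") on the Scala side,
# and in a comment right under the test method for Python/Java.
JIRA_PREFIX = re.compile(r"^SPARK-\d+: \S")

def has_jira_prefix(test_name: str) -> bool:
    """Return True if a test name follows the 'SPARK-<id>: description' form."""
    return JIRA_PREFIX.match(test_name) is not None

# Of the variants currently found in the code base, only the second
# follows the suggested format:
print(has_jira_prefix("SPARK-12345 blah blah"))    # False (no colon)
print(has_jira_prefix("SPARK-12345: blah blah"))   # True
print(has_jira_prefix("SPARK-12345 - blah blah"))  # False (dash separator)
print(has_jira_prefix("[SPARK-12345] blah blah"))  # False (bracketed)
```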

Does that make sense? Adding Felix and Shivaram too.


On Fri, Nov 15, 2019 at 3:13 AM, Shixiong(Ryan) Zhu 
<shixi...@databricks.com> wrote:
Should we also add a guideline for non Scala tests? Other languages (Java, 
Python, R) don't support using string as a test name.

Best Regards,

Ryan


On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
I opened a PR - https://github.com/apache/spark-website/pull/231

On Wed, Nov 13, 2019 at 10:43 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> In general a test should be self descriptive and I don't think we should be 
> adding JIRA ticket references wholesale. Any action that the reader has to 
> take to understand why a test was introduced is one too many. However in some 
> cases the thing we are trying to test is very subtle and in that case a 
> reference to a JIRA ticket might be useful, I do still feel that this should 
> be a backstop and that properly documenting your tests is a much better way 
> of dealing with this.

Yeah, the test should be self-descriptive. I don't think adding a JIRA prefix 
harms this point. Probably I should add this sentence to the guidelines as well.
Adding a JIRA prefix just adds one extra hint for tracking down details. I think 
it's fine to stick to this practice and make it simpler and clearer to follow.

> 1. what if multiple JIRA IDs relating to the same test? we just take the very 
> first JIRA ID?
Ideally one JIRA should describe one issue and one PR should fix one JIRA with 
a dedicated test.
Yeah, I think I would take the very first JIRA ID.

> 2. are we going to have a full scan of all existing tests and attach a JIRA 
> ID to it?
Yeah, let's not do this.

> It's a nice-to-have, not super essential, just because ...
It's been asked multiple times, and each committer seems to have a different 
understanding of this.
It's not a biggie, but I wanted to make it clear and conclude this.

> I'd add this only when a test specifically targets a certain issue.
Yes, this one I am not sure about. From what I've heard, people add the JIRA in 
the cases below:

- Whenever the JIRA type is a bug
- When a PR adds a couple of tests
- Only when a test specifically targets a certain issue.
- ...

Which one do we prefer, and which is simpler to follow?

Or I can combine them as below (I'm going to reword this when I actually 
document it):
1. In general, we should add a JIRA ID as a prefix of a test name when a PR 
targets a specific issue.
In practice, this usually happens when the JIRA type is a bug or a PR adds a 
couple of tests.
2. Use the "SPARK-: test name" format.

If we have no objection with ^, let me go with this.

On Wed, Nov 13, 2019 at 8:14 AM, Sean Owen <sro...@gmail.com> wrote:
Let's suggest "SPARK-12345:" but not go back and change a bunch of test cases.
I'd add this only when a test specifically targets a certain issue.
It's a nice-to-have, not super essential, just because in the rare
case you need to understand why a test asserts something, you can go
back and find what added it in the git history without much trouble.

On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi all,
>
> Maybe it's not a big deal, but it has brought some confusion into the Spark 
> dev community from time to time. I think it's time to discuss when/which 
> format to use when adding a JIRA ID as a prefix to a test case name in Scala 
> test cases.
>
> Currently we have many test case names with prefixes as below:
>
> test("SPARK-X blah blah")
> test("SPARK-X: blah blah")
> test("SPARK-X - blah blah")
> test("[SPARK-X] blah blah")
> …
>
> It is a good practice to have the JIRA ID in general because, for instance,
> it takes less effort to track commit histories (even when the files are
> moved entirely), or to track information related to failed tests.
> Considering Spark is getting big, I think it's good to document this.
>
> I would like to suggest this and document it in our guideline:
>
> 1. Add a prefix into a test name when a PR adds a 

Re: Does StreamingSymmetricHashJoinExec work with watermark? I don't think so

2019-11-14 Thread Jungtaek Lim
Jacek,

would you mind sharing the query that reproduces this? I'm not sure I follow
you without an example of it "not working".

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Nov 12, 2019 at 12:04 AM Jacek Laskowski  wrote:

> Hi,
>
> I think watermark does not work for StreamingSymmetricHashJoinExec because
> of the following:
>
> 1. leftKeys and rightKeys have no spark.watermarkDelayMs metadata entry at
> planning [1]
> 2. Since the left and right keys had no watermark delay at planning the
> code [2] won't find it at execution
>
> Is my understanding correct? If not, can you point me to examples with a
> watermark on 1) join keys and 2) values? Thank you very much.
>
> [1]
> https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L477-L478
>
> [2]
> https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala#L156-L164
>
> Regards,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> The Internals of Spark SQL https://bit.ly/spark-sql-internals
> The Internals of Spark Structured Streaming
> https://bit.ly/spark-structured-streaming
> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
> Follow me at https://twitter.com/jaceklaskowski
>
>


Re: [Structured Streaming] Robust watermarking calculation with future timestamps

2019-11-14 Thread Jungtaek Lim
(dropping user@ as cross-posting mailing lists for mail threads would
bother both lists, and it seems more appropriate to dev@)

AFAIK there's no API for custom watermark, and you're right picking max
timestamp would introduce the issues you provided. Other streaming
frameworks may pick min timestamp by default, which also has some tradeoff,
slower advancing watermark or being stuck in skewed data.

As a workaround for now, you can adjust timestamp column before calling
withWatermark so that future events can be adjusted, though that doesn't
provide functionality like 95th percentile which requires aggregated
calculation. I guess Spark community may consider adding the feature if the
community sees more requests on this.
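To make the failure mode and the clamping workaround concrete, here is a toy model in plain Python (this is not Spark's implementation; the 15-minute threshold matches the thread, and the timestamps are made up for illustration):

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=15)

def watermark(event_times, now=None, clamp=False):
    """Toy model of the watermark: max(event time) - threshold.

    With clamp=True, timestamps later than `now` are truncated to `now`
    before taking the max -- the "adjust the timestamp column before
    calling withWatermark" workaround."""
    if clamp and now is not None:
        event_times = [min(t, now) for t in event_times]
    return max(event_times) - THRESHOLD

now = datetime(2019, 11, 14, 12, 0)
events = [now - timedelta(minutes=m) for m in (1, 3, 5)]   # honest events
future = now + timedelta(hours=1)   # one source with a wrong timezone

wm_ok = watermark(events)                                   # 11:44:00
wm_bad = watermark(events + [future])                       # 12:45:00
wm_fix = watermark(events + [future], now=now, clamp=True)  # 11:45:00

# The single future event pushes the watermark past every honest event,
# so they would all be treated as late and dropped:
print(all(t < wm_bad for t in events))  # True
# Clamping restores a sane watermark below the honest events:
print(all(t > wm_fix for t in events))  # True
```

A percentile-based watermark, as asked about further down the thread, would need an aggregated calculation over the batch rather than this per-row clamp.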

Thanks,
Jungtaek Lim (HeartSaVioR)


On Wed, Nov 13, 2019 at 6:58 PM Anastasios Zouzias 
wrote:

> Hi all,
>
> We currently have the following issue with a Spark Structured Streaming
> (SS) application. The application reads messages from thousands of source
> systems, stores them in Kafka and Spark aggregates them using SS and
> watermarking (15 minutes).
>
> The root problem is that a few of the source systems have a wrong timezone
> setup that makes them emit messages from the future, i.e., +1 hour ahead of
> current time (mis-configuration or winter/summer timezone change (yeah!) ).
> Since the watermark is calculated as
>
> (latest timestamp value across all messages) - (watermarking threshold
> value, 15 mins),
>
> most of the messages are dropped because they are delayed by more
> than 45 minutes. In an even more extreme scenario, a single "future" /
> adversarial message can make the structured streaming application report
> zero messages (per mini-batch).
>
> Is there any user-exposed SS API that allows a more robust calculation of
> the watermark, e.g., the 95th percentile of timestamps instead of the max
> timestamp? I understand that such a calculation would be more expensive, but
> it would make the application more robust.
>
> Any suggestions/ideas?
>
> PS. Of course the best approach would be to fix the issue on all source
> systems but this might take time to do so (or perhaps drop future messages
> programmatically (yikes) ).
>
> Best regards,
> Anastasios
>


Re: [build system] Upgrading pyarrow, builds might be temporarily broken

2019-11-14 Thread Xiao Li
Hi, Bryan,

Thank you for your update!

Xiao

On Thu, Nov 14, 2019 at 8:48 PM Bryan Cutler  wrote:

> Update: #26133  has been
> merged and builds should be passing now, thanks all!
>
> On Thu, Nov 14, 2019 at 4:12 PM Bryan Cutler  wrote:
>
>> We are in the process of upgrading pyarrow in the testing environment,
>> which might cause pyspark test failures until
>> https://github.com/apache/spark/pull/26133 is merged. Apologies for the
>> lack of notice beforehand, but I jumped the gun a little and forgot this
>> would affect other builds too. The PR for the upgrade is all ready to go
>> and *should* pass current tests. I'll keep an eye on it and try to get it
>> resolved tonight. If it becomes a problem, we can try to downgrade pyarrow
>> to where it was.
>>
>> Thanks,
>> Bryan
>>
>



Re: [build system] Upgrading pyarrow, builds might be temporarily broken

2019-11-14 Thread Bryan Cutler
Update: #26133  has been merged
and builds should be passing now, thanks all!

On Thu, Nov 14, 2019 at 4:12 PM Bryan Cutler  wrote:

> We are in the process of upgrading pyarrow in the testing environment,
> which might cause pyspark test failures until
> https://github.com/apache/spark/pull/26133 is merged. Apologies for the
> lack of notice beforehand, but I jumped the gun a little and forgot this
> would affect other builds too. The PR for the upgrade is all ready to go
> and *should* pass current tests. I'll keep an eye on it and try to get it
> resolved tonight. If it becomes a problem, we can try to downgrade pyarrow
> to where it was.
>
> Thanks,
> Bryan
>


Re: Ask for ARM CI for spark

2019-11-14 Thread bo zhaobo
Hi,

I found that Spark 3.0.0-preview has been released, but there are no release
notes in [1]. How about adding a note about ARM support to the next release
notes (maybe the release notes of Spark 3.0.0-preview)? Sorry to raise this;
I'm not familiar with the process, so please feel free to correct me.
Moreover, since the daily job has been running stably for a few weeks, we can
probably say in the next release notes that the release has some basic ARM
support.
From the perspective of the open-source project and ecosystem, this is a good
chance to tell people that Spark supports ARM and works across platforms. If
Spark can support more architectures, that is good for attracting more users,
even those who cannot use x86 for other, uncontrollable reasons.

[1]  https://github.com/apache/spark-website/tree/asf-site/releases/_posts




On Fri, Nov 15, 2019 at 11:00 AM, bo zhaobo wrote:

> Hi @Sean Owen  ,
>
> Thanks for reply. We know that Spark community has own release date and
> plan. We are happy to follow Spark community. But we think it's great if
> community could add a sentence into the next releasenotes and claim "Spark
> can support Arm from this release." after we finish the test work on ARM.
> That's all. We just want a community official caliber that spark support
> ARM for attracting more users to use spark.
>
> Thank you very much for your patient.
>
> BR
>
> ZhaoBo
>
>
>
> On Fri, Nov 15, 2019 at 10:25 AM, Tianhua huang wrote:
>
>> @Sean,
>> Yes, you are right, we don't have to create a separate release of Spark
>> for ARM, it's enough to add a releasenote to say that Spark supports
>> arm architecture.
>> About the test failure, one or two tests will timeout on our poor
>> performance arm instance sometimes, now we donate a high performance arm
>> instance to amplab, and wait shane to build the jobs on it.
>>
>> On Fri, Nov 15, 2019 at 10:13 AM Sean Owen  wrote:
>>
>>> I don't quite understand. You are saying tests don't pass yet, so why
>>> would anyone yet run these tests regularly?
>>> If it's because the instances aren't fast enough, use bigger instances?
>>> I don't think anyone would create a separate release of Spark for ARM,
>>> no. But why would that be necessary?
>>>
>>> On Thu, Nov 14, 2019 at 7:28 PM bo zhaobo 
>>> wrote:
>>>
 Hi Spark team,

 Any ideas about the above email? Thank you.

 BR

 ZhaoBo



 On Tue, Nov 12, 2019 at 2:47 PM, Tianhua huang wrote:

> Hi all,
>
> The Spark ARM jobs have been building for some time, and now there are two
> jobs [1]: spark-master-test-maven-arm and spark-master-test-python-arm.
> We can see some build failures, but they are due to the poor performance of
> the ARM instance. We have now begun to build the Spark ARM jobs on other,
> higher-performance instances, where the builds/tests all succeed, and we
> plan to donate the instance to AMPLab later. According to the build history,
> we are very happy to say Spark is supported on the aarch64 platform, and I
> suggest adding this good news to the spark-3.0.0 release notes. Maybe the
> community could provide an ARM-supported release of Spark in the meantime?
>
> [1]
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/
>
>
> P.S.: the JIRA https://issues.apache.org/jira/browse/SPARK-29106 tracks
> the whole work; thank you very much, Shane :)
>



Re: Ask for ARM CI for spark

2019-11-14 Thread bo zhaobo
Hi @Sean Owen  ,

Thanks for reply. We know that Spark community has own release date and
plan. We are happy to follow Spark community. But we think it's great if
community could add a sentence into the next releasenotes and claim "Spark
can support Arm from this release." after we finish the test work on ARM.
That's all. We just want a community official caliber that spark support
ARM for attracting more users to use spark.

Thank you very much for your patient.

BR

ZhaoBo



On Fri, Nov 15, 2019 at 10:25 AM, Tianhua huang wrote:

> @Sean,
> Yes, you are right, we don't have to create a separate release of Spark
> for ARM, it's enough to add a releasenote to say that Spark supports
> arm architecture.
> About the test failure, one or two tests will timeout on our poor
> performance arm instance sometimes, now we donate a high performance arm
> instance to amplab, and wait shane to build the jobs on it.
>
> On Fri, Nov 15, 2019 at 10:13 AM Sean Owen  wrote:
>
>> I don't quite understand. You are saying tests don't pass yet, so why
>> would anyone yet run these tests regularly?
>> If it's because the instances aren't fast enough, use bigger instances?
>> I don't think anyone would create a separate release of Spark for ARM,
>> no. But why would that be necessary?
>>
>> On Thu, Nov 14, 2019 at 7:28 PM bo zhaobo 
>> wrote:
>>
>>> Hi Spark team,
>>>
>>> Any ideas about the above email? Thank you.
>>>
>>> BR
>>>
>>> ZhaoBo
>>>
>>>
>>>
>>> On Tue, Nov 12, 2019 at 2:47 PM, Tianhua huang wrote:
>>>
 Hi all,

 Spark arm jobs have built for some time, and now there are two jobs[1]
 spark-master-test-maven-arm
 
 and spark-master-test-python-arm
 ,
 we can see there are some build failures, but it because of the poor
 performance of the arm instance, and now we begin to build spark arm jobs
 on other high performance instances, and the build/test are all success, we
 plan to donate the instance to amplab later.  According to the build
 history, we are very happy to say spark is supported on aarch64 platform,
 and I suggest to add this good news into spark-3.0.0 releasenotes. Maybe
 community could provide an arm-supported release of spark at the meanwhile?

 [1]
 https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/

 https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/


 ps: the jira https://issues.apache.org/jira/browse/SPARK-29106 trace
 the whole work, thank you very much Shane:)

>>>


Re: Ask for ARM CI for spark

2019-11-14 Thread Tianhua huang
@Sean,
Yes, you are right, we don't have to create a separate release of Spark for
ARM; it's enough to add a release note saying that Spark supports the ARM
architecture.
About the test failures: one or two tests sometimes time out on our
poor-performance ARM instance. We have now donated a high-performance ARM
instance to AMPLab and are waiting for Shane to set up the jobs on it.

On Fri, Nov 15, 2019 at 10:13 AM Sean Owen  wrote:

> I don't quite understand. You are saying tests don't pass yet, so why
> would anyone yet run these tests regularly?
> If it's because the instances aren't fast enough, use bigger instances?
> I don't think anyone would create a separate release of Spark for ARM, no.
> But why would that be necessary?
>
> On Thu, Nov 14, 2019 at 7:28 PM bo zhaobo 
> wrote:
>
>> Hi Spark team,
>>
>> Any ideas about the above email? Thank you.
>>
>> BR
>>
>> ZhaoBo
>>
>>
>>
>> On Tue, Nov 12, 2019 at 2:47 PM, Tianhua huang wrote:
>>
>>> Hi all,
>>>
>>> Spark arm jobs have built for some time, and now there are two jobs[1]
>>> spark-master-test-maven-arm
>>> 
>>> and spark-master-test-python-arm
>>> ,
>>> we can see there are some build failures, but it because of the poor
>>> performance of the arm instance, and now we begin to build spark arm jobs
>>> on other high performance instances, and the build/test are all success, we
>>> plan to donate the instance to amplab later.  According to the build
>>> history, we are very happy to say spark is supported on aarch64 platform,
>>> and I suggest to add this good news into spark-3.0.0 releasenotes. Maybe
>>> community could provide an arm-supported release of spark at the meanwhile?
>>>
>>> [1]
>>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/
>>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/
>>>
>>>
>>> ps: the jira https://issues.apache.org/jira/browse/SPARK-29106 trace
>>> the whole work, thank you very much Shane:)
>>>
>>


Re: Ask for ARM CI for spark

2019-11-14 Thread Sean Owen
I don't quite understand. You are saying tests don't pass yet, so why would
anyone yet run these tests regularly?
If it's because the instances aren't fast enough, use bigger instances?
I don't think anyone would create a separate release of Spark for ARM, no.
But why would that be necessary?

On Thu, Nov 14, 2019 at 7:28 PM bo zhaobo 
wrote:

> Hi Spark team,
>
> Any ideas about the above email? Thank you.
>
> BR
>
> ZhaoBo
>
>
>
> On Tue, Nov 12, 2019 at 2:47 PM, Tianhua huang wrote:
>
>> Hi all,
>>
>> Spark arm jobs have built for some time, and now there are two jobs[1]
>> spark-master-test-maven-arm
>> 
>> and spark-master-test-python-arm
>> ,
>> we can see there are some build failures, but it because of the poor
>> performance of the arm instance, and now we begin to build spark arm jobs
>> on other high performance instances, and the build/test are all success, we
>> plan to donate the instance to amplab later.  According to the build
>> history, we are very happy to say spark is supported on aarch64 platform,
>> and I suggest to add this good news into spark-3.0.0 releasenotes. Maybe
>> community could provide an arm-supported release of spark at the meanwhile?
>>
>> [1]
>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/
>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/
>>
>> ps: the jira https://issues.apache.org/jira/browse/SPARK-29106 trace the
>> whole work, thank you very much Shane:)
>>
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Hyukjin Kwon
Yeah, sounds good to have it.

In the case of R, it seems not quite common to write down the JIRA ID [1], but
it looks like some tests do have the prefix in their names.
In the case of Python and Java, it seems we write a JIRA ID from time to time
in the comment right under the test method [2][3].

Given this pattern, I would like to suggest using the same format, but:

1. For Python and Java, write a single comment that starts with JIRA ID and
short description, e.g. (SPARK-X: test blah blah)
2. For R, use JIRA ID as a prefix for its test name.

[1] git grep -r "SPARK-" -- '*test*.R'
[2] git grep -r "SPARK-" -- '*Suite.java'
[3] git grep -r "SPARK-" -- '*test*.py'

Does that make sense? Adding Felix and Shivaram too.


On Fri, Nov 15, 2019 at 3:13 AM, Shixiong(Ryan) Zhu wrote:

> Should we also add a guideline for non Scala tests? Other languages (Java,
> Python, R) don't support using string as a test name.
>
> Best Regards,
> Ryan
>
>
> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon  wrote:
>
>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>
>> On Wed, Nov 13, 2019 at 10:43 AM, Hyukjin Kwon wrote:
>>
>>> > In general a test should be self descriptive and I don't think we
>>> should be adding JIRA ticket references wholesale. Any action that the
>>> reader has to take to understand why a test was introduced is one too many.
>>> However in some cases the thing we are trying to test is very subtle and in
>>> that case a reference to a JIRA ticket might be useful, I do still feel
>>> that this should be a backstop and that properly documenting your tests is
>>> a much better way of dealing with this.
>>>
>>> Yeah, the test should be self-descriptive. I don't think adding a JIRA
>>> prefix harms this point. Probably I should add this sentence in the
>>> guidelines as well.
>>> Adding a JIRA prefix just adds one extra hint to track down details. I
>>> think it's fine to stick to this practice and make it simpler and clear to
>>> follow.
>>>
>>> > 1. what if multiple JIRA IDs relating to the same test? we just take
>>> the very first JIRA ID?
>>> Ideally one JIRA should describe one issue and one PR should fix one
>>> JIRA with a dedicated test.
>>> Yeah, I think I would take the very first JIRA ID.
>>>
>>> > 2. are we going to have a full scan of all existing tests and attach a
>>> JIRA ID to it?
>>> Yea, let's don't do this.
>>>
>>> > It's a nice-to-have, not super essential, just because ...
>>> It's been asked multiple times and each committer seems having a
>>> different understanding on this.
>>> It's not a biggie but wanted to make it clear and conclude this.
>>>
>>> > I'd add this only when a test specifically targets a certain issue.
>>> Yes, so this one I am not sure. From what I heard, people adds the JIRA
>>> in cases below:
>>>
>>> - Whenever the JIRA type is a bug
>>> - When a PR adds a couple of tests
>>> - Only when a test specifically targets a certain issue.
>>> - ...
>>>
>>> Which one do we prefer and simpler to follow?
>>>
>>> Or I can combine as below (im gonna reword when I actually document
>>> this):
>>> 1. In general, we should add a JIRA ID as prefix of a test when a PR
>>> targets to fix a specific issue.
>>> In practice, it usually happens when a JIRA type is a bug or a PR
>>> adds a couple of tests.
>>> 2. Uses "SPARK-: test name" format
>>>
>>> If we have no objection with ^, let me go with this.
>>>
>>> On Wed, Nov 13, 2019 at 8:14 AM, Sean Owen wrote:
>>>
 Let's suggest "SPARK-12345:" but not go back and change a bunch of test
 cases.
 I'd add this only when a test specifically targets a certain issue.
 It's a nice-to-have, not super essential, just because in the rare
 case you need to understand why a test asserts something, you can go
 back and find what added it in the git history without much trouble.

 On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
 wrote:
 >
 > Hi all,
 >
 > Maybe it's not a big deal but it brought some confusions time to time
 into Spark dev and community. I think it's time to discuss about when/which
 format to add a JIRA ID as a prefix for the test case name in Scala test
 cases.
 >
 > Currently we have many test case names with prefixes as below:
 >
 > test("SPARK-X blah blah")
 > test("SPARK-X: blah blah")
 > test("SPARK-X - blah blah")
 > test("[SPARK-X] blah blah")
 > …
 >
 > It is a good practice to have the JIRA ID in general because, for
 instance,
 > it makes us put less efforts to track commit histories (or even when
 the files
 > are totally moved), or to track related information of tests failed.
 > Considering Spark's getting big, I think it's good to document.
 >
 > I would like to suggest this and document it in our guideline:
 >
 > 1. Add a prefix into a test name when a PR adds a couple of tests.
 > 2. Uses "SPARK-: test name" format which is used in our code base
 most
 >   

Re: Ask for ARM CI for spark

2019-11-14 Thread bo zhaobo
Hi Spark team,

Any ideas about the above email? Thank you.

BR

ZhaoBo



On Tue, Nov 12, 2019 at 2:47 PM, Tianhua huang wrote:

> Hi all,
>
> Spark arm jobs have built for some time, and now there are two jobs[1]
> spark-master-test-maven-arm
> 
> and spark-master-test-python-arm
> ,
> we can see there are some build failures, but it because of the poor
> performance of the arm instance, and now we begin to build spark arm jobs
> on other high performance instances, and the build/test are all success, we
> plan to donate the instance to amplab later.  According to the build
> history, we are very happy to say spark is supported on aarch64 platform,
> and I suggest to add this good news into spark-3.0.0 releasenotes. Maybe
> community could provide an arm-supported release of spark at the meanwhile?
>
> [1]
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/
>
> ps: the jira https://issues.apache.org/jira/browse/SPARK-29106 trace the
> whole work, thank you very much Shane:)
>
> On Thu, Oct 17, 2019 at 2:52 PM bo zhaobo 
> wrote:
>
>> Just Notes: The jira issue link is
>> https://issues.apache.org/jira/browse/SPARK-29106
>>
>>
>>
>>
>> On Thu, Oct 17, 2019 at 10:47 AM, Tianhua huang wrote:
>>
>>> OK, let's update infos there. Thanks.
>>>
>>> On Thu, Oct 17, 2019 at 1:52 AM Shane Knapp  wrote:
>>>
 i totally missed the spark jira from earlier...  let's move the
 conversation there!

 On Tue, Oct 15, 2019 at 6:21 PM bo zhaobo 
 wrote:

> Shane, Awaresome! We will try the best to finish the test and the
> requests on the VM recently. Once we finish those things, we will send you
> an email , then we can continue the following things. Thank you very much.
>
> Best Regards,
>
> ZhaoBo
>
> On Wed, Oct 16, 2019 at 3:47 AM, Shane Knapp wrote:
>
>> ok!  i'm able to successfully log in to the VM!
>>
>> i also have created a jenkins worker entry:
>> https://amplab.cs.berkeley.edu/jenkins/computer/spark-arm-vm/
>>
>> it's a pretty bare-bones VM, so i have some suggestions/requests
>> before we can actually proceed w/testing.  i will not be able to perform
>> any system configuration, as i don't have the cycles to reverse-engineer
>> the ansible setup and test it all out.
>>
>> * java is not installed, please install the following:
>>   - java8 min version 1.8.0_191
>>   - java11 min version 11.0.1
>>
>> * it appears from the ansible playbook that there are other deps that
>> need to be installed.
>>   - please install all deps
>>   - manually run the tests until they pass
>>
>> * the jenkins user should NEVER have sudo or any root-level access!
>>
>> * once the arm tests pass when manually run, take a snapshot of this
>> image so we can recreate it w/o needing to reinstall everything
>>
>> after that's done i can finish configuring the jenkins worker and set
>> up a build...
>>
>> thanks!
>>
>> shane
>>
>>
>> On Mon, Oct 14, 2019 at 8:34 PM Shane Knapp 
>> wrote:
>>
>>> yes, i will get to that tomorrow.  today was spent cleaning up the
>>> mess from last week.
>>>
>>> On Mon, Oct 14, 2019 at 6:18 PM bo zhaobo <
>>> bzhaojyathousa...@gmail.com> wrote:
>>>
 Hi shane,

 That's great news for Amplab is back. ;-) . If possible, could you
 please take several minutes to check the ARM VM is accessible from your
 side? And is there any plan for the whole ARM test integration from
 you?(how about we finish it this month?) Thanks.

 Best regards,

 ZhaoBo




On Thu, Oct 10, 2019 at 8:29 AM, bo zhaobo wrote:

> Oh, sorry about we miss that email.  If possible, could you please
> take some minutes to test the ARM VM is accessible through your ssh 
> private

[build system] Upgrading pyarrow, builds might be temporarily broken

2019-11-14 Thread Bryan Cutler
We are in the process of upgrading pyarrow in the testing environment,
which might cause pyspark test failures until
https://github.com/apache/spark/pull/26133 is merged. Apologies for the
lack of notice beforehand, but I jumped the gun a little and forgot this
would affect other builds too. The PR for the upgrade is all ready to go
and *should* pass current tests. I'll keep an eye on it and try to get it
resolved tonight. If it becomes a problem, we can try to downgrade pyarrow
to where it was.

Thanks,
Bryan
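For context, PySpark gates Arrow-dependent features behind a minimum-version check (require_minimum_pyarrow_version in pyspark.sql.utils); below is a self-contained sketch of that kind of guard. The version numbers are illustrative, and _ver_tuple is a simplified parser that ignores pre-release tags:

```python
def _ver_tuple(v: str):
    """Parse 'X.Y.Z' into an integer tuple for comparison (this sketch
    ignores pre-release tags and extra components)."""
    return tuple(int(p) for p in v.split(".")[:3])

def check_minimum_pyarrow(installed: str, minimum: str = "0.15.1") -> None:
    """Raise ImportError if the installed pyarrow version is too old.

    Mirrors the spirit of PySpark's require_minimum_pyarrow_version;
    the minimum shown here is illustrative, not the real pin."""
    if _ver_tuple(installed) < _ver_tuple(minimum):
        raise ImportError(
            f"pyarrow >= {minimum} must be installed; found {installed}")

check_minimum_pyarrow("0.15.1")      # ok, meets the minimum
try:
    check_minimum_pyarrow("0.14.0")  # too old for the upgraded test env
except ImportError as e:
    print(e)  # pyarrow >= 0.15.1 must be installed; found 0.14.0
```

A guard like this turns the "tests break until the PR is merged" window into an explicit, actionable error message instead of obscure test failures.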


Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Shixiong(Ryan) Zhu
Should we also add a guideline for non Scala tests? Other languages (Java,
Python, R) don't support using string as a test name.

Best Regards,
Ryan


On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon  wrote:

> I opened a PR - https://github.com/apache/spark-website/pull/231
>
> On Wed, Nov 13, 2019 at 10:43 AM, Hyukjin Kwon wrote:
>
>> > In general a test should be self descriptive and I don't think we
>> should be adding JIRA ticket references wholesale. Any action that the
>> reader has to take to understand why a test was introduced is one too many.
>> However in some cases the thing we are trying to test is very subtle and in
>> that case a reference to a JIRA ticket might be useful, I do still feel
>> that this should be a backstop and that properly documenting your tests is
>> a much better way of dealing with this.
>>
>> Yeah, the test should be self-descriptive. I don't think adding a JIRA
>> prefix harms this point. Probably I should add this sentence in the
>> guidelines as well.
>> Adding a JIRA prefix just adds one extra hint to track down details. I
>> think it's fine to stick to this practice and make it simpler and clear to
>> follow.
>>
>> > 1. what if multiple JIRA IDs relating to the same test? we just take
>> the very first JIRA ID?
>> Ideally one JIRA should describe one issue and one PR should fix one JIRA
>> with a dedicated test.
>> Yeah, I think I would take the very first JIRA ID.
>>
>> > 2. are we going to have a full scan of all existing tests and attach a
>> JIRA ID to it?
>> Yea, let's don't do this.
>>
>> > It's a nice-to-have, not super essential, just because ...
>> It's been asked multiple times and each committer seems having a
>> different understanding on this.
>> It's not a biggie but wanted to make it clear and conclude this.
>>
>> > I'd add this only when a test specifically targets a certain issue.
>> Yes, I am not sure about this one. From what I have heard, people add
>> the JIRA ID in the cases below:
>>
>> - Whenever the JIRA type is a bug
>> - When a PR adds a couple of tests
>> - Only when a test specifically targets a certain issue.
>> - ...
>>
>> Which one do we prefer, and which is simpler to follow?
>>
>> Or I can combine them as below (I will reword this when I actually
>> document it):
>> 1. In general, we should add a JIRA ID as a prefix of a test name when a
>> PR targets a specific issue.
>>     In practice, this usually happens when the JIRA type is a bug or a PR
>> adds a couple of tests.
>> 2. Use the "SPARK-XXXXX: test name" format.
>>
>> If we have no objection with ^, let me go with this.
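As an aside, the combined rule is mechanical enough to check with a regex. This is a hypothetical sketch, not an official lint rule; the pattern and example names are my own illustration:

```python
import re

# Proposed format: "SPARK-XXXXX: test name" (JIRA ID, colon, space, name).
FORMAT = re.compile(r'^SPARK-[0-9]+: \S')

def follows_guideline(test_name: str) -> bool:
    """Return True if a test name follows the proposed convention."""
    return bool(FORMAT.match(test_name))

print(follows_guideline("SPARK-12345: handle empty partitions"))   # True
print(follows_guideline("[SPARK-12345] handle empty partitions"))  # False
print(follows_guideline("SPARK-12345 - handle empty partitions"))  # False
```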
>>
>> On Wed, Nov 13, 2019 at 8:14 AM, Sean Owen wrote:
>>
>>> Let's suggest "SPARK-12345:" but not go back and change a bunch of test
>>> cases.
>>> I'd add this only when a test specifically targets a certain issue.
>>> It's a nice-to-have, not super essential, just because in the rare
>>> case you need to understand why a test asserts something, you can go
>>> back and find what added it in the git history without much trouble.
>>>
>>> On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Maybe it's not a big deal, but it has brought some confusion into the
>>> Spark dev community from time to time. I think it's time to discuss when,
>>> and in which format, to add a JIRA ID as a prefix for the test case name
>>> in Scala test cases.
>>> >
>>> > Currently we have many test case names with prefixes as below:
>>> >
>>> > test("SPARK-X blah blah")
>>> > test("SPARK-X: blah blah")
>>> > test("SPARK-X - blah blah")
>>> > test("[SPARK-X] blah blah")
>>> > …
>>> >
>>> > It is a good practice to have the JIRA ID in general because, for
>>> > instance, it takes less effort to track commit histories (even when
>>> > files are moved around), or to track information related to failed
>>> > tests. Considering that Spark is getting big, I think it's good to
>>> > document this.
>>> >
>>> > I would like to suggest this and document it in our guideline:
>>> >
>>> > 1. Add a prefix to a test name when a PR adds a couple of tests.
>>> > 2. Use the "SPARK-XXXXX: test name" format, which is used most often
>>> > in our code base[1].
>>> >
>>> > We should make it simple and clear but close to the actual practice.
>>> So, I would like to hear what other people think. I would appreciate it
>>> if you could give some feedback about when to add the JIRA prefix. One
>>> alternative is that we only add the prefix when the JIRA's type is bug.
>>> >
>>> > [1]
>>> > git grep -E 'test\("SPARK-([0-9]+):' | wc -l
>>> >  923
>>> > git grep -E 'test\("SPARK-([0-9]+) ' | wc -l
>>> >  477
>>> > git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>>> >   16
>>> > git grep -E 'test\("SPARK-([0-9]+) -' | wc -l
>>> >   13
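The survey can be reproduced in miniature. The sketch below runs the same four patterns over a handful of hypothetical test names; note that, by construction, the "space" pattern also matches lines in the "dash" form, so those two buckets overlap:

```python
import re

# Hypothetical sample of test names, standing in for the Spark code base.
sample = [
    'test("SPARK-11111: colon form")',
    'test("SPARK-22222 space form")',
    'test("[SPARK-33333] bracket form")',
    'test("SPARK-44444 - dash form")',
    'test("no jira id here")',
]

# The same four patterns used in the git grep survey above.
patterns = {
    "colon":   re.compile(r'test\("SPARK-([0-9]+):'),
    "space":   re.compile(r'test\("SPARK-([0-9]+) '),
    "bracket": re.compile(r'test\("\[SPARK-([0-9]+)\]'),
    "dash":    re.compile(r'test\("SPARK-([0-9]+) -'),
}

# Count how many sample lines each pattern matches.
counts = {name: sum(1 for line in sample if p.search(line))
          for name, p in patterns.items()}
print(counts)
```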
>>> >
>>> >
>>> >
>>>
>>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Hyukjin Kwon
I opened a PR - https://github.com/apache/spark-website/pull/231
