Re: Random sampling in tests

2018-10-09 Thread Steve Loughran
Randomized testing can, in theory, help you explore a far larger area of the 
environment of an app than you could explicitly explore, such as "does 
everything work in the turkish locale where "I".toLower()!="i", etc.

Good: faster tests, especially on an essentially-non-finite set of options

bad: if you can't replicate it, you can't be sure you've fixed any failure


The Lucene team have led the way here; they've got a notion of a randomized 
context which can be regenerated -e.g. you use a random key as its foundation, 
and you can set that key to rerun the entire test suite in the same context, 

https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/dawidweiss-randomizedtesting-pub.pdf

Talking to them is probably the best way to start in this world —the key point 
is you can make use of this safely, but it needs planning

-Steve



Re: Random sampling in tests

2018-10-08 Thread Dongjoon Hyun
Sean's approach looks much better to me (
https://github.com/apache/spark/pull/22672)

It achieves both contradictory goals simultaneously; keeping all test
coverages and reducing the time from 2:31 to 0:24.

Since we can remove test coverages anytime, can we proceed with Sean's
non-intrusive approach first before removing?

Bests,
Dongjoon.


On Mon, Oct 8, 2018 at 8:57 AM Xiao Li  wrote:

> Yes. Testing all the timezones is not needed.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk 
> wrote:
>
>> Hi All,
>>
>> I believe we should also take into account what we test, for example, I
>> don't think it makes sense to check all timezones for JSON/CSV
>> functions/datasources because those timezones are just passed to external
>> libraries. So, the same code is involved into testing of each out of 650
>> timezones. We basically just spend time and resources on testing the
>> external libraries.
>>
>> I mean the PRs: https://github.com/apache/spark/pull/22657 and
>> https://github.com/apache/spark/pull/22379#discussion_r223039662
>>
>> Maxim Gekk
>>
>> Technical Solutions Lead
>>
>> Databricks B. V.  
>>
>>
>> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen  wrote:
>>
>>> If the problem is simply reducing the wall-clock time of tests, then
>>> even before we get to this question, I'm advocating:
>>>
>>> 1) try simple parallelization of tests within the suite. In this
>>> instance there's no reason not to test these in parallel and get a 8x
>>> or 16x speedup from cores. This assumes, I believe correctly, that the
>>> machines aren't generally near 100% utilization
>>> 2) explicitly choose a smaller, more directed set of cases to test
>>>
>>> Randomly choosing test cases with a fixed seed is basically 2, but not
>>> choosing test cases for a particular reason. You can vary the seed but
>>> as a rule the same random subset of tests is always chosen. Could be
>>> fine if there's no reason at all to prefer some cases over others. But
>>> I am guessing any wild guess at the most important subset of cases to
>>> test is better than random.
>>>
>>> I'm trying 1) right now instead in these several cases.
>>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
>>> >
>>> > For this specific case, I do not think we should test all the
>>> timezone. If this is fast, I am fine to leave it unchanged. However, this
>>> is very slow. Thus, I even prefer to reducing the tested timezone to a
>>> smaller number or just hardcoding some specific time zones.
>>> >
>>> > In general, I like Reynold’s idea by including the seed value and we
>>> add the seed name in the test case name. This can help us reproduce it.
>>> >
>>> > Xiao
>>> >
>>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin 
>>> wrote:
>>> >>
>>> >> I'm personally not a big fan of doing it that way in the PR. It is
>>> perfectly fine to employ randomized tests, and in this case it might even
>>> be fine to just pick couple different timezones like the way it happened in
>>> the PR, but we should:
>>> >>
>>> >> 1. Document in the code comment why we did it that way.
>>> >>
>>> >> 2. Use a seed and log the seed, so any test failures can be
>>> reproduced deterministically. For this one, it'd be better to pick the seed
>>> from a seed environmental variable. If the env variable is not set, set to
>>> a random seed.
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>> >>>
>>> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> >>> that tests a bunch of cases by randomly choosing different subsets of
>>> >>> cases to test on each Jenkins run.
>>> >>>
>>> >>> There's disagreement about whether this is good approach to improving
>>> >>> test runtime. Here's a discussion on one that was committed:
>>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>>> >>>
>>> >>> I'm flagging it for more input.
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Random sampling in tests

2018-10-08 Thread Xiao Li
Yes. Testing all the timezones is not needed.

Xiao

On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk  wrote:

> Hi All,
>
> I believe we should also take into account what we test, for example, I
> don't think it makes sense to check all timezones for JSON/CSV
> functions/datasources because those timezones are just passed to external
> libraries. So, the same code is involved into testing of each out of 650
> timezones. We basically just spend time and resources on testing the
> external libraries.
>
> I mean the PRs: https://github.com/apache/spark/pull/22657 and
> https://github.com/apache/spark/pull/22379#discussion_r223039662
>
> Maxim Gekk
>
> Technical Solutions Lead
>
> Databricks B. V.  
>
>
> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen  wrote:
>
>> If the problem is simply reducing the wall-clock time of tests, then
>> even before we get to this question, I'm advocating:
>>
>> 1) try simple parallelization of tests within the suite. In this
>> instance there's no reason not to test these in parallel and get a 8x
>> or 16x speedup from cores. This assumes, I believe correctly, that the
>> machines aren't generally near 100% utilization
>> 2) explicitly choose a smaller, more directed set of cases to test
>>
>> Randomly choosing test cases with a fixed seed is basically 2, but not
>> choosing test cases for a particular reason. You can vary the seed but
>> as a rule the same random subset of tests is always chosen. Could be
>> fine if there's no reason at all to prefer some cases over others. But
>> I am guessing any wild guess at the most important subset of cases to
>> test is better than random.
>>
>> I'm trying 1) right now instead in these several cases.
>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
>> >
>> > For this specific case, I do not think we should test all the timezone.
>> If this is fast, I am fine to leave it unchanged. However, this is very
>> slow. Thus, I even prefer to reducing the tested timezone to a smaller
>> number or just hardcoding some specific time zones.
>> >
>> > In general, I like Reynold’s idea by including the seed value and we
>> add the seed name in the test case name. This can help us reproduce it.
>> >
>> > Xiao
>> >
>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>> >>
>> >> I'm personally not a big fan of doing it that way in the PR. It is
>> perfectly fine to employ randomized tests, and in this case it might even
>> be fine to just pick couple different timezones like the way it happened in
>> the PR, but we should:
>> >>
>> >> 1. Document in the code comment why we did it that way.
>> >>
>> >> 2. Use a seed and log the seed, so any test failures can be reproduced
>> deterministically. For this one, it'd be better to pick the seed from a
>> seed environmental variable. If the env variable is not set, set to a
>> random seed.
>> >>
>> >>
>> >>
>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>> >>>
>> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
>> >>> that tests a bunch of cases by randomly choosing different subsets of
>> >>> cases to test on each Jenkins run.
>> >>>
>> >>> There's disagreement about whether this is good approach to improving
>> >>> test runtime. Here's a discussion on one that was committed:
>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>> >>>
>> >>> I'm flagging it for more input.
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Random sampling in tests

2018-10-08 Thread Maxim Gekk
Hi All,

I believe we should also take into account what we test, for example, I
don't think it makes sense to check all timezones for JSON/CSV
functions/datasources because those timezones are just passed to external
libraries. So, the same code is involved into testing of each out of 650
timezones. We basically just spend time and resources on testing the
external libraries.

I mean the PRs: https://github.com/apache/spark/pull/22657 and
https://github.com/apache/spark/pull/22379#discussion_r223039662

Maxim Gekk

Technical Solutions Lead

Databricks B. V.  


On Mon, Oct 8, 2018 at 4:49 PM Sean Owen  wrote:

> If the problem is simply reducing the wall-clock time of tests, then
> even before we get to this question, I'm advocating:
>
> 1) try simple parallelization of tests within the suite. In this
> instance there's no reason not to test these in parallel and get a 8x
> or 16x speedup from cores. This assumes, I believe correctly, that the
> machines aren't generally near 100% utilization
> 2) explicitly choose a smaller, more directed set of cases to test
>
> Randomly choosing test cases with a fixed seed is basically 2, but not
> choosing test cases for a particular reason. You can vary the seed but
> as a rule the same random subset of tests is always chosen. Could be
> fine if there's no reason at all to prefer some cases over others. But
> I am guessing any wild guess at the most important subset of cases to
> test is better than random.
>
> I'm trying 1) right now instead in these several cases.
> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
> >
> > For this specific case, I do not think we should test all the timezone.
> If this is fast, I am fine to leave it unchanged. However, this is very
> slow. Thus, I even prefer to reducing the tested timezone to a smaller
> number or just hardcoding some specific time zones.
> >
> > In general, I like Reynold’s idea by including the seed value and we add
> the seed name in the test case name. This can help us reproduce it.
> >
> > Xiao
> >
> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
> >>
> >> I'm personally not a big fan of doing it that way in the PR. It is
> perfectly fine to employ randomized tests, and in this case it might even
> be fine to just pick couple different timezones like the way it happened in
> the PR, but we should:
> >>
> >> 1. Document in the code comment why we did it that way.
> >>
> >> 2. Use a seed and log the seed, so any test failures can be reproduced
> deterministically. For this one, it'd be better to pick the seed from a
> seed environmental variable. If the env variable is not set, set to a
> random seed.
> >>
> >>
> >>
> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
> >>>
> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
> >>> that tests a bunch of cases by randomly choosing different subsets of
> >>> cases to test on each Jenkins run.
> >>>
> >>> There's disagreement about whether this is good approach to improving
> >>> test runtime. Here's a discussion on one that was committed:
> >>> https://github.com/apache/spark/pull/22631/files#r223190476
> >>>
> >>> I'm flagging it for more input.
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Random sampling in tests

2018-10-08 Thread Sean Owen
If the problem is simply reducing the wall-clock time of tests, then
even before we get to this question, I'm advocating:

1) try simple parallelization of tests within the suite. In this
instance there's no reason not to test these in parallel and get a 8x
or 16x speedup from cores. This assumes, I believe correctly, that the
machines aren't generally near 100% utilization
2) explicitly choose a smaller, more directed set of cases to test

Randomly choosing test cases with a fixed seed is basically 2, but not
choosing test cases for a particular reason. You can vary the seed but
as a rule the same random subset of tests is always chosen. Could be
fine if there's no reason at all to prefer some cases over others. But
I am guessing any wild guess at the most important subset of cases to
test is better than random.

I'm trying 1) right now instead in these several cases.
On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
>
> For this specific case, I do not think we should test all the timezone. If 
> this is fast, I am fine to leave it unchanged. However, this is very slow. 
> Thus, I even prefer to reducing the tested timezone to a smaller number or 
> just hardcoding some specific time zones.
>
> In general, I like Reynold’s idea by including the seed value and we add the 
> seed name in the test case name. This can help us reproduce it.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>
>> I'm personally not a big fan of doing it that way in the PR. It is perfectly 
>> fine to employ randomized tests, and in this case it might even be fine to 
>> just pick couple different timezones like the way it happened in the PR, but 
>> we should:
>>
>> 1. Document in the code comment why we did it that way.
>>
>> 2. Use a seed and log the seed, so any test failures can be reproduced 
>> deterministically. For this one, it'd be better to pick the seed from a seed 
>> environmental variable. If the env variable is not set, set to a random seed.
>>
>>
>>
>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>>
>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> that tests a bunch of cases by randomly choosing different subsets of
>>> cases to test on each Jenkins run.
>>>
>>> There's disagreement about whether this is good approach to improving
>>> test runtime. Here's a discussion on one that was committed:
>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>
>>> I'm flagging it for more input.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Random sampling in tests

2018-10-08 Thread Marco Gaido
Yes, I see. It makes sense.
Thanks.

Il giorno lun 8 ott 2018 alle ore 16:35 Reynold Xin 
ha scritto:

> Marco - the issue is to reproduce. It is much more annoying for somebody
> else who might not have touched this test case to be able to reproduce the
> error, just given a timezone. It is much easier to just follow some
> documentation saying "please run TEST_SEED=5 build/sbt ~ ".
>
>
> On Mon, Oct 8, 2018 at 4:33 PM Marco Gaido  wrote:
>
>> Hi all,
>>
>> thanks for bringing up the topic Sean. I agree too with Reynold's idea,
>> but in the specific case, if there is an error the timezone is part of the
>> error message.
>> So we know exactly which timezone caused the failure. Hence I thought
>> that logging the seed is not necessary, as we can directly use the failing
>> timezone.
>>
>> Thanks,
>> Marco
>>
>> Il giorno lun 8 ott 2018 alle ore 16:24 Xiao Li 
>> ha scritto:
>>
>>> For this specific case, I do not think we should test all the timezone.
>>> If this is fast, I am fine to leave it unchanged. However, this is very
>>> slow. Thus, I even prefer to reducing the tested timezone to a smaller
>>> number or just hardcoding some specific time zones.
>>>
>>> In general, I like Reynold’s idea by including the seed value and we add
>>> the seed name in the test case name. This can help us reproduce it.
>>>
>>> Xiao
>>>
>>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>>
 I'm personally not a big fan of doing it that way in the PR. It is
 perfectly fine to employ randomized tests, and in this case it might even
 be fine to just pick couple different timezones like the way it happened in
 the PR, but we should:

 1. Document in the code comment why we did it that way.

 2. Use a seed and log the seed, so any test failures can be reproduced
 deterministically. For this one, it'd be better to pick the seed from a
 seed environmental variable. If the env variable is not set, set to a
 random seed.



 On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:

> Recently, I've seen 3 pull requests that try to speed up a test suite
> that tests a bunch of cases by randomly choosing different subsets of
> cases to test on each Jenkins run.
>
> There's disagreement about whether this is good approach to improving
> test runtime. Here's a discussion on one that was committed:
> https://github.com/apache/spark/pull/22631/files#r223190476
>
> I'm flagging it for more input.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Random sampling in tests

2018-10-08 Thread Reynold Xin
Marco - the issue is to reproduce. It is much more annoying for somebody
else who might not have touched this test case to be able to reproduce the
error, just given a timezone. It is much easier to just follow some
documentation saying "please run TEST_SEED=5 build/sbt ~ ".


On Mon, Oct 8, 2018 at 4:33 PM Marco Gaido  wrote:

> Hi all,
>
> thanks for bringing up the topic Sean. I agree too with Reynold's idea,
> but in the specific case, if there is an error the timezone is part of the
> error message.
> So we know exactly which timezone caused the failure. Hence I thought that
> logging the seed is not necessary, as we can directly use the failing
> timezone.
>
> Thanks,
> Marco
>
> Il giorno lun 8 ott 2018 alle ore 16:24 Xiao Li 
> ha scritto:
>
>> For this specific case, I do not think we should test all the timezone.
>> If this is fast, I am fine to leave it unchanged. However, this is very
>> slow. Thus, I even prefer to reducing the tested timezone to a smaller
>> number or just hardcoding some specific time zones.
>>
>> In general, I like Reynold’s idea by including the seed value and we add
>> the seed name in the test case name. This can help us reproduce it.
>>
>> Xiao
>>
>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>
>>> I'm personally not a big fan of doing it that way in the PR. It is
>>> perfectly fine to employ randomized tests, and in this case it might even
>>> be fine to just pick couple different timezones like the way it happened in
>>> the PR, but we should:
>>>
>>> 1. Document in the code comment why we did it that way.
>>>
>>> 2. Use a seed and log the seed, so any test failures can be reproduced
>>> deterministically. For this one, it'd be better to pick the seed from a
>>> seed environmental variable. If the env variable is not set, set to a
>>> random seed.
>>>
>>>
>>>
>>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>>
 Recently, I've seen 3 pull requests that try to speed up a test suite
 that tests a bunch of cases by randomly choosing different subsets of
 cases to test on each Jenkins run.

 There's disagreement about whether this is good approach to improving
 test runtime. Here's a discussion on one that was committed:
 https://github.com/apache/spark/pull/22631/files#r223190476

 I'm flagging it for more input.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: Random sampling in tests

2018-10-08 Thread Marco Gaido
Hi all,

thanks for bringing up the topic Sean. I agree too with Reynold's idea, but
in the specific case, if there is an error the timezone is part of the
error message.
So we know exactly which timezone caused the failure. Hence I thought that
logging the seed is not necessary, as we can directly use the failing
timezone.

Thanks,
Marco

Il giorno lun 8 ott 2018 alle ore 16:24 Xiao Li  ha
scritto:

> For this specific case, I do not think we should test all the timezone. If
> this is fast, I am fine to leave it unchanged. However, this is very slow.
> Thus, I even prefer to reducing the tested timezone to a smaller number or
> just hardcoding some specific time zones.
>
> In general, I like Reynold’s idea by including the seed value and we add
> the seed name in the test case name. This can help us reproduce it.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>
>> I'm personally not a big fan of doing it that way in the PR. It is
>> perfectly fine to employ randomized tests, and in this case it might even
>> be fine to just pick couple different timezones like the way it happened in
>> the PR, but we should:
>>
>> 1. Document in the code comment why we did it that way.
>>
>> 2. Use a seed and log the seed, so any test failures can be reproduced
>> deterministically. For this one, it'd be better to pick the seed from a
>> seed environmental variable. If the env variable is not set, set to a
>> random seed.
>>
>>
>>
>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>
>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> that tests a bunch of cases by randomly choosing different subsets of
>>> cases to test on each Jenkins run.
>>>
>>> There's disagreement about whether this is good approach to improving
>>> test runtime. Here's a discussion on one that was committed:
>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>
>>> I'm flagging it for more input.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Random sampling in tests

2018-10-08 Thread Xiao Li
For this specific case, I do not think we should test all the timezone. If
this is fast, I am fine to leave it unchanged. However, this is very slow.
Thus, I even prefer to reducing the tested timezone to a smaller number or
just hardcoding some specific time zones.

In general, I like Reynold’s idea by including the seed value and we add
the seed name in the test case name. This can help us reproduce it.

Xiao

On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:

> I'm personally not a big fan of doing it that way in the PR. It is
> perfectly fine to employ randomized tests, and in this case it might even
> be fine to just pick couple different timezones like the way it happened in
> the PR, but we should:
>
> 1. Document in the code comment why we did it that way.
>
> 2. Use a seed and log the seed, so any test failures can be reproduced
> deterministically. For this one, it'd be better to pick the seed from a
> seed environmental variable. If the env variable is not set, set to a
> random seed.
>
>
>
> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>
>> Recently, I've seen 3 pull requests that try to speed up a test suite
>> that tests a bunch of cases by randomly choosing different subsets of
>> cases to test on each Jenkins run.
>>
>> There's disagreement about whether this is good approach to improving
>> test runtime. Here's a discussion on one that was committed:
>> https://github.com/apache/spark/pull/22631/files#r223190476
>>
>> I'm flagging it for more input.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Random sampling in tests

2018-10-08 Thread Sean Owen
Recently, I've seen 3 pull requests that try to speed up a test suite
that tests a bunch of cases by randomly choosing different subsets of
cases to test on each Jenkins run.

There's disagreement about whether this is good approach to improving
test runtime. Here's a discussion on one that was committed:
https://github.com/apache/spark/pull/22631/files#r223190476

I'm flagging it for more input.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org