Marco - the issue is reproducibility. It is much more annoying for somebody
else who might not have touched this test case to reproduce the error given
just a timezone. It is much easier to follow some documentation saying
"please run TEST_SEED=5 build/sbt ~.... ".
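A minimal sketch of the seed-from-env pattern being discussed, in Scala. This is illustrative only: the `TEST_SEED` variable name comes from the example above, but the logging format and the timezone-sampling code are assumptions, not Spark's actual test harness.

```scala
import scala.util.Random

object SeededTestExample {
  def main(args: Array[String]): Unit = {
    // Take the seed from the TEST_SEED environment variable if set,
    // otherwise fall back to a fresh random seed.
    val seed: Long = sys.env.get("TEST_SEED") match {
      case Some(s) => s.toLong       // reproduce with: TEST_SEED=<seed> build/sbt ...
      case None    => Random.nextLong()
    }

    // Log the seed so any failure can be reproduced deterministically.
    println(s"Random seed for this run: $seed")

    // Drive all randomness from the seeded generator, e.g. to pick a
    // random subset of timezones to test (hypothetical usage).
    val rng = new Random(seed)
    val timezones = java.util.TimeZone.getAvailableIDs.toSeq
    val sample = rng.shuffle(timezones).take(5)
    println(s"Testing timezones: ${sample.mkString(", ")}")
  }
}
```

With this shape, a CI failure message only needs to include the logged seed; re-running with that seed re-selects the same subset.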


On Mon, Oct 8, 2018 at 4:33 PM Marco Gaido <marcogaid...@gmail.com> wrote:

> Hi all,
>
> thanks for bringing up the topic Sean. I agree too with Reynold's idea,
> but in the specific case, if there is an error the timezone is part of the
> error message.
> So we know exactly which timezone caused the failure. Hence I thought that
> logging the seed is not necessary, as we can directly use the failing
> timezone.
>
> Thanks,
> Marco
>
> Il giorno lun 8 ott 2018 alle ore 16:24 Xiao Li <lix...@databricks.com>
> ha scritto:
>
>> For this specific case, I do not think we should test all the timezones.
>> If this were fast, I would be fine leaving it unchanged. However, it is very
>> slow, so I would even prefer reducing the tested timezones to a smaller
>> number, or just hardcoding some specific time zones.
>>
>> In general, I like Reynold’s idea of including the seed value and adding
>> the seed to the test case name. This can help us reproduce failures.
>>
>> Xiao
>>
>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> I'm personally not a big fan of doing it that way in the PR. It is
>>> perfectly fine to employ randomized tests, and in this case it might even
>>> be fine to just pick a couple of different timezones the way it happened in
>>> the PR, but we should:
>>>
>>> 1. Document in the code comment why we did it that way.
>>>
>>> 2. Use a seed and log the seed, so any test failures can be reproduced
>>> deterministically. For this one, it'd be better to pick the seed from a
>>> seed environment variable. If the env variable is not set, fall back to a
>>> random seed.
>>>
>>>
>>>
>>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>>> that tests a bunch of cases by randomly choosing different subsets of
>>>> cases to test on each Jenkins run.
>>>>
>>>> There's disagreement about whether this is a good approach to improving
>>>> test runtime. Here's a discussion on one PR that was committed:
>>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>>
>>>> I'm flagging it for more input.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
