Sean's approach looks much better to me
(https://github.com/apache/spark/pull/22672).

It achieves both contradictory goals simultaneously: keeping all test
coverage while reducing the time from 2:31 to 0:24.

Since we can remove test coverage at any time, can we proceed with Sean's
non-intrusive approach first, before removing anything?

Bests,
Dongjoon.


On Mon, Oct 8, 2018 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:

> Yes. Testing all the timezones is not needed.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk <maxim.g...@databricks.com>
> wrote:
>
>> Hi All,
>>
>> I believe we should also take into account what we test. For example, I
>> don't think it makes sense to check all timezones for the JSON/CSV
>> functions/datasources, because those timezones are just passed through to
>> external libraries. The same code is exercised for each of the 650
>> timezones, so we basically just spend time and resources on testing the
>> external libraries.
>>
>> I mean the PRs: https://github.com/apache/spark/pull/22657 and
>> https://github.com/apache/spark/pull/22379#discussion_r223039662
>>
>> Maxim Gekk
>>
>> Technical Solutions Lead
>>
>> Databricks B. V.  <http://databricks.com/>
>>
>>
>> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> If the problem is simply reducing the wall-clock time of tests, then
>>> even before we get to this question, I'm advocating:
>>>
>>> 1) try simple parallelization of tests within the suite. In this
>>> instance there's no reason not to run these cases in parallel and get an
>>> 8x or 16x speedup from cores; see the sketch below. This assumes, I
>>> believe correctly, that the machines aren't generally near 100%
>>> utilization.
>>> 2) explicitly choose a smaller, more directed set of cases to test
>>>
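>>> As a rough sketch of 1), something like the following would spread the
>>> cases across cores (testTimezone is a hypothetical stand-in for the
>>> real per-case body, not actual suite code):
>>>
>>>   // Evaluate every timezone case concurrently via a parallel
>>>   // collection (.par is available on standard collections in
>>>   // Scala 2.11/2.12), then report all failures together.
>>>   val timezones = java.util.TimeZone.getAvailableIDs.toSeq
>>>   val failures = timezones.par.flatMap { tz =>
>>>     // testTimezone(tz) is assumed to throw on failure
>>>     scala.util.Try(testTimezone(tz)).failed.toOption
>>>       .map(e => s"$tz: ${e.getMessage}")
>>>   }.seq
>>>   assert(failures.isEmpty, failures.mkString("\n"))
>>>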
>>> Randomly choosing test cases with a fixed seed is basically 2), except
>>> that the cases aren't chosen for any particular reason. You can vary
>>> the seed, but as a rule the same random subset of tests is always
>>> chosen. That could be fine if there's no reason at all to prefer some
>>> cases over others, but I am guessing that any wild guess at the most
>>> important subset of cases to test is better than a random one.
>>>
>>> I'm trying 1) right now in these several cases instead.
>>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li <lix...@databricks.com> wrote:
>>> >
>>> > For this specific case, I do not think we should test all the
>>> > timezones. If this were fast, I would be fine with leaving it
>>> > unchanged. However, it is very slow, so I would even prefer reducing
>>> > the tested timezones to a smaller number, or just hardcoding some
>>> > specific time zones.
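>>> >
>>> > For instance, a hand-picked subset might look like the sketch below;
>>> > these particular IDs are just an assumption about what covers the
>>> > interesting cases (UTC, a negative offset, a DST zone, a half-hour
>>> > offset, a zone far east of UTC), not an actual proposal:
>>> >
>>> >   // Hypothetical fixed subset of representative time zones.
>>> >   val testedTimezones = Seq("UTC", "America/Los_Angeles",
>>> >     "Europe/Paris", "Asia/Kolkata", "Pacific/Auckland")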
>>> >
>>> > In general, I like Reynold’s idea of including the seed value and
>>> > adding the seed to the test case name. This can help us reproduce
>>> > failures.
>>> >
>>> > Xiao
>>> >
>>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin <r...@databricks.com>
>>> wrote:
>>> >>
>>> >> I'm personally not a big fan of doing it that way in the PR. It is
>>> >> perfectly fine to employ randomized tests, and in this case it might
>>> >> even be fine to just pick a couple of different timezones, the way it
>>> >> happened in the PR, but we should:
>>> >>
>>> >> 1. Document in a code comment why we did it that way.
>>> >>
>>> >> 2. Use a seed and log it, so any test failures can be reproduced
>>> >> deterministically. For this one, it'd be better to pick the seed from
>>> >> an environment variable; if the variable is not set, fall back to a
>>> >> random seed.
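>>> >>
>>> >> A minimal sketch of 2., assuming a variable name SPARK_TEST_SEED and
>>> >> an allTimezones sequence that are illustrative only, not existing
>>> >> suite code:
>>> >>
>>> >>   // Read the seed from the environment, or fall back to a random
>>> >>   // one, and log it so a failing run can be replayed exactly.
>>> >>   val seed = sys.env.get("SPARK_TEST_SEED").map(_.toLong)
>>> >>     .getOrElse(scala.util.Random.nextLong())
>>> >>   println(s"Random seed for timezone sampling: $seed")
>>> >>   val rng = new scala.util.Random(seed)
>>> >>   val sampledTimezones = rng.shuffle(allTimezones).take(10)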
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen <sro...@gmail.com> wrote:
>>> >>>
>>> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> >>> that tests a bunch of cases by randomly choosing different subsets of
>>> >>> cases to test on each Jenkins run.
>>> >>>
>>> >>> There's disagreement about whether this is a good approach to
>>> >>> improving test runtime. Here's a discussion on one that was committed:
>>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>>> >>>
>>> >>> I'm flagging it for more input.
>>> >>>
