Sean's approach looks much better to me (https://github.com/apache/spark/pull/22672).
It achieves both seemingly contradictory goals at once: keeping all the
test coverage while reducing the time from 2:31 to 0:24. Since we can
remove test coverage anytime, can we proceed with Sean's non-intrusive
approach first before removing any?

Bests,
Dongjoon.

On Mon, Oct 8, 2018 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:

> Yes. Testing all the timezones is not needed.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk <maxim.g...@databricks.com>
> wrote:
>
>> Hi All,
>>
>> I believe we should also take into account what we test. For example,
>> I don't think it makes sense to check all timezones for JSON/CSV
>> functions/datasources, because those timezones are just passed through
>> to external libraries. The same code is exercised for each of the 650
>> timezones, so we basically just spend time and resources on testing
>> the external libraries.
>>
>> I mean these PRs: https://github.com/apache/spark/pull/22657 and
>> https://github.com/apache/spark/pull/22379#discussion_r223039662
>>
>> Maxim Gekk
>>
>> Technical Solutions Lead
>>
>> Databricks B. V. <http://databricks.com/>
>>
>> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> If the problem is simply reducing the wall-clock time of tests, then
>>> even before we get to this question, I'm advocating:
>>>
>>> 1) Try simple parallelization of tests within the suite. In this
>>> instance there's no reason not to run these in parallel and get an
>>> 8x or 16x speedup from the cores. This assumes, I believe correctly,
>>> that the machines aren't generally near 100% utilization. (A sketch
>>> of this appears after the thread below.)
>>> 2) Explicitly choose a smaller, more directed set of cases to test.
>>>
>>> Randomly choosing test cases with a fixed seed is basically 2),
>>> except that the cases aren't chosen for any particular reason. You
>>> can vary the seed, but as a rule the same random subset of tests is
>>> always chosen. That could be fine if there's no reason at all to
>>> prefer some cases over others, but I am guessing any wild guess at
>>> the most important subset of cases to test is better than a random
>>> one.
>>>
>>> I'm trying 1) right now instead in these several cases.
>>>
>>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li <lix...@databricks.com> wrote:
>>> >
>>> > For this specific case, I do not think we should test all the
>>> > timezones. If this were fast, I would be fine leaving it unchanged.
>>> > However, it is very slow, so I would even prefer reducing the
>>> > tested timezones to a smaller number, or just hardcoding some
>>> > specific time zones.
>>> >
>>> > In general, I like Reynold's idea of including the seed value and
>>> > adding it to the test case name. This can help us reproduce
>>> > failures.
>>> >
>>> > Xiao
>>> >
>>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin <r...@databricks.com>
>>> > wrote:
>>> >>
>>> >> I'm personally not a big fan of doing it the way it was done in
>>> >> the PR. It is perfectly fine to employ randomized tests, and in
>>> >> this case it might even be fine to just pick a couple of different
>>> >> timezones as the PR did, but we should:
>>> >>
>>> >> 1. Document in a code comment why we did it that way.
>>> >>
>>> >> 2. Use a seed and log the seed, so any test failure can be
>>> >> reproduced deterministically. For this one, it'd be better to pick
>>> >> the seed from an environment variable; if the variable is not set,
>>> >> fall back to a random seed. (A sketch of this also appears after
>>> >> the thread below.)
>>> >>
>>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen <sro...@gmail.com> wrote:
>>> >>>
>>> >>> Recently, I've seen 3 pull requests that try to speed up a test
>>> >>> suite that tests a bunch of cases by randomly choosing a
>>> >>> different subset of the cases to test on each Jenkins run.
>>> >>>
>>> >>> There's disagreement about whether this is a good approach to
>>> >>> improving test runtime. Here's a discussion on one that was
>>> >>> committed:
>>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>>> >>>
>>> >>> I'm flagging it for more input.
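
For illustration, here is a minimal sketch of the intra-suite
parallelization Sean advocates in 1). It is not the code from his PR;
the object name and the trivial assertion are placeholders, and it
assumes Scala 2.12, where .par is in the standard library (Scala 2.13
needs the separate scala-parallel-collections module):

    import java.util.TimeZone

    object ParallelTimezoneCheck {
      def main(args: Array[String]): Unit = {
        // Run the per-timezone check across all available cores instead
        // of sequentially; the cases are independent, so this is safe.
        TimeZone.getAvailableIDs.toVector.par.foreach { id =>
          val tz = TimeZone.getTimeZone(id)
          // Placeholder for the real per-timezone assertion.
          assert(tz.getID == id)
        }
      }
    }

Because the per-timezone checks are independent, the speedup is bounded
mainly by the core count, in line with the 8x or 16x Sean estimates.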
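And a minimal sketch of the seeding scheme Reynold describes in 2.:
read the seed from an environment variable if set, otherwise pick one
at random, and always log it so a failing run can be replayed. The
variable name SPARK_TEST_SEED and the subset size of 10 are hypothetical
choices for this sketch, not an existing convention:

    import java.util.TimeZone
    import scala.util.Random

    object SeededTimezoneSample {
      def main(args: Array[String]): Unit = {
        // Use the seed from the environment if present; otherwise
        // randomize. SPARK_TEST_SEED is a hypothetical variable name.
        val seed = sys.env.get("SPARK_TEST_SEED").map(_.toLong)
          .getOrElse(Random.nextLong())
        // Always log the seed so a failing run can be replayed exactly
        // by exporting SPARK_TEST_SEED=<seed>.
        println(s"Randomized test seed: $seed")

        val rng = new Random(seed)
        // Test a random subset of timezones, deterministic given the seed.
        val sampled = rng.shuffle(TimeZone.getAvailableIDs.toSeq).take(10)
        sampled.foreach(id => println(s"Testing timezone $id"))
      }
    }

Xiao's suggestion of putting the seed in the test case name builds on
the same idea: generating the name as, say, s"parse timestamp
(seed=$seed)" would make the seed show up directly in test reports.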