Hey All,

Thanks for all the ideas.

@Stanislav, @Sonke
I probably left this out of my email, but I really imagined this as a
case-by-case change: where we think it wouldn't cause problems, it might
be applied. That way we'd limit the blast radius somewhat. The one-hour
gain is really just the most optimistic scenario; I'm almost sure that not
every test could be transformed to use a common cluster.
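
Just to sketch the shape I mean (EmbeddedCluster and its methods are
made-up placeholders for the KafkaServerTestHarness-style setup), the
cluster would move into JUnit's per-class hooks:

    import org.junit.{AfterClass, BeforeClass, Test}

    object TopicCommandSharedClusterTest {
      // Hypothetical wrapper around broker startup/shutdown, started once
      // for the whole class instead of once per test method.
      var cluster: EmbeddedCluster = _

      @BeforeClass def startCluster(): Unit = {
        cluster = new EmbeddedCluster(numBrokers = 3)
        cluster.start()
      }

      @AfterClass def stopCluster(): Unit = cluster.shutdown()
    }

    class TopicCommandSharedClusterTest {
      import TopicCommandSharedClusterTest.cluster

      // Each test acts on its own topic, which keeps them independent.
      @Test def testCreateTopic(): Unit = cluster.createTopic("topic-create")
      @Test def testDeleteTopic(): Unit = {
        cluster.createTopic("topic-delete")
        cluster.deleteTopic("topic-delete")
      }
    }
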
We've had an internal improvement for about half a year now that reruns
flaky test classes at the end of the Gradle test task, reports that they
were rerun and are probably flaky, and fails the build only if the second
run of the test class is also unsuccessful. I think it works pretty well;
we mostly have green builds. If there is interest, I can try to contribute
it.
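
Our change hooks into the Gradle test task itself, so the snippet below is
not the actual mechanism, just a per-test analog of the same idea as a
JUnit 4 rule:

    import org.junit.rules.TestRule
    import org.junit.runner.Description
    import org.junit.runners.model.Statement

    // Reruns a failed test once and reports it as probably flaky; only a
    // second consecutive failure propagates and fails the build.
    class RetryOnceRule extends TestRule {
      override def apply(base: Statement, description: Description): Statement =
        new Statement {
          override def evaluate(): Unit =
            try base.evaluate()
            catch {
              case _: Throwable =>
                System.err.println(
                  s"${description.getDisplayName} failed once, rerunning (probably flaky)")
                base.evaluate() // a second failure fails the test for real
            }
        }
    }

The internal version instead collects the failed classes and reruns them
at the end of the test task, but the semantics are the same: green if the
rerun passes, red if it fails twice.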

> I am also extremely annoyed at times by the amount of coffee I have to
> drink before tests finish
Just please don't have a heart attack :)

@Ron, @Colin
You bring up a very good point: running only change-specific tests is
easier and frees up more resources, and it's good to know that a similar
solution (using a shared resource for testing) has failed elsewhere. I
second Ron on test categorization, although as a first attempt I think a
flaky retry plus running only the necessary tests would help with both
time savings and effectiveness, and would be easier to achieve.
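
For what it's worth, a rough sketch of the categorization side with JUnit
4 categories (the marker traits are made up):

    import org.junit.Test
    import org.junit.experimental.categories.Category

    // Hypothetical marker traits; JUnit only requires them to be types.
    trait QuickTest
    trait ClusterTest

    class TopicNameValidationTest {
      // No cluster needed, so a PR build could include only this category.
      @Category(Array(classOf[QuickTest]))
      @Test def testLegalCharacters(): Unit =
        assert("topic-a.1_2".matches("[a-zA-Z0-9._-]+"))
    }

Gradle can then include or exclude categories per test task, so PR builds
could run the quick set while the full cluster suite runs nightly.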

@Ismael
Yeah, it'd be interesting to profile the startup/shutdown; I've never done
that. Perhaps I'll set aside some time for it :). It's definitely true
though that if we find a significant delay there, we wouldn't just improve
the efficiency of the tests but also the customer experience.
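
If I get around to it, even coarse instrumentation like the sketch below
would be a start (the harness calls in the comments are placeholders):

    // Minimal timing helper to get a first picture of where the time goes.
    def timed[T](label: String)(block: => T): T = {
      val start = System.nanoTime()
      try block
      finally println(s"$label took ${(System.nanoTime() - start) / 1000000} ms")
    }

    // Where it would hook in, roughly:
    // val servers = timed("broker startup") { configs.map(createServer) }
    // timed("broker shutdown") { servers.foreach(_.shutdown()) }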

Best,
Viktor



On Thu, Feb 28, 2019 at 8:12 AM Ismael Juma <isma...@gmail.com> wrote:

> It's an idea that has come up before and is worth exploring eventually.
> However, I'd first try to optimize the server startup/shutdown process. If
> we measure where the time is going, maybe some opportunities will present
> themselves.
>
> Ismael
>
> On Wed, Feb 27, 2019, 3:09 AM Viktor Somogyi-Vass <viktorsomo...@gmail.com>
> wrote:
>
> > Hi Folks,
> >
> > I've been observing lately that unit tests usually take 2.5 hours to
> > run, and a very big portion of that is the core tests, where a new
> > cluster is spun up for every test. This takes most of the time. I ran a
> > test class (TopicCommandWithAdminClient, with 38 tests inside) through
> > the profiler, and it shows that running the whole class took 10 minutes
> > and 37 seconds, of which the useful time was 5 minutes and 18 seconds.
> > That's 100% overhead. Without the profiler the whole class takes 7
> > minutes and 48 seconds, so the useful time would be around 3-4 minutes.
> > This is a bigger test than most, though; the majority won't take this
> > long.
> > There are 74 classes that implement KafkaServerTestHarness, and just
> > running :core:integrationTest takes almost 2 hours.
> >
> > I think we could greatly speed up these integration tests by creating
> > the cluster once per class and performing the tests in separate
> > methods. I know this contradicts a little the principle that tests
> > should be independent, but recreating the cluster for each test is a
> > very expensive operation. Also, if the tests act on different resources
> > (different topics, etc.), then it might not hurt their independence.
> > There will of course be cases where this is not possible, but I think
> > there could be a lot where it is.
> >
> > In the optimal case we could cut the testing time back by approximately
> > an hour. This would save resources and give quicker feedback for PR
> > builds.
> >
> > What are your thoughts?
> > Has anyone thought about this, or were there any attempts made?
> >
> > Best,
> > Viktor
> >
>
