Hey All,

Thanks for all the ideas.
@Stanislav, @Sonke: I probably left it out of my email, but I really imagined this as a case-by-case change: if we think it wouldn't cause problems for a given test, then it might be applied. That way we'd limit the blast radius somewhat. The 1 hour gain is really just the most optimistic scenario; I'm almost sure that not every test could be transformed to use a common cluster.

We have had an internal improvement for half a year now which reruns the flaky test classes at the end of the Gradle test task and lets you know that they were rerun and are probably flaky. It fails the build only if the second run of the test class was also unsuccessful. I think it works pretty well; we mostly have green builds. If there is interest, I can try to contribute that.

> I am also extremely annoyed at times by the amount of coffee I have to drink before tests finish

Just please don't get a heart attack :)

@Ron, @Colin: You bring up a very good point that it is easier and frees up more resources if we just run change-specific tests, and it's good to know that a similar solution (using a shared resource for testing) has failed elsewhere. I second Ron on the test categorization, although as a first attempt I think a flaky retry plus running only the necessary tests would help with both time saving and effectiveness. It would also be easier to achieve.

@Ismael: Yes, it'd be interesting to profile the startup/shutdown; I've never done that. Perhaps I'll set some time apart for that :). It's definitely true, though, that if we see a significant delay there, fixing it wouldn't just improve the efficiency of the tests but also the customer experience.

Best,
Viktor

On Thu, Feb 28, 2019 at 8:12 AM Ismael Juma <isma...@gmail.com> wrote:

> It's an idea that has come up before and is worth exploring eventually.
> However, I'd first try to optimize the server startup/shutdown process. If
> we measure where the time is going, maybe some opportunities will present
> themselves.
>
> Ismael
>
> On Wed, Feb 27, 2019, 3:09 AM Viktor Somogyi-Vass <viktorsomo...@gmail.com>
> wrote:
>
> > Hi Folks,
> >
> > I've been observing lately that unit tests usually take 2.5 hours to run,
> > and a very big portion of these are the core tests, where a new cluster is
> > spun up for every test. This takes most of the time. I ran a test class
> > (TopicCommandWithAdminClient with 38 tests inside) through the profiler,
> > and it shows for instance that running the whole class took 10 minutes
> > and 37 seconds, of which the useful time was 5 minutes 18 seconds. That's
> > 100% overhead. Without the profiler the whole class takes 7 minutes and 48
> > seconds, so the useful time would be between 3-4 minutes. This is a bigger
> > test though; most of them won't take this much.
> > There are 74 classes that implement KafkaServerTestHarness, and just
> > running :core:integrationTest takes almost 2 hours.
> >
> > I think we could greatly speed up these integration tests by creating
> > the cluster once per class and performing the tests in separate methods. I
> > know that this contradicts a little the principle that tests should be
> > independent, but it seems like recreating clusters for each test is a very
> > expensive operation. Also, if the tests act on different resources
> > (different topics, etc.) then it might not hurt their independence. There
> > might be cases of course where this is not possible, but I think there
> > could be a lot where it is.
> >
> > In the optimal case we could cut the testing time back by approximately an
> > hour. This would save resources and give quicker feedback for PR builds.
> >
> > What are your thoughts?
> > Has anyone thought about this or were there any attempts made?
> >
> > Best,
> > Viktor
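For anyone curious about the "rerun flaky classes once, fail only on a second failure" idea mentioned above, here is a minimal sketch of the shape of it in plain Java. This is not the actual internal implementation being offered for contribution; `runWithOneRetry` and the simulated tests are hypothetical names made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FlakyRetry {

    /**
     * Runs a test action; if it fails, reruns it once and reports the rerun.
     * Returns true if either run passed (i.e. the build stays green),
     * false only if both runs failed.
     */
    public static boolean runWithOneRetry(Runnable testAction) {
        try {
            testAction.run();
            return true;  // first run green, nothing to do
        } catch (AssertionError | RuntimeException firstFailure) {
            System.out.println("First run failed, rerunning (probably flaky): " + firstFailure.getMessage());
            try {
                testAction.run();  // second chance at the end of the task
                return true;       // flaky but ultimately green
            } catch (AssertionError | RuntimeException secondFailure) {
                return false;      // failed twice: treat as a real failure
            }
        }
    }

    public static void main(String[] args) {
        // Simulated flaky test: fails on the first invocation, passes on the second.
        AtomicInteger calls = new AtomicInteger();
        boolean flakyPassed = runWithOneRetry(() -> {
            if (calls.incrementAndGet() == 1) {
                throw new AssertionError("flaky failure");
            }
        });

        // Simulated genuinely broken test: always fails, so both runs fail.
        boolean brokenPassed = runWithOneRetry(() -> {
            throw new AssertionError("real failure");
        });

        System.out.println("flakyPassed=" + flakyPassed + " brokenPassed=" + brokenPassed);
    }
}
```

In a real build this logic would sit in the Gradle test task (collecting failed classes and rerunning them at the end) rather than wrapping individual Runnables, but the pass/fail rule is the same.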
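The cluster-once-per-class idea from the original email can also be sketched in a few lines. This deliberately does not use the real KafkaServerTestHarness or Kafka APIs; `ExpensiveCluster` is a hypothetical stand-in, and the point is only the setup shape: one expensive cluster shared by all test methods of a class, with each test isolated by acting on its own topic.

```java
public class SharedClusterExample {

    // Counts how many clusters were started, to show the savings.
    static int clustersStarted = 0;

    // Hypothetical stand-in for an expensive-to-start test cluster.
    static class ExpensiveCluster {
        ExpensiveCluster() { clustersStarted++; }  // imagine seconds of broker startup here
        void createTopic(String name) { /* tests keep independence via separate topics */ }
        void shutdown() { }
    }

    static ExpensiveCluster cluster;

    // In JUnit 4 these would be @BeforeClass / @AfterClass methods,
    // so the cluster is created once per class instead of once per test.
    static void setUpClass()    { cluster = new ExpensiveCluster(); }
    static void tearDownClass() { cluster.shutdown(); }

    // Each "test" works on its own resources against the shared cluster.
    static void testA() { cluster.createTopic("topic-a"); }
    static void testB() { cluster.createTopic("topic-b"); }

    public static void main(String[] args) {
        setUpClass();
        testA();
        testB();
        tearDownClass();
        System.out.println("clusters started: " + clustersStarted);  // 1, not one per test
    }
}
```

Tests that mutate cluster-wide state (broker configs, ACLs, controller failover, etc.) would still need their own cluster, which is why a case-by-case migration makes sense.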