Great discussion!
Greg, good callout regarding the two long-running builds; I had missed that 90-day view. My takeaway is that our average build time for tests is between 3 and 4 hours, which in and of itself seems large.

But then, reconciling this with Sophie's statement: is it possible that these timed-out 8-hour builds don't get captured in that view? It is odd that people are reporting these builds and Gradle Enterprise isn't showing them.

---

> I think that these particularly nasty builds could be explained by long-tail slowdowns causing arbitrary tests to take an excessive time to execute.

I'm not sure I understood that. If the tests have timeouts, where would the slowdown come from? Problems in tearing down the test?

---

David, thanks for the great work in identifying and even fixing those two top offenders! And thank you for cherry-picking to 3.7.

---

All in all, from this thread I can summarize a few potential solutions:

S-1. Dedicated work identifying and fixing some of the issues (e.g. what David did).
- Should help alleviate the issues, since it is plausible that just one or two tests frequently cause the majority of them.
- On that note, KAFKA-16045 seems to be up for grabs if there are any volunteers.
- Sophie's list also contains good candidates.

S-2. Global 10-minute timeout for tests.
- Should serve as a strong catch-all for any misbehaving tests. I like this idea since it's guaranteed to save each contributor many hours of waiting on a build that times out after 8+ hours.
- Luke already has a PR out for this: https://github.com/apache/kafka/pull/15065

S-3. Separate infrastructure for our CI.
- This would help with Greg's comment about the developer machine being 2-20 times faster than the CI.
- Requires volunteer funding from external companies. If every contributor brought up the idea with their employer, we may be able to stitch something together.

S-4. Separate test runs depending on which module is changed.
- This makes sense, although it is tricky to implement successfully, as seemingly unrelated tests may expose problems in a change (e.g. changes to core components like the clients, the server, etc.).

S-5. Greater committer diligence when merging PRs.
- This should always be there. Unfortunately, it is somewhat self-perpetuating: when the builds get worse, people are incentivized to be less diligent (being slowed down while in a rush to merge, recency bias from failed builds, etc.).

On Fri, Dec 22, 2023 at 4:16 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

> Thanks David! I think this should help a lot!
>
> While we should include these improvements, I think it is also good to
> remind folks that a lot of these issues come from merging on builds that
> regress the CI.
> I know I'm not perfect at this (and have merged on flaky and failing
> tests), but let's all be super careful going forward. There were a few
> times I retried the build 10+ times and thought it was other issues with
> the CI but the failed builds were actually due to the changes I wrote/was
> reviewing.
>
> We all need to work together on this to ensure the builds stay healthy!
> Thanks all for being concerned about our builds!
>
> Justine
>
> On Fri, Dec 22, 2023 at 6:02 AM David Jacot <david.ja...@gmail.com> wrote:
>
> > I just merged both PRs.
> >
> > Cheers,
> > David
> >
> > On Fri, Dec 22, 2023 at 14:38, David Jacot <david.ja...@gmail.com> wrote:
> >
> > > Hey folks,
> > >
> > > I believe that my two PRs will fix most of the issues. I have also tweaked
> > > the configuration of Jenkins to fix the issues relating to cloning the
> > > repo. There may be other issues but the overall situation should be much
> > > better when I merge those two.
> > >
> > > I will update this thread when I merge them.
> > >
> > > Cheers,
> > > David
> > >
> > > On Fri, Dec 22, 2023 at 14:22, Divij Vaidya <divijvaidy...@gmail.com> wrote:
> > >
> > >> Hey folks
> > >>
> > >> I think David (dajac) has some fixes lined-up to improve CI such as
> > >> https://github.com/apache/kafka/pull/15063 and
> > >> https://github.com/apache/kafka/pull/15062.
> > >>
> > >> I have some bandwidth for the next two days to work on fixing the CI. Let
> > >> me start by taking a look at the list that Sophie shared here.
> > >>
> > >> --
> > >> Divij Vaidya
> > >>
> > >> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen <show...@gmail.com> wrote:
> > >>
> > >> > Hi Sophie and Philip and all,
> > >> >
> > >> > I share the same pain as you.
> > >> > I've been waiting for a CI build result in a PR for days. Unfortunately, I
> > >> > can only get 1 result each day because it takes 8 hours for each run, and
> > >> > with failed results. :(
> > >> >
> > >> > I've looked into the 8 hour timeout build issue and would like to propose
> > >> > to set a global test timeout as 10 mins using the junit5 feature
> > >> > <https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts>.
> > >> > This way, we can fail those long running tests quickly without impacting
> > >> > other tests.
> > >> > PR: https://github.com/apache/kafka/pull/15065
> > >> > I've tested in my local environment and it works as expected.
> > >> >
> > >> > Any feedback is welcome.
> > >> >
> > >> > Thanks.
> > >> > Luke
> > >> >
> > >> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee <philip...@gmail.com> wrote:
> > >> >
> > >> > > Hey Sophie - I've gotten 2 inflight PRs each with more than 15 retries...
> > >> > > Namely: https://github.com/apache/kafka/pull/15023 and
> > >> > > https://github.com/apache/kafka/pull/15035
> > >> > >
> > >> > > justin filed a flaky test report here though:
> > >> > > https://issues.apache.org/jira/browse/KAFKA-16045
> > >> > >
> > >> > > P
> > >> > >
> > >> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
> > >> > >
> > >> > > > On a related note, has anyone else had trouble getting even a single run
> > >> > > > with no build failures lately? I've had multiple pure-docs PRs blocked for
> > >> > > > days or even weeks because of miscellaneous infra, test, and timeout
> > >> > > > failures. I know we just had a discussion about whether it's acceptable to
> > >> > > > ever merge with a failing build, and the consensus (which I agree with) was
> > >> > > > NO -- but seriously, this is getting ridiculous. The build might be the
> > >> > > > worst I've ever seen it, and it just makes it really difficult to maintain
> > >> > > > good will with external contributors.
> > >> > > >
> > >> > > > Take for example this small docs PR:
> > >> > > > https://github.com/apache/kafka/pull/14949
> > >> > > >
> > >> > > > It's on its 7th replay, with the first 6 runs all having (at least) one
> > >> > > > build that failed completely. The issues I saw on this one PR are a good
> > >> > > > summary of what I've been seeing elsewhere, so here's the briefing:
> > >> > > >
> > >> > > > 1. gradle issue:
> > >> > > >
> > >> > > > > * What went wrong:
> > >> > > > >
> > >> > > > > Gradle could not start your build.
> > >> > > > >
> > >> > > > > > Cannot create service of type BuildSessionActionExecutor using method LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as there is a problem with parameter #21 of type FileSystemWatchingInformation.
> > >> > > > > > Cannot create service of type BuildLifecycleAwareVirtualFileSystem using method VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() as there is a problem with parameter #7 of type GlobalCacheLocations.
> > >> > > > > > Cannot create service of type GlobalCacheLocations using method GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a problem with parameter #1 of type List<GlobalCache>.
> > >> > > > > > Could not create service of type FileAccessTimeJournal using GradleUserHomeScopeServices.createFileAccessTimeJournal().
> > >> > > > > > Timeout waiting to lock journal cache (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another Gradle instance.
> > >> > > >
> > >> > > > 2. git issue:
> > >> > > >
> > >> > > > > ERROR: Error cloning remote repo 'origin'
> > >> > > > > hudson.plugins.git.GitException: java.io.IOException: Remote call on builds43 failed
> > >> > > >
> > >> > > > 3. storage test calling System.exit (I think)
> > >> > > >
> > >> > > > > * What went wrong:
> > >> > > > > Execution failed for task ':storage:test'.
> > >> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> > >> > > > > This problem might be caused by incorrect test process configuration.
> > >> > > >
> > >> > > > 4. 3/4 builds aborted suddenly for no clear reason
> > >> > > >
> > >> > > > 5. 1 build was aborted, 1 build failed due to a gradle(?) issue with a
> > >> > > > storage test:
> > >> > > >
> > >> > > > > Failed to map supported failure 'org.opentest4j.AssertionFailedError: Failed to observe commit callback before timeout' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea': null
> > >> > > > >
> > >> > > > > * What went wrong:
> > >> > > > > Execution failed for task ':storage:test'.
> > >> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> > >> > > > > This problem might be caused by incorrect test process configuration.
> > >> > > >
> > >> > > > 6. Unknown issue with a core test:
> > >> > > >
> > >> > > > > Unexpected exception thrown.
> > >> > > > > org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:46952'.
> > >> > > > > at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
> > >> > > > > at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
> > >> > > > > at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
> > >> > > > > at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
> > >> > > > > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> > >> > > > > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> > >> > > > > at java.base/java.lang.Thread.run(Thread.java:1583)
> > >> > > > > Caused by: java.lang.IllegalArgumentException
> > >> > > > > at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
> > >> > > > > at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
> > >> > > > > at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
> > >> > > > > ... 6 more
> > >> > > > > org.gradle.internal.remote.internal.ConnectException: Could not connect to server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
> > >> > > > > at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
> > >> > > > > at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
> > >> > > > > at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
> > >> > > > > at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
> > >> > > > > at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
> > >> > > > > at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> > >> > > > > Caused by: java.net.ConnectException: Connection refused
> > >> > > > > at java.base/sun.nio.ch.Net.pollConnect(Native Method)
> > >> > > > > at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
> > >> > > > > at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
> > >> > > > > at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
> > >> > > > > at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
> > >> > > > > at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
> > >> > > > > at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
> > >> > > > > ... 5 more
> > >> > > > >
> > >> > > > > * What went wrong:
> > >> > > > > Execution failed for task ':core:test'.
> > >> > > > > > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
> > >> > > > > This problem might be caused by incorrect test process configuration.
> > >> > > >
> > >> > > > I've seen almost all of the above issues multiple times, so it might be a
> > >> > > > good list to start with to focus any efforts on improving the build. That
> > >> > > > said, I'm not sure what we can really do about most of these, and not sure
> > >> > > > how to narrow down the root cause in the more mysterious cases of aborted
> > >> > > > builds and the builds that end with "finished with non-zero exit value 1"
> > >> > > > with no additional context (that I could find)
> > >> > > >
> > >> > > > If nothing else, there seems to be something happening in one (or more) of
> > >> > > > the storage tests, because by far the most common failure I've seen is that
> > >> > > > in 3 & 5. Unfortunately it's not really clear to me how to tell which is
> > >> > > > the offending test, so I'm not even sure what to file a ticket for
> > >> > > >
> > >> > > > On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:
> > >> > > >
> > >> > > > > The slowness of the CI is definitely causing us a lot of pain. I wonder if
> > >> > > > > we should move to a dedicated CI infrastructure for Kafka. Our integration
> > >> > > > > tests are quite heavy and ASF's CI is not really tuned for them.
> > >> > > > > We could tune it for our needs and this would also allow external
> > >> > > > > companies to sponsor more workers. I heard that we have a few cloud
> > >> > > > > providers in the community ;). I think that we should consider this. What
> > >> > > > > do you think? I already discussed this with the INFRA team. I could
> > >> > > > > continue if we believe that it is a way forward.
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > David
> > >> > > > >
> > >> > > > > On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski <stanis...@confluent.io.invalid> wrote:
> > >> > > > >
> > >> > > > > > Hey Николай,
> > >> > > > > >
> > >> > > > > > Apologies about this - I wasn't aware of this behavior. I have made all
> > >> > > > > > the gists public.
> > >> > > > > >
> > >> > > > > > On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:
> > >> > > > > >
> > >> > > > > > > Hey Stan,
> > >> > > > > > >
> > >> > > > > > > Thanks for opening the discussion. I haven't been looking at overall
> > >> > > > > > > build duration recently, so it's good that you are calling it out.
> > >> > > > > > >
> > >> > > > > > > I worry about us over-indexing on this one build, which itself appears
> > >> > > > > > > to be an outlier. I only see one other build [1] above 6h overall in
> > >> > > > > > > the last 90 days in this view: [2]
> > >> > > > > > > And I don't see any overlap of failed tests in these two builds, which
> > >> > > > > > > makes it less likely that these particular failed tests are the causes
> > >> > > > > > > of long build times.
> > >> > > > > > >
> > >> > > > > > > Separately, I've been investigating build environment slowness, and
> > >> > > > > > > trying to connect it with test failures [3]. I observed that the CI
> > >> > > > > > > build environment is 2-20 times slower than my developer machine (M1
> > >> > > > > > > mac).
> > >> > > > > > > When I simulate a similar slowdown locally, there are tests which
> > >> > > > > > > become significantly more flakey, often due to hard-coded timeouts.
> > >> > > > > > > I think that these particularly nasty builds could be explained by
> > >> > > > > > > long-tail slowdowns causing arbitrary tests to take an excessive time
> > >> > > > > > > to execute.
> > >> > > > > > >
> > >> > > > > > > Rather than trying to find signals in these rare test failures, I
> > >> > > > > > > think we should find tests that have these sorts of failures more
> > >> > > > > > > regularly.
> > >> > > > > > > There are lots of builds in the 5-6h duration bracket, which is
> > >> > > > > > > certainly unacceptably long. We should look into these builds to find
> > >> > > > > > > improvements and optimizations.
> > >> > > > > > >
> > >> > > > > > > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
> > >> > > > > > > [2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
> > >> > > > > > > [3] https://github.com/apache/kafka/pull/15008
> > >> > > > > > >
> > >> > > > > > > Thanks for looking into this!
> > >> > > > > > > Greg
> > >> > > > > > >
> > >> > > > > > > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hello, Stanislav.
> > >> > > > > > > >
> > >> > > > > > > > Can you, please, make the gist public.
> > >> > > > > > > > Private gists not available for some GitHub users even if link are known.
> > >> > > > > > > >
> > >> > > > > > > > > On Dec 19, 2023, at 17:33, Stanislav Kozlovski <stanis...@confluent.io.INVALID> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > Hey everybody,
> > >> > > > > > > > > I've heard various complaints that build times in trunk are taking too
> > >> > > > > > > > > long, some taking as much as 8 hours (the timeout) - and this is slowing us
> > >> > > > > > > > > down from being able to meet the code freeze deadline for 3.7.
> > >> > > > > > > > >
> > >> > > > > > > > > I took it upon myself to gather up some data in Gradle Enterprise to see if
> > >> > > > > > > > > there are any outlier tests that are causing this slowness. Turns out there
> > >> > > > > > > > > are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/
> > >> > > > > > > > > - which took 10 hours and 29 minutes in total.
> > >> > > > > > > > >
> > >> > > > > > > > > I have compiled the tests that took a disproportionately large amount of
> > >> > > > > > > > > time (20m+), alongside their time, error message and a link to their full
> > >> > > > > > > > > log output here -
> > >> > > > > > > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> > >> > > > > > > > >
> > >> > > > > > > > > It includes failures from core, streams, storage and clients.
> > >> > > > > > > > > Interestingly, some other tests that don't fail also take a long time in
> > >> > > > > > > > > what is apparently the test harness framework. See the gist for more
> > >> > > > > > > > > information.
> > >> > > > > > > > >
> > >> > > > > > > > > I am starting this thread with the intention of getting the discussion
> > >> > > > > > > > > started and brainstorming what we can do to get the build times back under
> > >> > > > > > > > > control.
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > > Best,
> > >> > > > > > > > > Stanislav
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Best,
> > >> > > > > > Stanislav

--
Best,
Stanislav
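
P.S. To make the idea behind S-2 concrete: JUnit 5 exposes a default timeout through the `junit.jupiter.execution.timeout.default` configuration parameter (for example `10 m`); how Luke's PR actually wires it up is not shown here. The general principle, independent of JUnit, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_with_timeout` and the toy tests are hypothetical names, all tests are started at once, and unlike a real test runner the sketch does not interrupt a stuck test, it merely stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(tests, timeout_s):
    """Run each callable in `tests` (name -> function) with a shared time limit.

    Returns a dict mapping each test name to "passed", "failed" or "timed out".
    """
    results = {}
    # One worker per test, so a slow test cannot delay the start of the others.
    pool = ThreadPoolExecutor(max_workers=max(1, len(tests)))
    try:
        futures = {name: pool.submit(fn) for name, fn in tests.items()}
        # All tests start together, so they share one deadline.
        deadline = time.monotonic() + timeout_s
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                future.result(timeout=remaining)
                results[name] = "passed"
            except FutureTimeout:
                results[name] = "timed out"
            except Exception:
                results[name] = "failed"
    finally:
        # Stop waiting for stuck tests; a real runner would interrupt them.
        pool.shutdown(wait=False)
    return results


if __name__ == "__main__":
    def quick():
        return None

    def slow():
        time.sleep(1.0)  # stands in for a hung test

    print(run_with_timeout({"quick": quick, "slow": slow}, timeout_s=0.2))
```

With a 10-minute limit in place of the 0.2 seconds used here, a hung test would cost at most ten minutes rather than running into the 8-hour build timeout, while the remaining tests still report their own results.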