Hi Kirk

I have been using this new tool to analyze the trends of test
failures: 
https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
and general build failures:
https://ge.apache.org/scans/failures?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin

Regarding the classes of build failure: looking at the last 28 days, I
do not observe an increasing trend. The top causes of failure are
(link [2]):
1. Failures due to checkstyle (193 builds)
2. Timeout waiting to lock cache. It is currently in-use by another
Gradle instance.
3. Compilation failures (116 builds)
4. "Gradle Test Executor" finished with a non-zero exit value. Process
'Gradle Test Executor 180' finished with non-zero exit value 1

#4 happens when a test failure crashes the Gradle process. To debug
this, I usually go to the complete test output and try to figure out
which test 'Gradle Test Executor 180' was running last. As an example,
consider https://ge.apache.org/s/luizhogirob4e. We observe that this
fails for PR-14094, so we need to see the complete system out for that
PR. To find it, I go to the Kafka PR builder at
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/view/change-requests/
and find the build page for PR-14094, which is
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/.
Next, open the last failed build at
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14094/lastFailedBuild/,
observe that we have a failure for "Gradle Test Executor 177", click
on "view as plain text" (it takes a long time to load), and find what
that Gradle Test Executor was doing. In this case, it failed with the
following error. I strongly believe that it is due to
https://github.com/apache/kafka/pull/13572 but unfortunately, this was
reverted and never fixed after that. Perhaps you might want to re

Gradle Test Run :core:integrationTest > Gradle Test Executor 177 >
ProducerFailureHandlingTest > testTooLargeRecordWithAckZero() STARTED

> Task :clients:integrationTest FAILED
org.gradle.internal.remote.internal.ConnectException: Could not
connect to server [bd7b0504-7491-43f8-a716-513adb302c92 port:43321,
addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1141)
at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1183)
at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:98)
at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
... 5 more
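
The manual search through the plain-text console output could
probably be scripted. The following is only a rough sketch, not an
existing Kafka tool: the function names are made up for illustration,
it assumes each test event appears on a single line in the format
shown above ("... > Gradle Test Executor N > Class > test() STARTED"),
and it assumes the "view as plain text" link maps to Jenkins'
consoleText page. Output from parallel build stages is interleaved in
the same log, so the result should be treated as a hint and checked
by hand.

```python
import re

# Hypothetical helper: base URL of the Kafka PR builder, taken from the mail above.
PR_BUILDER = "https://ci-builds.apache.org/job/Kafka/job/kafka-pr"

# One test event per line, e.g.
#   Gradle Test Run :core:integrationTest > Gradle Test Executor 177 >
#   ProducerFailureHandlingTest > testTooLargeRecordWithAckZero() STARTED
# (shown wrapped here; assumed to be a single line in the real log).
EVENT = re.compile(
    r"Gradle Test Run (?P<task>\S+) > Gradle Test Executor (?P<executor>\d+) > "
    r"(?P<cls>\S+) > (?P<test>.+?) (?P<status>STARTED|PASSED|FAILED|SKIPPED)\s*$"
)


def console_text_url(pr):
    """URL of the plain-text console log of the last failed build of a PR job."""
    return f"{PR_BUILDER}/job/{pr}/lastFailedBuild/consoleText"


def unfinished_tests(lines):
    """Map executor number -> tests that STARTED but never reported a result."""
    running = {}
    for line in lines:
        m = EVENT.search(line)  # search() tolerates timestamp prefixes
        if not m:
            continue
        key = (int(m["executor"]), m["task"], m["cls"], m["test"])
        if m["status"] == "STARTED":
            running[key] = True
        else:
            # PASSED / FAILED / SKIPPED: the test finished, stop tracking it.
            running.pop(key, None)
    result = {}
    for executor, task, cls, test in running:
        result.setdefault(executor, []).append((task, cls, test))
    return result


if __name__ == "__main__":
    # Made-up excerpt; only the last line is taken from the mail above.
    prefix = "Gradle Test Run :core:integrationTest > Gradle Test Executor 177 > "
    sample = [
        prefix + "SomeOtherTest > testSomething() STARTED",
        prefix + "SomeOtherTest > testSomething() PASSED",
        prefix + "ProducerFailureHandlingTest > testTooLargeRecordWithAckZero() STARTED",
    ]
    print(console_text_url("PR-14094"))
    print(unfinished_tests(sample))
```

To use it on a real build, save the page behind
console_text_url("PR-14094") to a file and pass its lines to
unfinished_tests(); the entries listed for the crashed executor are
the candidates for the test that took the Gradle process down.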




Regarding the classes of test failure: looking at the last 28 days,
the following two tests are the biggest culprits. If we fix just
these two, our CI would be in much better shape. (link [1])
1. https://issues.apache.org/jira/browse/KAFKA-15197 (this test passes
only 53% of the time)
2. https://issues.apache.org/jira/browse/KAFKA-15052 (this test passes
only 49% of the time)


[1] 
https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin
[2] 
https://ge.apache.org/scans/failures?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=Europe/Berlin


--
Divij Vaidya

On Tue, Jul 25, 2023 at 8:09 PM Kirk True <k...@kirktrue.pro> wrote:
>
> Hi all!
>
> I’ve noticed that we’re back in the state where it’s tough to get a clean PR 
> Jenkins test run. Spot checking the top ~10 pull request runs show this 
> doesn’t appear to be an issue with just my PRs :P
>
> I know we have some chronic flaky tests, but I’ve seen at least two other 
> classes of problems:
>
> 1. Jenkins test runners hanging and eventually timing out
> 2. Intra Jenkins-container/pod/VM/machine/turtle communication issues
>
> How do we go about diagnosing test runs that fail in such an opaque fashion?
>
> Thanks!
> Kirk
