[
https://issues.apache.org/jira/browse/KAFKA-15891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899244#comment-17899244
]
Will Perlichek commented on KAFKA-15891:
----------------------------------------
[~ChrisEgerton] [~yash.mayya] [~gregharris73]
Hi all,
I have been working on this ticket for about a week now, specifically diving
into OffsetsApiIntegrationTest.java and trying to resolve flakiness.
I want to touch base now and make sure I am on the right track, and I saw that
all three of you had worked on or contributed to discussions on this file.
I'll attempt a very targeted question to keep it brief:
I strongly think that _some_ flakiness in
testResetSinkConnectorOffsetsOverriddenConsumerGroupId could be reduced by
restarting the Connect cluster when we're sure the failure was caused by a
zombie sink task.
We can be reasonably confident that zombie sink tasks kept this method from
finishing, based on the error message logged in this example CI failure.
Develocity reference:
[https://ge.apache.org/s/r4f5opmfmls54/tests/task/:connect:runtime:quarantinedTest/details/org.apache.kafka.connect.integration.OffsetsApiIntegrationTest/testResetSinkConnectorOffsetsOverriddenConsumerGroupId()?top-execution=1]
Error message:
ERROR Failed to reset consumer group offsets for connector
testResetSinkConnectorOffsetsOverriddenConsumerGroupId either because its tasks
haven't stopped completely yet or the connector was resumed before the request
to reset its offsets could be successfully completed. If the connector is in a
stopped state, this operation can be safely retried. If it doesn't eventually
succeed, the Connect cluster may need to be restarted to get rid of the zombie
sink tasks.
The test retried for 30 seconds, so to me the evidence points to the zombie
sink task problem.
It is actually the helper method modifySinkConnectorOffsetsWithRetry that times
out, inside waitForCondition. My assumption is that waitForCondition never
succeeds because the zombie sink task keeps the consumer group non-empty, so
every attempt to delete offsets through the offsets API fails with a
GroupNotEmptyException.
My question:
In the test code here
[https://github.com/apache/kafka/blob/50c15b94c94fbe8f964703c057963b38100b0bd6/connect/runtime/src/test/java/org/apache/kafka/connect/integration/OffsetsApiIntegrationTest.java#L775]
I could restart the Connect cluster, as the error message above advises.
I think this would reduce flakiness in this test, and a similar approach could
be adopted for other tests in the class, such as
testAlterSinkConnectorOffsetsDifferentKafkaClusterTargeted, which also appears
to be flaky due to zombie sink tasks.
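To make the idea concrete, here is a rough, self-contained sketch of the
fallback I have in mind. It does not touch the real test harness: the class
RetryWithRestartSketch, both methods, and the resetOffsets / restartCluster
callbacks are hypothetical stand-ins for the offsets API call and for whatever
cluster restart the test would actually perform, and the polling loop only
imitates the general shape of waitForCondition.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch only: none of these names exist in OffsetsApiIntegrationTest.
final class RetryWithRestartSketch {

    // Polls `attempt` until it returns true or `timeoutMs` elapses.
    // A thrown RuntimeException (standing in for a rejected reset request)
    // is treated as a failed attempt and retried.
    static boolean pollUntil(BooleanSupplier attempt, long timeoutMs, long pollIntervalMs) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            try {
                if (attempt.getAsBoolean()) {
                    return true;
                }
            } catch (RuntimeException e) {
                // Failed attempt; fall through and retry until the deadline.
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // First tries the reset normally. If that times out, assumes a zombie
    // sink task, restarts the cluster once, and tries again.
    static boolean resetWithRestartFallback(BooleanSupplier resetOffsets,
                                            Runnable restartCluster,
                                            long timeoutMs,
                                            long pollIntervalMs) {
        if (pollUntil(resetOffsets, timeoutMs, pollIntervalMs)) {
            return true;
        }
        restartCluster.run();
        return pollUntil(resetOffsets, timeoutMs, pollIntervalMs);
    }
}
```

The restart is attempted only once and only after the normal retry window has
been exhausted, so a run without a zombie task would behave as it does today.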
I'd like to attempt a solution along these lines if you think the approach is
correct. Or is this too much of a band-aid that doesn't address the core
problem? If the latter, can you suggest a more robust way to handle zombie sink
tasks in the context of this class? My goal here is to reduce test flakiness
overall in this class.
Thanks,
Will
> Flaky test: testResetSinkConnectorOffsetsOverriddenConsumerGroupId –
> org.apache.kafka.connect.integration.OffsetsApiIntegrationTest
> -----------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-15891
> URL: https://issues.apache.org/jira/browse/KAFKA-15891
> Project: Kafka
> Issue Type: Bug
> Components: connect
> Reporter: Apoorv Mittal
> Assignee: Will Perlichek
> Priority: Major
> Labels: flaky-test
>
> h4. Error
> org.opentest4j.AssertionFailedError: Condition not met within timeout 30000.
> Sink connector consumer group offsets should catch up to the topic end
> offsets ==> expected: <true> but was: <false>
> h4. Stacktrace
> org.opentest4j.AssertionFailedError: Condition not met within timeout 30000.
> Sink connector consumer group offsets should catch up to the topic end
> offsets ==> expected: <true> but was: <false>
> at
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> at
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210)
> at
> app//org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
> at
> app//org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
> at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)
> at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)
> at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)
> at
> app//org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:917)
> at
> app//org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.resetAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:725)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)