[
https://issues.apache.org/jira/browse/KAFKA-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881173#comment-17881173
]
Sagar Rao edited comment on KAFKA-17493 at 9/12/24 5:22 AM:
------------------------------------------------------------
[~ChrisEgerton], sorry, my bad. Yes, I do see that the ListOffsets call keeps
returning empty offsets until the timeout happens. I grepped the Group
Coordinator logs for the flaky and non-flaky cases, and what I notice is that in
the flaky case, the consumer group of the sink task never reached 2 members.
These are the lines from the flaky test:
{code:java}
[2024-09-06 21:59:57,843] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Dynamic member with unknown member id joins group
connect-testGetSinkConnectorOffsets in Empty state. Created a new member id
connector-consumer-testGetSinkConnectorOffsets-0-a47aa5b3-d9d8-4aa6-ab30-7f79c971b6ee
and requesting the member to rejoin with this id.
(org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 21:59:57,844] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Dynamic member with unknown member id joins group
connect-testGetSinkConnectorOffsets in Empty state. Created a new member id
connector-consumer-testGetSinkConnectorOffsets-1-4c065deb-6771-427d-902b-be788543b7bd
and requesting the member to rejoin with this id.
(org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 21:59:57,844] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Preparing to rebalance group connect-testGetSinkConnectorOffsets
in state PreparingRebalance with old generation 0 (reason: Adding new member
connector-consumer-testGetSinkConnectorOffsets-0-a47aa5b3-d9d8-4aa6-ab30-7f79c971b6ee
with group instance id null; client reason: need to re-join with the given
member-id:
connector-consumer-testGetSinkConnectorOffsets-0-a47aa5b3-d9d8-4aa6-ab30-7f79c971b6ee).
(org.apache.kafka.coordinator.group.GroupMetadataManager:4673)
[2024-09-06 21:59:57,844] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Stabilized group connect-testGetSinkConnectorOffsets generation 1
with 1 members.
(org.apache.kafka.coordinator.group.GroupMetadataManager:4383){code}
Even though 2 members tried to join, the group never stabilized with 2 members.
Contrast this with the passing case:
{code:java}
[2024-09-06 22:22:47,577] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Dynamic member with unknown member id joins group
connect-testGetSinkConnectorOffsets in Empty state. Created a new member id
connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd
and requesting the member to rejoin with this id.
(org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 22:22:47,579] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Preparing to rebalance group connect-testGetSinkConnectorOffsets
in state PreparingRebalance with old generation 0 (reason: Adding new member
connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd
with group instance id null; client reason: need to re-join with the given
member-id:
connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd).
(org.apache.kafka.coordinator.group.GroupMetadataManager:4673)
[2024-09-06 22:22:47,580] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Stabilized group connect-testGetSinkConnectorOffsets generation 1
with 1 members. (org.apache.kafka.coordinator.group.GroupMetadataManager:4383)
[2024-09-06 22:22:47,580] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Dynamic member with unknown member id joins group
connect-testGetSinkConnectorOffsets in CompletingRebalance state. Created a new
member id
connector-consumer-testGetSinkConnectorOffsets-0-b5112649-3008-432e-a1eb-a10593d049b3
and requesting the member to rejoin with this id.
(org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 22:22:47,581] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Assignment received from leader
connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd
for group connect-testGetSinkConnectorOffsets for generation 1. The group has
1 members, 0 of which are static.
(org.apache.kafka.coordinator.group.GroupMetadataManager:5142)
[2024-09-06 22:22:47,582] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Preparing to rebalance group connect-testGetSinkConnectorOffsets
in state PreparingRebalance with old generation 1 (reason: Adding new member
connector-consumer-testGetSinkConnectorOffsets-0-b5112649-3008-432e-a1eb-a10593d049b3
with group instance id null; client reason: need to re-join with the given
member-id:
connector-consumer-testGetSinkConnectorOffsets-0-b5112649-3008-432e-a1eb-a10593d049b3).
(org.apache.kafka.coordinator.group.GroupMetadataManager:4673)
[2024-09-06 22:22:47,583] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Stabilized group connect-testGetSinkConnectorOffsets generation 2
with 2 members. (org.apache.kafka.coordinator.group.GroupMetadataManager:4383)
[2024-09-06 22:22:47,584] INFO [GroupCoordinator id=0 topic=__consumer_offsets
partition=45] Assignment received from leader
connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd
for group connect-testGetSinkConnectorOffsets for generation 2. The group has
2 members, 0 of which are static.
(org.apache.kafka.coordinator.group.GroupMetadataManager:5142) {code}
So this seems in line with what Chris mentioned above. One difference between
the 2 cases, as I mentioned in the note above, is that the flaky test reuses an
existing Connect/Kafka cluster, where we need to delete the existing topic
etc., while in the passing test everything is fresh. I am attaching the grepped
Group Coordinator logs for reference:
[^flaky-tests-gc.txt]
[^passing-tests-gc.txt]
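The comparison above (did the group ever stabilize with 2 members?) can be checked mechanically instead of by eye. Below is a minimal sketch, assuming the log files contain "Stabilized group ... generation N with M members" entries in the same form as the lines quoted above. The class name StabilizedGroupSizes and the method maxStableMembers are hypothetical helpers made up for illustration; they are not part of Kafka or the test suite.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Kafka): summarizes "Stabilized group" log entries.
final class StabilizedGroupSizes {

    // Pattern derived from the log lines quoted in this comment.
    // \s+ between tokens tolerates entries that were wrapped across lines.
    private static final Pattern STABILIZED = Pattern.compile(
            "Stabilized\\s+group\\s+(\\S+)\\s+generation\\s+(\\d+)\\s+with\\s+(\\d+)\\s+members");

    /** Returns, per group id, the largest member count seen in any "Stabilized group" entry. */
    static Map<String, Integer> maxStableMembers(String logText) {
        Map<String, Integer> maxByGroup = new LinkedHashMap<>();
        Matcher m = STABILIZED.matcher(logText);
        while (m.find()) {
            String groupId = m.group(1);
            int members = Integer.parseInt(m.group(3));
            // Keep the maximum member count observed for this group.
            maxByGroup.merge(groupId, members, Integer::max);
        }
        return maxByGroup;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to a grepped Group Coordinator log file.
        String logText = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        maxStableMembers(logText).forEach((group, members) ->
                System.out.println(group + " -> max stable members: " + members));
    }
}
```

For the excerpts quoted above, this would report a maximum of 1 for connect-testGetSinkConnectorOffsets in the flaky case and 2 in the passing case. A group missing from the output never logged a "Stabilized group" entry at all.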
> Sink connector-related OffsetsApiIntegrationTest suite test cases failing
> more frequently with new consumer/group coordinator
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-17493
> URL: https://issues.apache.org/jira/browse/KAFKA-17493
> Project: Kafka
> Issue Type: Test
> Components: connect, consumer, group-coordinator
> Reporter: Chris Egerton
> Priority: Major
> Attachments: flaky-tests-gc.txt, passing-tests-gc.txt
>
>
> We recently updated trunk to use the new KIP-848 consumer/group coordinator
> by default, which appears to have led to an uptick in flakiness for the
> OffsetsApiIntegrationTest suite for Connect (specifically, the test cases
> that use sink connectors, which makes sense since they're the type of
> connector that uses a consumer group under the hood).
> Gradle Enterprise shows that in the week before that update was made, the
> test suite had a flakiness rate of about 4%
> (https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1724558400000&search.startTimeMin=1723953600000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.*&tests.sortField=FLAKY),
> and in the week and a half since, the flakiness rate has jumped to 17%
> (https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1725681599999&search.startTimeMin=1724731200000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.*&tests.sortField=FLAKY).