[ https://issues.apache.org/jira/browse/KAFKA-16389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833704#comment-17833704 ]
Philip Nee commented on KAFKA-16389: ------------------------------------ Hi [~kirktrue] Thanks for the initial investigation. I think your approach makes sense but I do think we need to rewrite the verifiable_consumer.py's event handler. As the states transition doesn't necessary match the behavior of the current consumer. And I think that's why there's still some flakiness in the patch you submitted. See my notes below: I'm still occasionally getting errors like: "ducktape.errors.TimeoutError: expected valid assignments of 6 partitions when num_started 1: [('ducker@ducker11', [])]" This seems to be caused by some weird reconciliation state. For example: Here We can see consumer1 got assigned 6 partitions and then immediately gave up all of them. It is unclear why onPartitionsRevoke is triggered. {code:java} 1 node <ducktape.cluster.cluster.ClusterNode object at 0xffff9fe617c0> wait for member idx 1 partiton assigned [{'topic': 'test_topic', 'partition': 0}, {'topic': 'test_topic', 'partition': 1}, {'topic': 'test_topic', 'partition': 2}, {'topic': 'test_topic', 'partition': 3}, {'topic': 'test_topic', 'partition': 4}, {'topic': 'test_topic', 'partition': 5}] idx 1 partiton revoked [{'topic': 'test_topic', 'partition': 0}, {'topic': 'test_topic', 'partition': 1}, {'topic': 'test_topic', 'partition': 2}, {'topic': 'test_topic', 'partition': 3}, {'topic': 'test_topic', 'partition': 4}, {'topic': 'test_topic', 'partition': 5}] node: ducker11 Current assignment: {<ducktape.cluster.cluster.ClusterNode object at 0xffff9fe617c0>: []} idx 1 partiton assigned [] [WARNING - 2024-04-03 11:05:34,587 - service_registry - stop_all - lineno:53]: Error stopping service <VerifiableConsumer-0-281473364398912: num_nodes: 3, nodes: ['ducker11', 'ducker12', 'ducker13']>: <ducktape.cluster.cluster.ClusterNode object at 0xffff9fe61640> [WARNING - 2024-04-03 11:06:09,128 - service_registry - clean_all - lineno:67]: Error cleaning service <VerifiableConsumer-0-281473364398912: num_nodes: 3, nodes: ['ducker11', 'ducker12', 'ducker13']>: <ducktape.cluster.cluster.ClusterNode object at 0xffff9fe61640> [INFO:2024-04-03 11:06:09,134]: RunnerClient: kafkatest.tests.client.consumer_test.AssignmentValidationTest.test_valid_assignment.metadata_quorum=ISOLATED_KRAFT.use_new_coordinator=True.group_protocol=consumer.group_remote_assignor=range: FAIL: TimeoutError("expected valid assignments of 6 partitions when num_started 1: [('ducker@ducker11', [])]") Traceback (most recent call last): File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 186, in _do_run data = self.run_test() File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 246, in run_test return self.test_context.function(self.test) File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 433, in wrapper return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) File "/opt/kafka-dev/tests/kafkatest/tests/client/consumer_test.py", line 583, in test_valid_assignment wait_until(lambda: self.valid_assignment(self.TOPIC, self.NUM_PARTITIONS, consumer.current_assignment()), File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception ducktape.errors.TimeoutError: expected valid assignments of 6 partitions when num_started 1: [('ducker@ducker11', [])] {code} > consumer_test.py’s test_valid_assignment fails with new consumer > ---------------------------------------------------------------- > > Key: KAFKA-16389 > URL: https://issues.apache.org/jira/browse/KAFKA-16389 > Project: Kafka > Issue Type: Bug > Components: clients, consumer, system tests > Affects Versions: 3.7.0 > Reporter: Kirk True > Assignee: Philip Nee > Priority: Blocker > Labels: kip-848-client-support, system-tests > Fix For: 3.8.0 > > Attachments: KAFKA-16389.patch > > > The following error is reported when running the {{test_valid_assignment}} > test from {{consumer_test.py}}: > {code} > Traceback (most recent call last): > File > "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", > line 186, in _do_run > data = self.run_test() > File > "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", > line 246, in run_test > return self.test_context.function(self.test) > File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line > 433, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File "/opt/kafka-dev/tests/kafkatest/tests/client/consumer_test.py", line > 584, in test_valid_assignment > wait_until(lambda: self.valid_assignment(self.TOPIC, self.NUM_PARTITIONS, > consumer.current_assignment()), > File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line > 58, in wait_until > raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from > last_exception > ducktape.errors.TimeoutError: expected valid assignments of 6 partitions when > num_started 2: [('ducker@ducker05', []), ('ducker@ducker06', [])] > {code} > To reproduce, create a system test suite file named > {{test_valid_assignment.yml}} with these contents: > {code:yaml} > failures: > - > 'kafkatest/tests/client/consumer_test.py::AssignmentValidationTest.test_valid_assignment@{"metadata_quorum":"ISOLATED_KRAFT","use_new_coordinator":true,"group_protocol":"consumer","group_remote_assignor":"range"}' > {code} > Then set the the {{TC_PATHS}} environment variable to include that test suite > file. -- This message was sent by Atlassian Jira (v8.20.10#820010)