[
https://issues.apache.org/jira/browse/KAFKA-18194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Philip Nee updated KAFKA-18194:
-------------------------------
Description:
This is introduced after KIP-1102 I believe this is a critical bug because
bounced broker will caused rebalance.
When examining the consumer log you will see:
# When we bounce a broker, sometimes it triggers rebootstrapping
# Rebootstrapping forces metadata update
# Request joining group due to: cached metadata has changed from (version20:
\{test_topic=[NO_RACKS]}) at the beginning of the rebalance to (ver sion21:
\{test_topic=[]})
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
# Causing the consumer to rejoin every time the server is bounded
Here's the snippet of the log
{code:java}
8156 [2024-12-10 04:44:37,227] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Sending API_VERSIONS
request with header RequestHeader(apiKey=API_VERSIONS, apiVersion=4,
clientId=consumer-test_group_id-1, correlationId=1 22, headerVersion=2) and
timeout 30000 to node -1:
ApiVersionsRequestData(clientSoftwareName='apache-kafka-java',
clientSoftwareVersion='8.0.0-0-ce') (org.apache.kafka.clients.NetworkClient)
8157 [2024-12-10 04:44:37,227] TRACE [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Polling for fetches
with timeout 1600
(org.apache.kafka.clients.consumer.internals.ClassicKafkaConsumer)
8158 [2024-12-10 04:44:37,227] TRACE [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Found least loaded
connecting node ducker03:9092 (id: -1 rack: null)
(org.apache.kafka.clients.NetworkClient)
8159 [2024-12-10 04:44:37,227] TRACE For telemetry state SUBSCRIPTION_NEEDED,
returning the value 156047 ms;
(org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter)
8160 [2024-12-10 04:44:37,227] INFO [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Request joining group
due to: cached metadata has changed from (version20: {test_topic=[NO_RACKS]}
) at the beginning of the rebalance to (ver sion21:
{test_topic=[]}
) (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8161 [2024-12-10 04:44:37,227] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Executing
onJoinPrepare with generation 7 and memberId
consumer-test_group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d
(org.apache.kafka.clien ts.consumer.internals.ConsumerCoordinator)
8162 [2024-12-10 04:44:37,227] INFO [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Revoke previously
assigned partitions
(org.apache.kafka.clients.consumer.internals.ConsumerRebalanceListenerInvoker)
8163 [2024-12-10 04:44:37,228] INFO [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] (Re-)joining group
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8164 [2024-12-10 04:44:37,228] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Joining group with
current subscription: [test_topic]
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8165 [2024-12-10 04:44:37,228] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Sending JoinGroup
(JoinGroupRequestData(groupId='test_group_id', sessionTimeoutMs=30000,
rebalanceTimeoutMs=300000, memberId='consumer-test_
group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d', groupInstanceId=null,
protocolType='consumer', protocols=[JoinGroupRequestProtocol(name='range',
metadata=[0, 3, 0, 0, 0, 1, 0, 10, 116, 101, 115, 116, 95, 116, 111, 112, 105,
99, -1, -1, -1 , -1, 0, 0, 0, 0, 0, 0, 0, 7, -1, -1])], reason='cached
metadata has changed')) to coordinator ducker03:9092 (id: 2147483646 rack:
null) (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8166 [2024-12-10 04:44:37,228] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Sending JOIN_GROUP
request with header RequestHeader(apiKey=JOIN_GROUP, apiVersion=9,
clientId=consumer-test_group_id-1, correlationId=123, headerVersion=2) and
timeout 305000 to node 2147483646:
JoinGroupRequestData(groupId='test_group_id', sessionTimeoutMs=30000,
rebalanceTimeoutMs=300000,
memberId='consumer-test_group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d',
groupInstance Id=null, protocolType='consumer',
protocols=[JoinGroupRequestProtocol(name='range', metadata=[0, 3, 0, 0, 0, 1,
0, 10, 116, 101, 115, 116, 95, 116, 111, 112, 105, 99, -1, -1, -1, -1, 0, 0, 0,
0, 0, 0, 0, 7, -1, -1])], reason='cached metadat a has changed')
(org.apache.kafka.clients.NetworkClient)
{code}
# Please reproduce this running
{code:java}
TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_broker_failure"
bash tests/docker/run_tests.sh {code}
was:
This is introduced after KIP-1102 I believe this is a critical bug because
bounced broker will caused rebalance.
When examining the consumer log you will see:
# When we bounce a broker, sometimes it triggers rebootstrapping
# Rebootstrapping forces metadata update
# Request joining group due to: cached metadata has changed from (version20:
{test_topic=[NO_RACKS]}) at the beginning of the rebalance to (ver sion21:
\{test_topic=[]})
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
# Causing the consumer to rejoin every time the server is bounded
Here's the snippet of the log
{code:java}
8156 [2024-12-10 04:44:37,227] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Sending API_VERSIONS
request with header RequestHeader(apiKey=API_VERSIONS, apiVersion=4,
clientId=consumer-test_group_id-1, correlationId=1 22, headerVersion=2) and
timeout 30000 to node -1:
ApiVersionsRequestData(clientSoftwareName='apache-kafka-java',
clientSoftwareVersion='8.0.0-0-ce') (org.apache.kafka.clients.NetworkClient)
8157 [2024-12-10 04:44:37,227] TRACE [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Polling for fetches
with timeout 1600
(org.apache.kafka.clients.consumer.internals.ClassicKafkaConsumer)
8158 [2024-12-10 04:44:37,227] TRACE [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Found least loaded
connecting node ducker03:9092 (id: -1 rack: null)
(org.apache.kafka.clients.NetworkClient)
8159 [2024-12-10 04:44:37,227] TRACE For telemetry state SUBSCRIPTION_NEEDED,
returning the value 156047 ms;
(org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter)
8160 [2024-12-10 04:44:37,227] INFO [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Request joining group
due to: cached metadata has changed from (version20: {test_topic=[NO_RACKS]})
at the beginning of the rebalance to (ver sion21: {test_topic=[]})
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8161 [2024-12-10 04:44:37,227] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Executing
onJoinPrepare with generation 7 and memberId
consumer-test_group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d
(org.apache.kafka.clien ts.consumer.internals.ConsumerCoordinator)
8162 [2024-12-10 04:44:37,227] INFO [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Revoke previously
assigned partitions
(org.apache.kafka.clients.consumer.internals.ConsumerRebalanceListenerInvoker)
8163 [2024-12-10 04:44:37,228] INFO [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] (Re-)joining group
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8164 [2024-12-10 04:44:37,228] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Joining group with
current subscription: [test_topic]
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8165 [2024-12-10 04:44:37,228] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Sending JoinGroup
(JoinGroupRequestData(groupId='test_group_id', sessionTimeoutMs=30000,
rebalanceTimeoutMs=300000, memberId='consumer-test_
group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d', groupInstanceId=null,
protocolType='consumer', protocols=[JoinGroupRequestProtocol(name='range',
metadata=[0, 3, 0, 0, 0, 1, 0, 10, 116, 101, 115, 116, 95, 116, 111, 112, 105,
99, -1, -1, -1 , -1, 0, 0, 0, 0, 0, 0, 0, 7, -1, -1])], reason='cached
metadata has changed')) to coordinator ducker03:9092 (id: 2147483646 rack:
null) (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
8166 [2024-12-10 04:44:37,228] DEBUG [Consumer
clientId=consumer-test_group_id-1, groupId=test_group_id] Sending JOIN_GROUP
request with header RequestHeader(apiKey=JOIN_GROUP, apiVersion=9,
clientId=consumer-test_group_id-1, correlationId=123, headerVersion=2) and
timeout 305000 to node 2147483646:
JoinGroupRequestData(groupId='test_group_id', sessionTimeoutMs=30000,
rebalanceTimeoutMs=300000,
memberId='consumer-test_group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d',
groupInstance Id=null, protocolType='consumer',
protocols=[JoinGroupRequestProtocol(name='range', metadata=[0, 3, 0, 0, 0, 1,
0, 10, 116, 101, 115, 116, 95, 116, 111, 112, 105, 99, -1, -1, -1, -1, 0, 0, 0,
0, 0, 0, 0, 7, -1, -1])], reason='cached metadat a has changed')
(org.apache.kafka.clients.NetworkClient) {code}
Please reproduce this running
{code:java}
TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_broker_failure"
bash tests/docker/run_tests.sh {code}
> Flaky test_broker_rolling_bounce due to frequent rebootstrapping
> ----------------------------------------------------------------
>
> Key: KAFKA-18194
> URL: https://issues.apache.org/jira/browse/KAFKA-18194
> Project: Kafka
> Issue Type: Bug
> Components: consumer, system tests
> Reporter: Philip Nee
> Priority: Critical
>
> This is introduced after KIP-1102 I believe this is a critical bug because
> bounced broker will caused rebalance.
> When examining the consumer log you will see:
> # When we bounce a broker, sometimes it triggers rebootstrapping
> # Rebootstrapping forces metadata update
> # Request joining group due to: cached metadata has changed from (version20:
> \{test_topic=[NO_RACKS]}) at the beginning of the rebalance to (ver
> sion21: \{test_topic=[]})
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> # Causing the consumer to rejoin every time the server is bounded
> Here's the snippet of the log
>
> {code:java}
> 8156 [2024-12-10 04:44:37,227] DEBUG [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Sending
> API_VERSIONS request with header RequestHeader(apiKey=API_VERSIONS,
> apiVersion=4, clientId=consumer-test_group_id-1, correlationId=1 22,
> headerVersion=2) and timeout 30000 to node -1:
> ApiVersionsRequestData(clientSoftwareName='apache-kafka-java',
> clientSoftwareVersion='8.0.0-0-ce') (org.apache.kafka.clients.NetworkClient)
> 8157 [2024-12-10 04:44:37,227] TRACE [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Polling for fetches
> with timeout 1600
> (org.apache.kafka.clients.consumer.internals.ClassicKafkaConsumer)
> 8158 [2024-12-10 04:44:37,227] TRACE [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Found least loaded
> connecting node ducker03:9092 (id: -1 rack: null)
> (org.apache.kafka.clients.NetworkClient)
> 8159 [2024-12-10 04:44:37,227] TRACE For telemetry state SUBSCRIPTION_NEEDED,
> returning the value 156047 ms;
> (org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter)
> 8160 [2024-12-10 04:44:37,227] INFO [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Request joining
> group due to: cached metadata has changed from (version20:
> {test_topic=[NO_RACKS]}
> ) at the beginning of the rebalance to (ver sion21:
> {test_topic=[]}
> ) (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> 8161 [2024-12-10 04:44:37,227] DEBUG [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Executing
> onJoinPrepare with generation 7 and memberId
> consumer-test_group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d
> (org.apache.kafka.clien ts.consumer.internals.ConsumerCoordinator)
> 8162 [2024-12-10 04:44:37,227] INFO [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Revoke previously
> assigned partitions
> (org.apache.kafka.clients.consumer.internals.ConsumerRebalanceListenerInvoker)
> 8163 [2024-12-10 04:44:37,228] INFO [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] (Re-)joining group
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> 8164 [2024-12-10 04:44:37,228] DEBUG [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Joining group with
> current subscription: [test_topic]
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> 8165 [2024-12-10 04:44:37,228] DEBUG [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Sending JoinGroup
> (JoinGroupRequestData(groupId='test_group_id', sessionTimeoutMs=30000,
> rebalanceTimeoutMs=300000, memberId='consumer-test_
> group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d', groupInstanceId=null,
> protocolType='consumer', protocols=[JoinGroupRequestProtocol(name='range',
> metadata=[0, 3, 0, 0, 0, 1, 0, 10, 116, 101, 115, 116, 95, 116, 111, 112,
> 105, 99, -1, -1, -1 , -1, 0, 0, 0, 0, 0, 0, 0, 7, -1, -1])],
> reason='cached metadata has changed')) to coordinator ducker03:9092 (id:
> 2147483646 rack: null)
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> 8166 [2024-12-10 04:44:37,228] DEBUG [Consumer
> clientId=consumer-test_group_id-1, groupId=test_group_id] Sending JOIN_GROUP
> request with header RequestHeader(apiKey=JOIN_GROUP, apiVersion=9,
> clientId=consumer-test_group_id-1, correlationId=123, headerVersion=2)
> and timeout 305000 to node 2147483646:
> JoinGroupRequestData(groupId='test_group_id', sessionTimeoutMs=30000,
> rebalanceTimeoutMs=300000,
> memberId='consumer-test_group_id-1-61dfefb9-acdd-4050-ba0a-56f5d5ed1c3d',
> groupInstance Id=null, protocolType='consumer',
> protocols=[JoinGroupRequestProtocol(name='range', metadata=[0, 3, 0, 0, 0, 1,
> 0, 10, 116, 101, 115, 116, 95, 116, 111, 112, 105, 99, -1, -1, -1, -1, 0, 0,
> 0, 0, 0, 0, 0, 7, -1, -1])], reason='cached metadat a has changed')
> (org.apache.kafka.clients.NetworkClient)
> {code}
> # Please reproduce this running
> {code:java}
> TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_broker_failure"
> bash tests/docker/run_tests.sh {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)