[ https://issues.apache.org/jira/browse/KAFKA-15832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lianet Magrans updated KAFKA-15832:
-----------------------------------
    Description:
Currently the reconciliation logic on the client is triggered when a new target assignment is received and resolved, or when new unresolved target assignments are discovered in metadata.

This could be improved by triggering the reconciliation logic on each poll iteration, to reconcile whatever is ready to be reconciled. This would require changes to support poll on the MembershipManager, and to integrate it with the current polling logic in the background thread. Receiving a new target assignment from the broker, or resolving new topic names via a metadata update, would then only ensure that the #assignmentReadyToReconcile are properly updated (currently done), but would not trigger the #reconcile() logic, leaving that to the #poll() operation.

As a result of this task, we should validate that the client always reconciles whatever it has pending that has not been removed by the coordinator. This should address edge cases where a client might get stuck in JOINING/RECONCILING with a pending reconciliation, where null assignments are exchanged between the client and the coordinator while the long-running reconciliation completes. Note that currently the MembershipManager relies on assignment != null to trigger the reconciliation of pending assignments.

With the current logic, the following sequence would leave the client stuck in JOINING:
- client joins, epoch 0
- client receives assignment tp1, stuck RECONCILING, epoch 1
- member gets FENCED on the coord, coord bumps epoch to 2
- client tries to rejoin (JOINING), epoch 0 provided by the client
- client added to the group (group epoch bumped to 2), client receives the same assignment that it is currently trying to reconcile (tp1)
- reconciliation completes; the client will discard the reconciliation result if it completes after the fencing, because it will notice that the memberHasRejoined (memberEpochOnReconciliationStart != memberEpoch).
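The proposal above can be illustrated with a small, self-contained Java sketch. It is only a toy model under stated assumptions, not the actual MembershipManager: the class PollDrivenReconciler and all of its methods and fields are hypothetical names invented for this example, and the asynchronous reconciliation is simulated with an explicit completeReconciliation() call. It replays the fencing sequence from the description and shows that, when poll() is the only trigger, a discarded (stale) reconciliation is simply retried on the next poll iteration.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of poll-driven reconciliation.
// None of these names are taken from the Kafka code base.
final class PollDrivenReconciler {
    private int memberEpoch = 0;
    private final Set<String> readyToReconcile = new TreeSet<>();
    private final Set<String> owned = new TreeSet<>();

    // State of the single in-flight reconciliation, if any.
    private boolean inFlight = false;
    private int epochOnReconciliationStart = 0;
    private Set<String> inFlightTarget = Set.of();

    // A new target assignment only records state; it never reconciles.
    void onTargetAssignment(int epoch, Set<String> partitions) {
        memberEpoch = epoch;
        readyToReconcile.clear();
        readyToReconcile.addAll(partitions);
    }

    // Fencing followed by a rejoin: the member starts over from epoch 0.
    void onFencedAndRejoining() {
        memberEpoch = 0;
    }

    // Called on every poll iteration; the only place a reconciliation starts.
    // Returns true if a reconciliation was started.
    boolean poll() {
        if (inFlight || readyToReconcile.equals(owned)) {
            return false;
        }
        inFlight = true;
        epochOnReconciliationStart = memberEpoch;
        inFlightTarget = new TreeSet<>(readyToReconcile);
        return true;
    }

    // Simulates the long-running reconciliation finishing.
    // Returns true if the result was applied, false if it was discarded.
    boolean completeReconciliation() {
        if (!inFlight) {
            return false;
        }
        inFlight = false;
        boolean memberHasRejoined = epochOnReconciliationStart != memberEpoch;
        if (memberHasRejoined) {
            return false; // stale result is dropped; the next poll() retries
        }
        owned.clear();
        owned.addAll(inFlightTarget);
        return true;
    }

    Set<String> owned() {
        return Set.copyOf(owned);
    }

    public static void main(String[] args) {
        PollDrivenReconciler member = new PollDrivenReconciler();
        member.onTargetAssignment(1, Set.of("tp1"));
        member.poll();                               // starts reconciling tp1 at epoch 1
        member.onFencedAndRejoining();               // fenced, rejoins with epoch 0
        member.onTargetAssignment(2, Set.of("tp1")); // same assignment, epoch 2
        System.out.println("applied: " + member.completeReconciliation());
        System.out.println("retry started: " + member.poll());
        System.out.println("applied: " + member.completeReconciliation());
        System.out.println("owned: " + member.owned());
    }
}
```

In this model the member cannot stay stuck after the stale result is discarded, because the pending assignment remains recorded and every later poll() compares it against what is owned. Whether the real client should compare sets this way, or track pending work differently, is a design decision this sketch does not settle.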
     was:
Currently the reconciliation logic on the client is triggered when a new target assignment is received and resolved, or when new unresolved target assignments are discovered in metadata.

This could be improved by triggering the reconciliation logic on each poll iteration, to reconcile whatever is ready to be reconciled. This would require changes to support poll on the MembershipManager, and integrate it with the current polling logic in the background thread.

As a result of this task, it should be ensured that the client always reconciles whatever it has pending that has not been removed by the coordinator. (This should address edge cases where a client might get stuck JOINING/RECONCILING, with a pending reconciliation, where null assignments are exchanged between the client and the coordinator, while the long-running reconciliation completes. Note that currently, the MembershipManager relies on assignment != null to trigger the reconciliation of pending assignments.)

> Trigger client reconciliation based on manager poll
> ---------------------------------------------------
>
>                 Key: KAFKA-15832
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15832
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: clients, consumer
>            Reporter: Lianet Magrans
>            Assignee: Lianet Magrans
>            Priority: Major
>              Labels: kip-848, kip-848-client-support, kip-848-e2e, kip-848-preview
>             Fix For: 3.8.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)