Shekhar Prasad Rajak created KAFKA-20710:
--------------------------------------------
Summary: ShareCoordinatorService periodic jobs can duplicate after
share feature disable/re-enable
Key: KAFKA-20710
URL: https://issues.apache.org/jira/browse/KAFKA-20710
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 4.3.0
Reporter: Shekhar Prasad Rajak
Assignee: Shekhar Prasad Rajak
Fix For: 4.3.2
ShareCoordinatorService schedules self-recursing timer jobs. Disabling and
re-enabling the share feature only flips shouldRunPeriodicJob, so old queued
or in-flight tasks can recurse after re-enable and create duplicate job chains.
void setupRecordPruning() {
    timer.add(new TimerTask(...) {
        public void run() {
            ...
            CompletableFuture.allOf(...).whenComplete((res, exp) -> {
                ...
                setupRecordPruning(); // schedules next prune job
            });
        }
    });
}
1. setupRecordPruning() adds timer task A.
2. Timer fires task A.
3. Task A does prune work.
4. When prune work completes, task A calls setupRecordPruning() again.
5. That adds timer task B.
6. Task B later does the same.
The risk: if the feature is disabled and re-enabled while task A is still
queued or in-flight, re-enable schedules a new task chain, but the old task A
may also resume and schedule another chain. The service can then end up with
duplicate periodic jobs running.
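The interleaving described above can be reproduced with a small deterministic model. This is only a sketch, not the real ShareCoordinatorService: DuplicateChainDemo, its manual queue-based timer, fireNext() and pendingTasks() are all hypothetical names, and the flag check inside the task is an assumption about how shouldRunPeriodicJob is consulted.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical model of the flag-only behavior; not the real Kafka class. */
public class DuplicateChainDemo {
    // Hypothetical manual timer: a queued task runs only when fireNext() is called.
    private final Queue<Runnable> timer = new ArrayDeque<>();
    private boolean shouldRunPeriodicJob = false;
    private int pruneRuns = 0;

    public void enable() {
        shouldRunPeriodicJob = true;
        setupRecordPruning(); // starts a new chain on every enable
    }

    public void disable() {
        shouldRunPeriodicJob = false; // only flips the flag; queued tasks stay queued
    }

    private void setupRecordPruning() {
        timer.add(() -> {
            if (!shouldRunPeriodicJob) {
                return; // assumed flag check
            }
            pruneRuns++;          // stands in for the prune work
            setupRecordPruning(); // schedules the next prune job
        });
    }

    public void fireNext() {
        Runnable task = timer.poll();
        if (task != null) {
            task.run();
        }
    }

    public int pendingTasks() { return timer.size(); }

    public int pruneRuns() { return pruneRuns; }

    public static void main(String[] args) {
        DuplicateChainDemo service = new DuplicateChainDemo();
        service.enable();   // task A queued
        service.disable();  // task A is still queued
        service.enable();   // task B queued next to A
        service.fireNext(); // A sees the flag as true again and reschedules
        service.fireNext(); // B does the same
        // prints pending=2 pruneRuns=2
        System.out.println("pending=" + service.pendingTasks()
                + " pruneRuns=" + service.pruneRuns());
    }
}
```

Two tasks remain pending after each round, so in this model two independent chains stay alive for as long as the feature is enabled.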
Expected behavior is one active prune chain and one active snapshot chain per
service instance. Proposed fix: add a generation guard (epoch fencing) so that
stale timer tasks and stale async completions cannot reschedule.
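One possible shape for such a guard is sketched below. It is not the actual patch: GenerationGuardSketch, the generation counter, the manual queue-based timer and fireNext() are hypothetical names introduced for illustration. Each task captures the generation it was scheduled under and refuses to run or reschedule once that generation is no longer current.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for ShareCoordinatorService; not the real Kafka class. */
public class GenerationGuardSketch {
    // Hypothetical manual timer: a queued task runs only when fireNext() is called.
    private final Queue<Runnable> timer = new ArrayDeque<>();
    // Bumped on every enable/disable so older tasks can detect that they are stale.
    private final AtomicInteger generation = new AtomicInteger();
    private volatile boolean shouldRunPeriodicJob = false;
    private int pruneRuns = 0;

    public void enable() {
        shouldRunPeriodicJob = true;
        setupRecordPruning(generation.incrementAndGet());
    }

    public void disable() {
        shouldRunPeriodicJob = false;
        generation.incrementAndGet(); // fences every task scheduled so far
    }

    private void setupRecordPruning(int scheduledGeneration) {
        timer.add(() -> {
            if (!shouldRunPeriodicJob || scheduledGeneration != generation.get()) {
                return; // stale timer task: no work, no reschedule
            }
            pruneRuns++; // stands in for the prune work
            // An async completion would repeat the check before rescheduling.
            if (scheduledGeneration == generation.get()) {
                setupRecordPruning(scheduledGeneration);
            }
        });
    }

    public void fireNext() {
        Runnable task = timer.poll();
        if (task != null) {
            task.run();
        }
    }

    public int pendingTasks() { return timer.size(); }

    public int pruneRuns() { return pruneRuns; }

    public static void main(String[] args) {
        GenerationGuardSketch service = new GenerationGuardSketch();
        service.enable();   // task A queued under generation 1
        service.disable();  // generation 2, A is now stale
        service.enable();   // task B queued under generation 3
        service.fireNext(); // A is dropped
        service.fireNext(); // B prunes and reschedules itself
        // prints pending=1 pruneRuns=1
        System.out.println("pending=" + service.pendingTasks()
                + " pruneRuns=" + service.pruneRuns());
    }
}
```

Bumping the counter on disable as well as on enable means a task queued before the disable can never match the current generation again, even if the flag is true by the time it fires. The same idea would apply to the snapshot chain.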
--
This message was sent by Atlassian Jira
(v8.20.10#820010)