Shekhar Prasad Rajak created KAFKA-20710:
--------------------------------------------

             Summary: ShareCoordinatorService periodic jobs can duplicate after 
share feature disable/re-enable
                 Key: KAFKA-20710
                 URL: https://issues.apache.org/jira/browse/KAFKA-20710
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 4.3.0
            Reporter: Shekhar Prasad Rajak
            Assignee: Shekhar Prasad Rajak
             Fix For: 4.3.2


ShareCoordinatorService schedules self-recursing timer jobs. Disabling and
re-enabling the share feature only flips shouldRunPeriodicJob, so old queued
or in-flight tasks can recurse after the re-enable and create duplicate job
chains.

 

void setupRecordPruning() {
    timer.add(new TimerTask(...) {
        public void run() {
            ...
            CompletableFuture.allOf(...).whenComplete((res, exp) -> {
                ...
                setupRecordPruning(); // schedules the next prune job
            });
        }
    });
}

 

1. setupRecordPruning() adds timer task A.
2. The timer fires task A.
3. Task A does the prune work.
4. When the prune work completes, task A calls setupRecordPruning() again.
5. That adds timer task B.
6. Task B later does the same.

The risk: if the feature is disabled and re-enabled while task A is still
queued or in-flight, the re-enable schedules a new task chain, but the old
task A may also resume and schedule its own chain. The result is duplicate
periodic jobs running concurrently.
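
The sequence above can be reproduced with a small stand-alone model. This is only a sketch, not the actual ShareCoordinatorService code: the class UnfencedPeriodicJob, the manual queue standing in for the timer, and the methods enable(), disable(), fireNext() and pending() are all hypothetical names introduced for illustration. It assumes the task checks shouldRunPeriodicJob when it runs, which is the only guard the description mentions.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical model of a self-rescheduling job guarded only by a boolean flag. */
class UnfencedPeriodicJob {
    // Stand-in for the real timer: a task runs only when fireNext() is called.
    private final Queue<Runnable> timer = new ArrayDeque<>();
    private boolean shouldRunPeriodicJob = false;

    void enable() {
        shouldRunPeriodicJob = true;
        setupRecordPruning(); // starts a new chain on every enable
    }

    void disable() {
        shouldRunPeriodicJob = false; // queued tasks are left in the timer
    }

    private void setupRecordPruning() {
        timer.add(() -> {
            if (!shouldRunPeriodicJob) {
                return; // only stops the chain if the flag is still false
            }
            // (prune work would happen here)
            setupRecordPruning(); // schedules the next prune job
        });
    }

    /** Runs the oldest queued task; returns false if the timer is empty. */
    boolean fireNext() {
        Runnable task = timer.poll();
        if (task == null) {
            return false;
        }
        task.run();
        return true;
    }

    int pending() {
        return timer.size();
    }
}
```

In this model, calling enable(), disable(), enable() before the first task fires leaves two tasks queued. Both see the flag as true when they run, so both reschedule, and two chains stay alive indefinitely.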

 

The expected behavior is one active prune chain and one active snapshot chain
per service instance. Add a generation guard (epoch fencing) so that stale
timer tasks and stale async completions cannot reschedule.
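
One possible shape for such a guard is sketched below. It is a minimal illustration under stated assumptions, not the proposed patch: FencedPeriodicJob, its manual timer queue, and the methods enable(), disable(), fireNext(), pending() and runs() are hypothetical. The idea is that every disable or enable bumps a generation counter, each task remembers the generation it was scheduled under, and a task whose generation no longer matches neither works nor reschedules.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of a self-rescheduling job fenced by a generation counter. */
class FencedPeriodicJob {
    // Stand-in for the real timer: a task runs only when fireNext() is called.
    private final Queue<Runnable> timer = new ArrayDeque<>();
    // Bumped on every enable and disable; tasks capture the value they were scheduled under.
    private final AtomicInteger generation = new AtomicInteger();
    private volatile boolean enabled = false;
    private int runs = 0;

    void enable() {
        enabled = true;
        schedule(generation.incrementAndGet());
    }

    void disable() {
        enabled = false;
        generation.incrementAndGet(); // fences every task scheduled so far
    }

    private boolean isCurrent(int scheduledGeneration) {
        return enabled && scheduledGeneration == generation.get();
    }

    private void schedule(int scheduledGeneration) {
        timer.add(() -> {
            if (!isCurrent(scheduledGeneration)) {
                return; // stale timer task: do nothing
            }
            runs++; // (prune work would happen here)
            // In an async version this second check would sit in the completion
            // callback, so a stale completion cannot reschedule either.
            if (isCurrent(scheduledGeneration)) {
                schedule(scheduledGeneration);
            }
        });
    }

    /** Runs the oldest queued task; returns false if the timer is empty. */
    boolean fireNext() {
        Runnable task = timer.poll();
        if (task == null) {
            return false;
        }
        task.run();
        return true;
    }

    int pending() {
        return timer.size();
    }

    int runs() {
        return runs;
    }
}
```

With this guard, the enable(), disable(), enable() sequence still leaves two tasks queued, but the older one is dropped when it fires and only the task from the latest generation reschedules, so a single chain remains.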



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
