Greg Harris created KAFKA-15905:
-----------------------------------

             Summary: Restarts of MirrorCheckpointTask should not permanently 
interrupt offset translation
                 Key: KAFKA-15905
                 URL: https://issues.apache.org/jira/browse/KAFKA-15905
             Project: Kafka
          Issue Type: Improvement
          Components: mirrormaker
    Affects Versions: 3.6.0
            Reporter: Greg Harris


Executive summary: When the MirrorCheckpointTask restarts, it loses the state 
of checkpointsPerConsumerGroup, which limits offset translation to records 
mirrored after the latest restart.

For example, if 1000 records are mirrored and the OffsetSyncs are read by 
MirrorCheckpointTask, the emitted checkpoints are cached, and translation can 
happen at the ~500th record. If MirrorCheckpointTask restarts, and 1000 more 
records are mirrored, translation can happen at the ~1500th record, but no 
longer at the ~500th record.

Context:

Before KAFKA-13659, MM2 made translation decisions based on the 
incompletely-initialized OffsetSyncStore, and the checkpoint could appear to go 
backwards temporarily during restarts. To fix this, we forced the 
OffsetSyncStore to initialize completely before translation could take place, 
ensuring that the latest OffsetSync had been read, and thus providing the most 
accurate translation.

Before KAFKA-14666, MM2 translated offsets only off of the latest OffsetSync. 
Afterwards, an in-memory sparse cache of historical OffsetSyncs was kept, to 
allow for translation of earlier offsets. This came with the caveat that the 
cache's sparseness allowed translations to go backwards permanently. To prevent 
this behavior, a cache of the latest Checkpoints was kept in the 
MirrorCheckpointTask#checkpointsPerConsumerGroup variable, and offset 
translation remained restricted to the fully-initialized OffsetSyncStore.

Effectively, the MirrorCheckpointTask ensures that it translates based on an 
OffsetSync emitted during it's lifetime, to ensure that no previous 
MirrorCheckpointTask emitted a later sync. If we can read the checkpoints 
emitted by previous generations of MirrorCheckpointTask, we can still ensure 
that checkpoints are monotonic, while allowing translation further back in 
history.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to