[ https://issues.apache.org/jira/browse/KAFKA-12693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guozhang Wang updated KAFKA-12693:
----------------------------------
    Labels: new-streams-runtime-should-fix streams  (was: streams)

> Consecutive rebalances with zombie instances may cause corrupted changelogs
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-12693
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12693
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Priority: Major
>              Labels: new-streams-runtime-should-fix, streams
>
> When an instance (or a thread within an instance) of Kafka Streams has a soft
> failure and the group coordinator triggers a rebalance, that instance
> temporarily becomes a "zombie writer". That is, the instance does not know
> that a new rebalance has happened and that its partitions have been migrated
> away until it tries to commit, is notified of the illegal-generation error,
> and realizes that it is the zombie. Until that commit, the zombie may still
> be writing to the changelogs of the migrated tasks while the new owner has
> already taken over and is writing to the same changelogs.
> When EOS is enabled, this is not a problem: when the zombie tries to commit
> and is notified that it has been fenced, its appends are aborted. With EOS
> disabled, though, the shared writes are interleaved on the changelogs, and a
> zombie append may arrive after the new writer's append, effectively
> overwriting it.
> Note that such interleaved writes do not necessarily corrupt data: as long
> as the new producer keeps appending after the old zombie stops, and all the
> corrupted keys are overwritten again with the new values, the changelog ends
> up correct.
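The overwrite described above can be illustrated with a toy model. This is only a sketch, not Kafka Streams code: the `Append` class, the `restore` helper, and the writer names are hypothetical, and the changelog is modeled as a plain list replayed in offset order with the last value per key winning.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChangelogInterleavingSketch {

    // One changelog append: which writer produced it, and the key/value it carries.
    // (Hypothetical type for illustration only.)
    static final class Append {
        final String writer;
        final String key;
        final String value;

        Append(String writer, String key, String value) {
            this.writer = writer;
            this.key = key;
            this.value = value;
        }
    }

    // Restoring a state store is modeled as replaying the changelog in offset
    // order; the last append for each key wins.
    static Map<String, String> restore(List<Append> changelog) {
        Map<String, String> store = new LinkedHashMap<>();
        for (Append a : changelog) {
            store.put(a.key, a.value);
        }
        return store;
    }

    public static void main(String[] args) {
        List<Append> changelog = new ArrayList<>();
        changelog.add(new Append("new-owner", "k", "v2")); // new owner's up-to-date value
        changelog.add(new Append("zombie", "k", "v1"));    // zombie's stale append lands later

        // A restore at this point would bring back the stale value.
        System.out.println(restore(changelog).get("k")); // prints "v1"

        // If the new owner writes the key again after the zombie stops,
        // the changelog is correct again.
        changelog.add(new Append("new-owner", "k", "v3"));
        System.out.println(restore(changelog).get("k")); // prints "v3"
    }
}
```

The window between the zombie's append and the new owner's next write to the same key is the only period in which a restore would see the stale value.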
> However, if consecutive rebalances occur such that, right after the
> changelogs are corrupted by zombie writers and before the new writer can
> overwrite them again, the task is migrated again and has to be restored from
> the changelogs, then the old values are restored instead of the new ones,
> effectively causing data loss.
> Although this should be a rare event, we should still fix it as soon as
> possible. One idea is to have producers get a PID even under ALOS: that is,
> we set the transactional id in the producer config but do not call any txn
> APIs; zombie producers would then be fenced immediately on append, so there
> would be no interleaved appends. I think this may still require a KIP, since
> today one has to call initTransactions() in order to register and get the PID.
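The fencing idea can be sketched with a toy model of epoch checks on append. This is an assumption-laden illustration, not the Kafka client or broker API: `FencedLog`, `register`, and `append` are hypothetical names, and the model only captures the general principle that a newer registration under the same id invalidates appends from the older holder.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EpochFencingSketch {

    // Toy stand-in for the broker side of a changelog partition (not a Kafka API).
    static final class FencedLog {
        private final Map<String, Integer> latestEpoch = new HashMap<>();
        private final List<String> records = new ArrayList<>();

        // Registering under an existing id bumps the epoch, which fences
        // whoever still holds the previous epoch.
        int register(String producerId) {
            return latestEpoch.merge(producerId, 1, Integer::sum);
        }

        // An append carrying a stale (or unknown) epoch is rejected
        // instead of being written to the log.
        boolean append(String producerId, int epoch, String record) {
            Integer latest = latestEpoch.get(producerId);
            if (latest == null || epoch < latest) {
                return false;
            }
            records.add(record);
            return true;
        }

        List<String> records() {
            return records;
        }
    }

    public static void main(String[] args) {
        FencedLog log = new FencedLog();
        String id = "task-0_1"; // hypothetical per-task id shared by old and new owner

        int zombieEpoch = log.register(id);   // original owner registers first
        int newOwnerEpoch = log.register(id); // after the rebalance, the new owner registers

        System.out.println(log.append(id, newOwnerEpoch, "k=v2")); // prints "true"
        System.out.println(log.append(id, zombieEpoch, "k=v1"));   // prints "false"
        System.out.println(log.records());                         // prints "[k=v2]"
    }
}
```

In this model the zombie's stale append never reaches the log, so a later restore cannot pick it up, which is the behavior the proposal is after for ALOS.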