[ 
https://issues.apache.org/jira/browse/KAFKA-12693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-12693:
----------------------------------
    Labels: new-streams-runtime-should-fix streams  (was: streams)

> Consecutive rebalances with zombie instances may cause corrupted changelogs
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-12693
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12693
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Priority: Major
>              Labels: new-streams-runtime-should-fix, streams
>
> When an instance (or a thread within an instance) of Kafka Streams has a soft
> failure and the group coordinator triggers a rebalance, that instance
> temporarily becomes a "zombie writer". That is, the instance does not know
> that a new rebalance has happened and that its partitions have been migrated
> away until it tries to commit, is notified of the illegal-generation error,
> and realizes that it is already the "zombie". Until that commit, the zombie
> may still be writing data to the changelogs of the migrated tasks while the
> new owner has already taken over and is also writing to the same
> changelogs.
> When EOS is enabled, this is not a problem: when the zombie tries to
> commit and is notified that it has been fenced, its zombie appends are
> aborted. With EOS disabled, though, such shared writes are interleaved
> on the changelogs, so a zombie append may arrive after the new
> writer's append, effectively overwriting that new append.
> Note that such interleaved writes do not necessarily cause corrupted data:
> as long as the new producer keeps appending after the old zombie stops, and
> all the corrupted keys are overwritten again by the new values, it is
> fine. However, with consecutive rebalances the task may be migrated again
> right after the changelogs are corrupted by zombie writers and before the
> new writer can overwrite them. The task then needs to be restored from the
> changelogs, and the old values are restored instead of the new values,
> effectively causing data loss.
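
The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the changelog is modelled as an in-memory list, restoration as a last-write-wins replay, and the class and method names (ZombieChangelogSketch, Entry, restore) are hypothetical, not Kafka Streams APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ZombieChangelogSketch {
    /** One changelog record (hypothetical model, not a Kafka class). */
    static final class Entry {
        final String key;
        final String value;

        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Restoration replays the log in offset order; the last append per key wins. */
    static Map<String, String> restore(List<Entry> changelog) {
        Map<String, String> store = new HashMap<>();
        for (Entry e : changelog) {
            store.put(e.key, e.value);
        }
        return store;
    }

    public static void main(String[] args) {
        List<Entry> changelog = new ArrayList<>();
        changelog.add(new Entry("k", "v1"));        // old owner, before the rebalance
        changelog.add(new Entry("k", "v2"));        // new owner after the first rebalance
        changelog.add(new Entry("k", "v1-stale"));  // late append from the zombie

        // A second rebalance here restores the stale value.
        System.out.println(restore(changelog).get("k")); // prints "v1-stale"

        // Without a second rebalance, the new owner's next append repairs the key.
        changelog.add(new Entry("k", "v3"));
        System.out.println(restore(changelog).get("k")); // prints "v3"
    }
}
```

In this model the corruption is only visible if restoration happens in the window between the zombie's late append and the new owner's next write to the same key.
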
> Although this should be a rare event, we should still fix it asap. One idea
> is to have producers get a PID even under ALOS: that is, we set the
> transactional id in the producer config but do not trigger any txn APIs;
> zombie producers would then be immediately fenced on appends, and hence
> there would be no interleaved appends. I think this may still require a
> KIP, since today one has to call initTxn in order to register and get
> the PID.
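
The fencing idea proposed above can be sketched as a toy model. Everything here is an assumption made for illustration: ToyBroker, register, and append are hypothetical names, the per-task id "app-task-0_1" is invented, and the model only mimics the general idea that re-registering an id bumps an epoch so that appends carrying an older epoch are rejected. It is not the Kafka producer API or the broker's actual protocol.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EpochFencingSketch {
    /** Toy broker: remembers the newest epoch handed out per transactional id. */
    static final class ToyBroker {
        private final Map<String, Integer> latestEpoch = new HashMap<>();
        private final List<String> changelog = new ArrayList<>();

        /** Registering the same id again bumps the epoch, fencing older holders. */
        int register(String transactionalId) {
            return latestEpoch.merge(transactionalId, 1, Integer::sum);
        }

        /** Returns true if the record was appended, false if the epoch is stale. */
        boolean append(String transactionalId, int epoch, String record) {
            Integer latest = latestEpoch.get(transactionalId);
            if (latest == null || epoch < latest) {
                return false; // unknown or fenced producer: reject the append
            }
            changelog.add(record);
            return true;
        }

        List<String> changelog() {
            return changelog;
        }
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        String id = "app-task-0_1"; // hypothetical per-task transactional id

        int zombieEpoch = broker.register(id);  // original owner registers
        broker.append(id, zombieEpoch, "k=v1");

        int newEpoch = broker.register(id);     // new owner registers after the rebalance
        broker.append(id, newEpoch, "k=v2");

        // The zombie's late append carries the old epoch and is rejected.
        System.out.println(broker.append(id, zombieEpoch, "k=v1-stale")); // prints "false"
        System.out.println(broker.changelog()); // prints "[k=v1, k=v2]"
    }
}
```

Under these assumptions the stale append never reaches the log, so a later restoration cannot observe it, which is the property the proposal is after for ALOS.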



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
