lucasbru opened a new pull request, #13667: URL: https://github.com/apache/kafka/pull/13667
When an instance (or thread within an instance) of Kafka Streams has a soft failure and the group coordinator triggers a rebalance, that instance would temporarily become a "zombie writer". That is, this instance does not know there's already a new rebalance and hence its partitions have been migrated out, until it tries to commit and then got notified of the illegal-generation error and realize itself is the "zombie" already. During this period until the commit, this zombie may still be writing data to the changelogs of the migrated tasks as the new owner has already taken over and also writing to the changelogs. When EOS is enabled, this would not be a problem: when the zombie tries to commit and got notified that it's fenced, its zombie appends would be aborted. With EOS disabled, though, such shared writes would be interleaved on the changelogs where a zombie append may arrive later after the new writer's append, effectively overwriting that new append. Note that such interleaving writes do not necessarily cause corrupted data: as long as the new producer keep appending after the old zombie stops, and all the corrupted keys are overwritten again by the new values, then it is fine. However, if there are consecutive rebalances where right after the changelogs are corrupted by zombie writers, and before the new writer can overwrite them again, the task gets migrated again and needs to be restored from changelogs, the old values would be restored instead of the new values, effectively causing data loss. This change adds a new producer configuration to enable registering a transaction ID, but without requiring the client to use any of the transactional operations. This enables using the transaction ID to fence the zombie producer, just as with EOS, but when in ALOS mode. The new configuration is used in the streams producer when ALOS is enabled. The change includes an ALOS integration test that reliably fails without the `transaction.id`, and reliably passes with the `transaction.id`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org