lucasbru opened a new pull request, #13667:
URL: https://github.com/apache/kafka/pull/13667

   When an instance (or thread within an instance) of Kafka Streams has a soft 
failure and the group coordinator triggers a rebalance, that instance would 
temporarily become a "zombie writer". That is, this instance does not know 
there's already a new rebalance and hence its partitions have been migrated 
out, until it tries to commit and then got notified of the illegal-generation 
error and realize itself is the "zombie" already. During this period until the 
commit, this zombie may still be writing data to the changelogs of the migrated 
tasks as the new owner has already taken over and also writing to the 
changelogs.
   
   When EOS is enabled, this would not be a problem: when the zombie tries to 
commit and got notified that it's fenced, its zombie appends would be aborted. 
With EOS disabled, though, such shared writes would be interleaved on the 
changelogs where a zombie append may arrive later after the new writer's 
append, effectively overwriting that new append.
   
   Note that such interleaving writes do not necessarily cause corrupted data: 
as long as the new producer keep appending after the old zombie stops, and all 
the corrupted keys are overwritten again by the new values, then it is fine. 
However, if there are consecutive rebalances where right after the changelogs 
are corrupted by zombie writers, and before the new writer can overwrite them 
again, the task gets migrated again and needs to be restored from changelogs, 
the old values would be restored instead of the new values, effectively causing 
data loss.
   
   This change adds a new producer configuration to enable registering a 
transaction ID, but without requiring the client to use any of the 
transactional operations. This enables using the transaction ID to fence the 
zombie producer, just as with EOS, but when in ALOS mode. The new configuration 
is used in the streams producer when ALOS is enabled.
   
   The change includes an ALOS integration test that reliably fails without the 
`transaction.id`, and reliably passes with the `transaction.id`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to