[ 
https://issues.apache.org/jira/browse/KAFKA-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-9144.
------------------------------------
    Fix Version/s: 2.4.1
       Resolution: Fixed

> Early expiration of producer state can cause coordinator epoch to regress
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-9144
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9144
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 2.4.1
>
>
> Transaction markers are written by the transaction coordinator. In order to 
> fence zombie coordinators, we use the leader epoch associated with the 
> coordinator partition. Partition leaders verify the epoch in the 
> WriteTxnMarker request and ensure that it can only increase. However, when 
> producer state expires, we stop tracking the epoch and it is possible for 
> monotonicity to be violated. Generally we expect expiration to be on the 
> order of days, so it should be unlikely for this to be a problem.
>
> At least that is the theory. We observed a case where a coordinator epoch 
> decreased between nearly consecutive writes within a couple of minutes of 
> each other. Upon investigation, we found that producer state had been 
> incorrectly expired. We believe the sequence of events is the following:
>  1. Producer writes transactional data and fails before committing.
>  2. Coordinator times out the transaction and writes ABORT markers.
>  3. Upon seeing the ABORT and the bumped epoch, the partition leader deletes 
>     state from the last epoch, which effectively resets the last timestamp 
>     for the producer to -1.
>  4. The coordinator becomes a zombie before getting a successful response 
>     and continues trying to send.
>  5. The new coordinator notices the incomplete transaction and also sends 
>     markers.
>  6. The partition leader accepts the write from the new coordinator.
>  7. The producer state is expired because the last timestamp was -1.
>  8. The partition leader accepts the write from the old coordinator.
>
> Basically it takes an alignment of planets to hit this bug, but it is 
> possible. If you hit it, then the broker may be unable to start because we 
> validate epoch monotonicity during log recovery. The problem is in step 3, 
> when the timestamp gets reset. We should use the timestamp from the marker 
> instead.
>  
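The sequence of events described above can be replayed in a small toy model. This is only a sketch under stated assumptions, not Kafka's actual broker code: the class `PartitionLeader`, its methods, the `use_marker_timestamp` switch, the 7-day expiration window, and the coordinator epochs 5 and 6 are all hypothetical names and values chosen for illustration. It shows how a last timestamp of -1 lets expiration drop the tracked coordinator epoch, and how taking the timestamp from the marker keeps the zombie coordinator fenced.

```python
from dataclasses import dataclass
from typing import Dict, List

NO_TIMESTAMP = -1
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # assumed expiration window


@dataclass
class ProducerEntry:
    producer_epoch: int
    coordinator_epoch: int
    last_timestamp: int


class PartitionLeader:
    """Toy model of per-producer state on a partition leader (hypothetical)."""

    def __init__(self, use_marker_timestamp: bool) -> None:
        self.use_marker_timestamp = use_marker_timestamp
        self.producers: Dict[int, ProducerEntry] = {}
        self.marker_epochs: List[int] = []  # coordinator epochs, in log order

    def append_data(self, pid: int, producer_epoch: int, timestamp: int) -> None:
        entry = self.producers.get(pid)
        coordinator_epoch = entry.coordinator_epoch if entry else -1
        self.producers[pid] = ProducerEntry(producer_epoch, coordinator_epoch, timestamp)

    def write_txn_marker(self, pid: int, producer_epoch: int,
                         coordinator_epoch: int, timestamp: int) -> bool:
        entry = self.producers.get(pid)
        # Fence zombie coordinators: the coordinator epoch must not decrease.
        # (Producer epoch fencing is omitted to keep the sketch short.)
        if entry is not None and coordinator_epoch < entry.coordinator_epoch:
            return False
        if self.use_marker_timestamp:
            last_ts = timestamp  # proposed fix: take the timestamp from the marker
        elif entry is None or producer_epoch > entry.producer_epoch:
            last_ts = NO_TIMESTAMP  # bug: old-epoch state dropped, timestamp lost
        else:
            last_ts = entry.last_timestamp
        self.producers[pid] = ProducerEntry(producer_epoch, coordinator_epoch, last_ts)
        self.marker_epochs.append(coordinator_epoch)
        return True

    def expire(self, now: int) -> None:
        # Drop producers whose last timestamp is older than the window.
        self.producers = {
            pid: e for pid, e in self.producers.items()
            if now - e.last_timestamp < SEVEN_DAYS_MS
        }


def run_scenario(use_marker_timestamp: bool) -> List[int]:
    t0 = 1_600_000_000_000  # arbitrary wall-clock time in milliseconds
    leader = PartitionLeader(use_marker_timestamp)
    leader.append_data(pid=1, producer_epoch=0, timestamp=t0)   # step 1
    leader.write_txn_marker(1, 1, 5, t0 + 60_000)               # steps 2-3
    leader.write_txn_marker(1, 1, 6, t0 + 90_000)               # steps 5-6
    leader.expire(now=t0 + 120_000)                             # step 7
    leader.write_txn_marker(1, 1, 5, t0 + 150_000)              # steps 4 and 8
    return leader.marker_epochs


def is_monotonic(epochs: List[int]) -> bool:
    return all(a <= b for a, b in zip(epochs, epochs[1:]))
```

In this model, `run_scenario(False)` yields the coordinator epochs `[5, 6, 5]`, which fails `is_monotonic`, while `run_scenario(True)` yields `[5, 6]` because the retried write from the old coordinator is rejected.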



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
