[ 
https://issues.apache.org/jira/browse/KAFKA-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-13727.
-------------------------------------
    Fix Version/s: 2.8.2
                   3.1.1
                   3.0.2
       Resolution: Fixed

> Edge case in cleaner can result in premature removal of ABORT marker
> --------------------------------------------------------------------
>
>                 Key: KAFKA-13727
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13727
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 2.8.2, 3.1.1, 3.0.2
>
>
> The log cleaner works by first building a map of the active keys beginning 
> from the dirty offset, and then scanning forward from the beginning of the 
> log to decide which records should be retained based on whether they are 
> included in the map. The map of keys has a limited size. As soon as it fills 
> up, we stop building it. The offset corresponding to the last record that was 
> included in the map becomes the next dirty offset. Then when we are cleaning, 
> we stop scanning forward at the dirty offset. Or to be more precise, we 
> continue scanning until the end of the segment which includes the dirty 
> offset, but all records above that offset are copied as is without checking 
> the map of active keys. 
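> Here is a rough sketch of those two phases (illustrative Scala only; the names and 
> types are simplified and are not the actual kafka.log.LogCleaner code):
>
>   import scala.collection.mutable
>
>   // Phase 1: build a bounded map of key -> latest offset, starting at the dirty offset.
>   // Building stops as soon as the map fills up; the offset just past the last record
>   // that made it into the map becomes the next dirty offset.
>   def buildOffsetMap(records: Seq[(Long, String)],
>                      dirtyOffset: Long,
>                      maxKeys: Int): (Map[String, Long], Long) = {
>     val latest = mutable.Map.empty[String, Long]
>     var nextDirty = dirtyOffset
>     records.iterator
>       .filter { case (offset, _) => offset >= dirtyOffset }
>       .takeWhile { case (_, key) => latest.size < maxKeys || latest.contains(key) }
>       .foreach { case (offset, key) =>
>         latest(key) = offset
>         nextDirty = offset + 1
>       }
>     (latest.toMap, nextDirty)
>   }
>
>   // Phase 2: scanning from the start of the log, a record below the new dirty offset
>   // is kept only if it still holds the latest offset for its key; records at or above
>   // that offset are copied through unchanged.
>   def shouldRetain(latest: Map[String, Long], newDirtyOffset: Long,
>                    offset: Long, key: String): Boolean =
>     offset >= newDirtyOffset || latest.get(key).contains(offset)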
> Compaction is complicated by the presence of transactions. The cleaner must 
> keep track of which transactions have data remaining so that it can tell when 
> it is safe to remove the respective markers. It works a bit like the 
> consumer. Before scanning a segment, the cleaner consults the aborted 
> transaction index to figure out which transactions have been aborted. All 
> other transactions are considered committed.
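> A rough model of that bookkeeping (the type and helper below are assumptions made 
> for illustration, not Kafka's actual classes):
>
>   // An aborted transaction as recorded in the index: the producer it belongs to and
>   // the offset of its first record.
>   case class AbortedTxn(producerId: Long, firstOffset: Long)
>
>   // Producers whose aborted transactions begin within the consulted offset range; any
>   // transaction not returned here is treated as committed.
>   def abortedProducerIds(index: Seq[AbortedTxn],
>                          startOffset: Long,
>                          upperBoundOffset: Long): Set[Long] =
>     index.filter(t => t.firstOffset >= startOffset && t.firstOffset <= upperBoundOffset)
>          .map(_.producerId)
>          .toSet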
> The problem we have found is that the cleaner does not take into account the 
> range of offsets between the dirty offset and the end offset of the segment 
> containing it when querying ahead for aborted transactions. This means that 
> when the cleaner is scanning forward from the dirty offset, it does not have 
> the complete set of aborted transactions. The main consequence of this is 
> that abort markers associated with transactions which start within this range 
> of offsets become eligible for deletion even before the corresponding data 
> has been removed from the log.
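> Sketched in terms of the helper above, the lookup's upper bound stops at the dirty 
> offset even though the scan itself runs to the end of the segment containing it 
> (again an illustration of the behavior described here, not the actual cleaner code):
>
>   // What the cleaner consults: aborted transactions up to the dirty offset only.
>   def queriedAborts(index: Seq[AbortedTxn], startOffset: Long, dirtyOffset: Long): Set[Long] =
>     abortedProducerIds(index, startOffset, dirtyOffset)
>
>   // What the scan actually covers: up to the last offset of the containing segment.
>   // Any ABORT whose transaction starts in (dirtyOffset, segmentLastOffset] is missed
>   // by the query above, so its marker looks like it belongs to an empty transaction.
>   def scannedAborts(index: Seq[AbortedTxn], startOffset: Long, segmentLastOffset: Long): Set[Long] =
>     abortedProducerIds(index, startOffset, segmentLastOffset)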
> Here is an example. Suppose that the log contains the following entries:
> offset=0, key=a
> offset=1, key=b
> offset=2, COMMIT
> offset=3, key=c
> offset=4, key=d
> offset=5, COMMIT
> offset=6, key=b
> offset=7, ABORT
> Suppose we have an offset map which can only contain 2 keys and the dirty 
> offset starts at 0. The first time we scan forward, we will build a map with 
> keys a and b, which will allow us to move the dirty offset up to 3. Due to 
> the issue documented here, we will not detect the aborted transaction 
> starting at offset 6. But it will not be eligible for deletion on this round 
> of cleaning because it is bound by `delete.retention.ms`. Instead, our new 
> logic will set the deletion horizon for this batch to the current time 
> plus the configured `delete.retention.ms`.
> offset=0, key=a
> offset=1, key=b
> offset=2, COMMIT
> offset=3, key=c
> offset=4, key=d
> offset=5, COMMIT
> offset=6, key=b
> offset=7, ABORT (deleteHorizon: N)
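> For reference, a minimal sketch of the two conditions in play here, the horizon 
> stamped on the marker batch and the emptiness check mentioned earlier (simplified 
> names, not the actual cleaner code):
>
>   // First pass: the marker is not removed; its batch is stamped with a delete horizon.
>   def stampDeleteHorizon(nowMs: Long, deleteRetentionMs: Long): Long =
>     nowMs + deleteRetentionMs
>
>   // Later passes: the marker can be dropped only once the transaction appears to have
>   // no data remaining in the log AND the stamped horizon has passed.
>   def canDropAbortMarker(txnHasRemainingData: Boolean, deleteHorizonMs: Long, nowMs: Long): Boolean =
>     !txnHasRemainingData && nowMs >= deleteHorizonMs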
> Suppose that the time reaches N+1 before the next cleaning. We will begin 
> from the dirty offset of 3 and collect keys c and d before stopping at offset 
> 6. Again, we will not detect the aborted transaction beginning at offset 6 
> since it falls outside the queried range. This time when we scan, the marker at offset 7 
> will be deleted because the transaction will be seen as empty and now the 
> deletion horizon has passed. So we end up with this state:
> offset=0, key=a
> offset=1, key=b
> offset=2, COMMIT
> offset=3, key=c
> offset=4, key=d
> offset=5, COMMIT
> offset=6, key=b
> Effectively it becomes a hanging transaction. The interesting thing is that 
> we might not even detect it. As far as the leader is concerned, it had 
> already completed that transaction, so it is not expecting any additional 
> markers. The transaction index would have been rewritten without the aborted 
> transaction when the log was cleaned, so any consumer fetching the data would 
> see the transaction as committed. On the other hand, if we did a reassignment 
> to a new replica, or if we had to rebuild the full log state during recovery, 
> then we would suddenly detect it.
> I am not sure how likely this scenario is in practice. I think it's fair to 
> say it is an extremely rare case. The cleaner has to fail to clean a full 
> segment at least two times and you still need enough time to pass for the 
> marker's deletion horizon to be reached. Perhaps it is possible if the 
> cardinality of keys is very high and the configured memory limit for the 
> cleaner is low.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
