Jason Gustafson created KAFKA-13727:
---------------------------------------

             Summary: Edge case in cleaner can result in premature removal of 
ABORT marker
                 Key: KAFKA-13727
                 URL: https://issues.apache.org/jira/browse/KAFKA-13727
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


The log cleaner works by first building a map of the active keys beginning from 
the dirty offset, and then scanning forward from the beginning of the log to 
decide which records should be retained based on whether they are included in 
the map. The map of keys has a limited size. As soon as it fills up, we stop 
building it. The offset corresponding to the last record that was included in 
the map becomes the next dirty offset. Then, when we are cleaning, we stop 
scanning forward at the dirty offset. Or to be more precise, we continue 
scanning until the end of the segment which includes the dirty offset, but all 
records above that offset are copied as-is without checking the map of active 
keys.
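
To make the two phases concrete, here is a minimal sketch of that flow. The 
names and types are illustrative, not the actual LogCleaner code, and 
transaction markers are ignored for now:

import java.util.*;

class CleanerSketch {
    record Entry(long offset, String key) {}

    // Phase 1: build a bounded map of key -> latest offset, starting at the
    // dirty offset. Returns the dirty offset for the next cleaning round.
    static long buildOffsetMap(List<Entry> log, long dirtyOffset, int maxEntries,
                               Map<String, Long> map) {
        long nextDirtyOffset = dirtyOffset;
        for (Entry e : log) {
            if (e.offset() < dirtyOffset)
                continue;
            if (map.size() >= maxEntries && !map.containsKey(e.key()))
                break;                        // the map is full: stop building here
            map.put(e.key(), e.offset());
            nextDirtyOffset = e.offset() + 1;
        }
        return nextDirtyOffset;
    }

    // Phase 2: scan from the start of the log, dropping records shadowed by a
    // newer offset in the map; records above the map's range are copied as-is.
    static List<Entry> clean(List<Entry> log, Map<String, Long> map) {
        List<Entry> retained = new ArrayList<>();
        for (Entry e : log) {
            Long latest = map.get(e.key());
            if (latest == null || e.offset() >= latest)
                retained.add(e);
        }
        return retained;
    }
}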

Compaction is complicated by the presence of transactions. The cleaner must 
keep track of which transactions have data remaining so that it can tell when 
it is safe to remove the respective markers. It works a bit like the consumer. 
Before scanning a segment, the cleaner consults the aborted transaction index 
to figure out which transactions have been aborted. All other transactions are 
considered committed.
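
Continuing the sketch above, a hypothetical stand-in for that index lookup 
(Kafka's real API differs; the types here are illustrative):

    record AbortedTxn(long producerId, long firstOffset, long lastOffset) {}

    // Before scanning a segment, collect the producer ids of transactions that
    // the abort index reports as overlapping the offset range to be scanned.
    // Any transactional batch whose producer id is absent from this set is
    // treated as committed.
    static Set<Long> collectAborted(List<AbortedTxn> index, long from, long to) {
        Set<Long> aborted = new HashSet<>();
        for (AbortedTxn txn : index)
            if (txn.lastOffset() >= from && txn.firstOffset() <= to)
                aborted.add(txn.producerId());
        return aborted;
    }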

The problem we have found is that the cleaner does not take into account the 
range of offsets between the dirty offset and the end offset of the segment 
containing it when querying ahead for aborted transactions. This means that 
when the cleaner is scanning forward from the dirty offset, it does not have 
the complete set of aborted transactions. The main consequence of this is that 
abort markers associated with transactions which start within this range of 
offsets become eligible for deletion even before the corresponding data has 
been removed from the log.
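
In terms of the sketch above, the bug amounts to passing the wrong upper bound 
to the lookup: the cleaner scans up to the end of the segment containing the 
dirty offset, but only queries for aborts up to the dirty offset itself 
(variable names are illustrative):

Set<Long> aborted = collectAborted(index, segmentBaseOffset, dirtyOffset);       // what happens today
Set<Long> needed  = collectAborted(index, segmentBaseOffset, segmentEndOffset);  // what the scan requires

Transactions that begin in the gap between those two bounds, like the one at 
offsets 6 and 7 in the example below, are invisible to the scan.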

Here is an example. Suppose that the log contains the following entries:

offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b
offset=7, ABORT

Suppose we have an offset map which can only contain 2 keys and the dirty 
offset starts at 0. The first time we scan forward, we will build a map with 
keys a and b, which will allow us to move the dirty offset up to 3. Due to the 
issue documented here, we will not detect the aborted transaction starting at 
offset 6. But the marker will not be eligible for deletion on this round of 
cleaning because it is bound by `delete.retention.ms`. Instead, our new logic 
will set the deletion horizon for this batch to the current time plus the 
configured `delete.retention.ms`.
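
In sketch form, all that happens to the marker on this round is the horizon 
stamp (deleteHorizonMs is the "N" in the listing that follows; 86,400,000 ms 
is the default `delete.retention.ms` of 24 hours):

// Round 1: the marker at offset 7 is retained; the new logic stamps its batch
// with a deletion horizon instead of removing it.
long deleteRetentionMs = 86_400_000L;  // delete.retention.ms
long deleteHorizonMs = System.currentTimeMillis() + deleteRetentionMs;  // "N"

After this round, the log looks like this: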

offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b
offset=7, ABORT (deleteHorizon: N)

Suppose that the time reaches N+1 before the next cleaning. We will begin from 
the dirty offset of 3 and collect keys c and d before stopping at offset 6. 
Again, we will not detect the aborted transaction beginning at offset 6 since 
it is outside the queried range. This time when we scan, the marker at offset 7 
will be deleted: the transaction is seen as empty and its deletion horizon has 
now passed (a rough sketch of this check follows the listing). So we end up 
with this state:

offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b
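
The removal decision that fires on this second pass looks roughly like the 
following (the names are hypothetical):

// Round 2: no retained data is attributed to the transaction, because the
// abort spanning offsets 6-7 is missing from the cleaner's view, and the
// stamped horizon has passed (the time is N+1), so the ABORT marker is dropped.
boolean txnAppearsEmpty = retainedRecordsForTxn == 0;
boolean horizonPassed = System.currentTimeMillis() > deleteHorizonMs;
if (txnAppearsEmpty && horizonPassed)
    dropMarker();  // leaves offset 6 behind with no completion marker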

Effectively it becomes a hanging transaction. The interesting thing is that we 
might not even detect it. As far as the leader is concerned, it had already 
completed that transaction, so it is not expecting any additional markers. The 
transaction index would have been rewritten without the aborted transaction 
when the log was cleaned, so any consumer fetching the data would see the 
transaction as committed. On the other hand, if we did a reassignment to a new 
replica, or if we had to rebuild the full log state during recovery, then we 
would suddenly detect it.

I am not sure how likely this scenario is in practice. I think it's fair to say 
it is an extremely rare case: the cleaner has to fail to clean a full segment 
at least twice, and enough time still has to pass for the marker's deletion 
horizon to be reached. Perhaps it is possible if the cardinality of keys is 
very high and the configured memory limit for the cleaner is low.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
