Jason Gustafson created KAFKA-13727:
---------------------------------------
Summary: Edge case in cleaner can result in premature removal of
ABORT marker
Key: KAFKA-13727
URL: https://issues.apache.org/jira/browse/KAFKA-13727
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson
The log cleaner works by first building a map of the active keys beginning from
the dirty offset, and then scanning forward from the beginning of the log to
decide which records should be retained based on whether they are included in
the map. The map of keys has a limited size. As soon as it fills up, we stop
building it. The offset corresponding to the last record that was included in
the map becomes the next dirty offset. Then when we are cleaning, we stop
scanning forward at the dirty offset. Or to be more precise, we continue
scanning until the end of the segment which includes the dirty offset, but all
records above that offset are coped as is without checking the map of active
keys.
Compaction is complicated by the presence of transactions. The cleaner must
keep track of which transactions have data remaining so that it can tell when
it is safe to remove the respective markers. It works a bit like the consumer.
Before scanning a segment, the cleaner consults the aborted transaction index
to figure out which transactions have been aborted. All other transactions are
considered committed.
The problem we have found is that the cleaner does not take into account the
range of offsets between the dirty offset and the end offset of the segment
containing it when querying ahead for aborted transactions. This means that
when the cleaner is scanning forward from the dirty offset, it does not have
the complete set of aborted transactions. The main consequence of this is that
abort markers associated with transactions which start within this range of
offsets become eligible for deletion even before the corresponding data has
been removed from the log.
Here is an example. Suppose that the log contains the following entries:
offset=0, key=1
offset=1, key=2
offset=2, COMMIT
offset=3, key=3
offset=4, key=4
offset=5, COMMIT
offset=6, key=2
offset=7, ABORT
Suppose we have an offset map which can only contain 2 keys and the dirty
offset starts at 0. The first time we scan forward, we will build a map with
keys 1 and 2, which will allow us to move the dirty offset up to 3. Due to the
issue documented here, we will not detect the aborted transaction starting at
offset 6. But it will not be eligible for deletion on this round of cleaning
because it is bound by `delete.retention.ms`. Instead, our new logic will set
the deletion horizon for this batch based to the current time plus the
configured `delete.retention.ms`.
offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b
offset=7, ABORT (deleteHorizon: N)
Suppose that the time reaches N+1 before the next cleaning. We will begin from
the dirty offset of 3 and collect keys c and d before stopping at offset 6.
Again, we will not detect the aborted transaction beginning at offset 6 since
it is out of the range. This time when we scan, the marker at offset 7 will be
deleted because the transaction will be seen as empty and now the deletion
horizon has passed. So we end up with this state:
offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b
Effectively it becomes a hanging transaction. The interesting thing is that we
might not even detect it. As far as the leader is concerned, it had already
completed that transaction, so it is not expecting any additional markers. The
transaction index would have been rewritten without the aborted transaction
when the log was cleaned, so any consumer fetching the data would see the
transaction as committed. On the other hand, if we did a reassignment to a new
replica, or if we had to rebuild the full log state during recovery, then we
would suddenly detect it.
I am not sure how likely this scenario is in practice. I think it's fair to say
it is an extremely rare case. The cleaner has to fail to clean a full segment
at least two times and you still need enough time to pass for the marker's
deletion horizon to be reached. Perhaps it is possible if the cardinality of
keys is very high and the configured memory limit for the cleaner is low.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)