[ https://issues.apache.org/jira/browse/KAFKA-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson resolved KAFKA-13727.
-------------------------------------
    Fix Version/s: 2.8.2
                   3.1.1
                   3.0.2
       Resolution: Fixed

Edge case in cleaner can result in premature removal of ABORT marker
--------------------------------------------------------------------

                Key: KAFKA-13727
                URL: https://issues.apache.org/jira/browse/KAFKA-13727
            Project: Kafka
         Issue Type: Bug
           Reporter: Jason Gustafson
           Assignee: Jason Gustafson
           Priority: Major
            Fix For: 2.8.2, 3.1.1, 3.0.2

The log cleaner works by first building a map of the active keys beginning from the dirty offset, and then scanning forward from the beginning of the log to decide which records should be retained based on whether they are included in the map. The map of keys has a limited size; as soon as it fills up, we stop building it. The offset corresponding to the last record that was included in the map becomes the next dirty offset. Then, when we are cleaning, we stop scanning forward at the dirty offset. Or, to be more precise, we continue scanning until the end of the segment which includes the dirty offset, but all records above that offset are copied as-is without checking the map of active keys.

Compaction is complicated by the presence of transactions. The cleaner must keep track of which transactions still have data remaining so that it can tell when it is safe to remove the respective markers. It works a bit like the consumer: before scanning a segment, the cleaner consults the aborted transaction index to figure out which transactions have been aborted. All other transactions are considered committed.

The problem we have found is that the cleaner does not take into account the range of offsets between the dirty offset and the end offset of the segment containing it when querying ahead for aborted transactions. This means that when the cleaner is scanning forward from the dirty offset, it does not have the complete set of aborted transactions. The main consequence is that abort markers associated with transactions which start within this range of offsets become eligible for deletion even before the corresponding data has been removed from the log.

Here is an example. Suppose that the log contains the following entries:

offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b
offset=7, ABORT

Suppose we have an offset map which can only contain 2 keys and the dirty offset starts at 0. The first time we scan forward, we will build a map with keys a and b, which allows us to move the dirty offset up to 3. Due to the issue documented here, we will not detect the aborted transaction starting at offset 6. But it will not be eligible for deletion on this round of cleaning because it is bound by `delete.retention.ms`. Instead, our new logic will set the deletion horizon for this batch to the current time plus the configured `delete.retention.ms`:

offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b
offset=7, ABORT (deleteHorizon: N)

Suppose that the time reaches N+1 before the next cleaning. We will begin from the dirty offset of 3 and collect keys c and d before stopping at offset 6. Again, we will not detect the aborted transaction beginning at offset 6, since it is outside the queried range.
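To make the range problem concrete, here is a minimal, self-contained sketch. The case class, the method names, and the flattened "index" below are simplifications invented for illustration, not the cleaner's actual API; the point is only that a lookup bounded by the dirty offset misses the abort spanning offsets 6-7, while a lookup bounded by the end of the segment finds it.

{code:scala}
// Toy model of the aborted-transaction lookup; names and types are hypothetical.
object AbortLookupSketch extends App {
  final case class AbortedTxn(producerId: Long, firstOffset: Long, lastOffset: Long)

  // Return the aborted transactions whose offset range overlaps [from, to).
  def collectAborted(index: Seq[AbortedTxn], from: Long, to: Long): Seq[AbortedTxn] =
    index.filter(txn => txn.lastOffset >= from && txn.firstOffset < to)

  // The example log above: the only abort spans offsets 6..7.
  val index = Seq(AbortedTxn(producerId = 1L, firstOffset = 6L, lastOffset = 7L))

  val dirtyOffset      = 3L  // where the offset map stopped on the previous pass
  val segmentEndOffset = 8L  // end of the segment that contains the dirty offset

  // Behaviour described in this ticket: the query is bounded by the dirty offset,
  // so the abort starting at offset 6 is never seen by the cleaning pass.
  println(collectAborted(index, from = 0L, to = dirtyOffset))       // List()

  // Bounding the query by the end of the segment (whose tail is still copied
  // during cleaning) makes the abort visible again.
  println(collectAborted(index, from = 0L, to = segmentEndOffset))  // List(AbortedTxn(1,6,7))
}
{code}

With the narrower bound, the cleaner proceeds as if no aborted transaction overlaps the tail of the segment.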
This time when we scan, the marker at offset 7 will be deleted, because the transaction will be seen as empty and the deletion horizon has now passed. So we end up with this state:

offset=0, key=a
offset=1, key=b
offset=2, COMMIT
offset=3, key=c
offset=4, key=d
offset=5, COMMIT
offset=6, key=b

Effectively, it becomes a hanging transaction. The interesting thing is that we might not even detect it. As far as the leader is concerned, it had already completed that transaction, so it is not expecting any additional markers. The transaction index would have been rewritten without the aborted transaction when the log was cleaned, so any consumer fetching the data would see the transaction as committed. On the other hand, if we did a reassignment to a new replica, or if we had to rebuild the full log state during recovery, then we would suddenly detect it.

I am not sure how likely this scenario is in practice. I think it's fair to say it is an extremely rare case: the cleaner has to fail to clean a full segment at least twice, and enough time still needs to pass for the marker's deletion horizon to be reached. Perhaps it is possible if the cardinality of keys is very high and the configured memory limit for the cleaner is low.
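The decision that actually drops the marker during the second pass can be sketched in the same toy style. The two-condition rule and the names below are illustrative assumptions rather than the cleaner's real code path; they only capture the idea that an abort marker survives a cleaning pass while the cleaner either still knows of retained data for that transaction or the marker's delete horizon has not yet passed.

{code:scala}
// Toy model of the marker-retention decision; not Kafka's actual code.
object MarkerRetentionSketch extends App {
  // A transaction is identified by its producerId in this sketch.
  final case class Marker(producerId: Long, offset: Long, deleteHorizonMs: Long)

  // Keep an abort marker if either condition holds:
  //  * the cleaner still knows of retained data for that transaction, or
  //  * the marker's delete horizon has not yet passed.
  def retainMarker(marker: Marker,
                   producersWithRetainedData: Set[Long],
                   nowMs: Long): Boolean =
    producersWithRetainedData.contains(marker.producerId) || nowMs < marker.deleteHorizonMs

  val abortAtOffset7 = Marker(producerId = 1L, offset = 7L, deleteHorizonMs = 100L)

  // As described in the ticket: the abort was never returned by the narrow index
  // query, so the record at offset 6 is not attributed to producer 1 and the
  // transaction looks empty. Once the horizon passes, the marker is dropped.
  println(retainMarker(abortAtOffset7, producersWithRetainedData = Set.empty, nowMs = 101L)) // false

  // Had the lookup covered the tail of the segment, offset 6 would count as
  // retained data for producer 1 and the marker would survive this pass.
  println(retainMarker(abortAtOffset7, producersWithRetainedData = Set(1L), nowMs = 101L))   // true
}
{code}

In the scenario above, the narrow lookup means both conditions fail at time N+1, which is exactly when the marker at offset 7 disappears.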