ocadaruma opened a new pull request, #14242: URL: https://github.com/apache/kafka/pull/14242
JIRA ticket: https://issues.apache.org/jira/browse/KAFKA-15046

Any blocking operation performed while holding `UnifiedLog.lock` can lead to serious performance (even availability) issues, yet there are currently several paths that call fsync(2) inside the lock:
* While the lock is held, all subsequent produces against the partition may block.
* This easily keeps all request handlers busy when disk performance is bad.
* Even worse, when a disk experiences tens of seconds of glitches (not rare on spinning drives), the broker becomes unable to process any requests while remaining unfenced from the cluster (i.e. a "zombie"-like state).

This PR gets rid of 4 cases of essentially unnecessary fsync(2) calls performed under the lock:
* (1) `ProducerStateManager.takeSnapshot` at `UnifiedLog.roll`
  - The fsync(2) call is moved to the scheduler thread as part of the existing "flush-log" job (before incrementing the recovery point).
  - Since the snapshot is still guaranteed to be flushed before the recovery point is incremented, this change shouldn't cause any problems.
* (2) `ProducerStateManager.removeAndMarkSnapshotForDeletion` as part of log segment deletion
  - This method internally calls `Utils.atomicMoveWithFallback` with `needFlushParentDir = true`, which calls fsync.
  - It is changed to call `Utils.atomicMoveWithFallback` with `needFlushParentDir = false`, which is consistent with index file deletion (index file deletion doesn't flush the parent dir either).
  - This change shouldn't cause problems either.
* (3) `LeaderEpochFileCache.truncateFromStart` when incrementing the log start offset
  - This path is called from deleteRecords on request-handler threads.
  - We don't actually need fsync(2) here either.
  - On unclean shutdown, a few stale leader epochs might remain in the file, but they will be [handled by LogLoader](https://github.com/apache/kafka/blob/3f4816dd3eafaf1a0636d3ee689069f897c99e28/core/src/main/scala/kafka/log/LogLoader.scala#L185) on start-up, so this is not a problem.
* (4) `LeaderEpochFileCache.truncateFromEnd` as part of log truncation
  - Likewise, we don't need fsync(2) here, since any epochs left untruncated by an unclean shutdown will be handled by the log loading procedure.

Please refer to the JIRA ticket for further details and the performance experiment results.

To check that these changes don't cause problems, the consistency expectation table below is helpful:

| No | File | Consistency expectation | Description | Note |
|:---|:---|:---|:---|:---|
| 1 | ProducerStateSnapshot | Snapshot files on the disk before the recovery point should be consistent with the log segments | - On restart after unclean shutdown, Kafka skips the snapshot recovery procedure before the recovery point.<br>- If the snapshot content before the recovery point is not consistent with the log, it causes problems such as idempotency violations due to missing producer state. | Inconsistency after the recovery point is acceptable, because it will be recovered to a consistent state by the log loading procedure. |
| 2 | ProducerStateSnapshot | Deleted snapshot files on the disk should be eventually consistent with the log segments | - On log segment deletion for any reason (e.g. retention, topic deletion), the corresponding snapshot files are deleted.<br>- Even when the broker crashes due to power failure before the files are deleted from the actual disk, they should eventually be deleted from the disk. | |
| 3 | LeaderEpochCheckpoint | Every leader epoch entry (i.e. epoch and start offset) in the log segments must also exist in the leader-epoch checkpoint file on the disk | - If some epoch entries are missing from the checkpoint file, then upon restart after power failure Kafka may restore a stale leader epoch cache.<br>- That cache would return wrong entries when read (e.g. in list-offsets request handling). | On the other hand, surplus entries (prefixes or suffixes) in the checkpoint file on the disk are acceptable, because even on restart after power failure they are truncated anyway by the log loading procedure. |

We can confirm that the changes are valid based on the above table:
- Change (1): fsync for `ProducerStateManager.takeSnapshot` is moved to the async scheduler thread.
  * This preserves consistency expectation No.1, since we still increment the recovery point only after the fsync is performed.
- Change (2): `ProducerStateManager.removeAndMarkSnapshotForDeletion` no longer flushes the parent dir.
  * This preserves consistency expectation No.2, since snapshot files will still be deleted eventually, even after a power failure.
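Change (2) hinges on what `needFlushParentDir` controls. The real code is Kafka's `Utils.atomicMoveWithFallback` (Java); below is a minimal, hypothetical Python sketch of the underlying POSIX pattern, just to illustrate which fsync(2) call the segment-deletion path now skips:

```python
import os

def atomic_move(src: str, dst: str, flush_parent_dir: bool) -> None:
    """Atomically rename src to dst; optionally fsync the parent
    directory so the rename itself survives a power failure.
    (Simplified stand-in for Utils.atomicMoveWithFallback.)"""
    os.replace(src, dst)  # atomic rename on POSIX filesystems
    if flush_parent_dir:
        # fsync(2) on the directory persists the directory entry.
        # This is the blocking call the PR avoids under the lock.
        dirfd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)
```

With `flush_parent_dir=False` the rename is still atomic, it just may be lost on power failure, which Scenarios A and B below argue is harmless for snapshot deletion.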
  - Scenario A: Snapshot files are renamed with the `-deleted` suffix for segment deletion by log retention, but the broker crashes due to power failure before the rename is persisted to the disk.
    * In this case, some snapshot files' names would revert to the form without the `-deleted` suffix, so ProducerStateManager would load these snapshot files unnecessarily.
    * However, since producer state is truncated based on the log segments during the log loading procedure, this is not a problem.
  - Scenario B: Snapshot files are renamed with the `-deleted` suffix for topic deletion, but the broker crashes due to power failure before the rename is persisted to the disk.
    * In this case too, the snapshot files' names would revert to the form without the `-deleted` suffix.
    * However, on topic deletion the parent log dir has already been renamed with the `-delete` suffix and fsynced anyway, so the reverted snapshot files are not a problem. The parent log dirs will be deleted anyway when the topic deletion procedure resumes.
- Change (3): `LeaderEpochFileCache.truncateFromStart` doesn't call fsync.
  * This preserves consistency expectation No.3, since we still call fsync in `LeaderEpochFileCache.assign`.
- Change (4): `LeaderEpochFileCache.truncateFromEnd` doesn't call fsync.
  * Same explanation as Change (3).

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
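The ordering argument behind Change (1) can be sketched compactly. This is a hypothetical Python model (the class and method names are illustrative, not Kafka's actual API) of the invariant that preserves consistency expectation No.1: the snapshot must be durable before the recovery point advances, but neither step needs to happen under the log lock:

```python
import os
import threading

class Log:
    """Toy model of the roll / flush-log interaction after Change (1)."""

    def __init__(self, dir_path: str):
        self.dir = dir_path
        self.lock = threading.Lock()
        self.recovery_point = 0
        self.unflushed_snapshot = None

    def roll(self, snapshot_offset: int, state: str) -> None:
        # Under the lock: write the snapshot but do NOT fsync it here.
        with self.lock:
            path = os.path.join(self.dir, f"{snapshot_offset:020d}.snapshot")
            with open(path, "w") as f:
                f.write(state)
            self.unflushed_snapshot = path

    def flush_job(self, new_recovery_point: int) -> None:
        # On the scheduler thread (the existing "flush-log" job),
        # off the request-handler hot path:
        snap = self.unflushed_snapshot
        if snap is not None:
            fd = os.open(snap, os.O_RDONLY)
            try:
                os.fsync(fd)  # the blocking fsync(2) now happens here
            finally:
                os.close(fd)
        # Only after the snapshot is durable do we advance the recovery
        # point, so recovery after unclean shutdown never skips the
        # region covered by an unflushed snapshot.
        self.recovery_point = new_recovery_point
```

Because the recovery point is advanced strictly after the fsync, a crash between `roll` and `flush_job` leaves the recovery point behind the unflushed snapshot, and log loading re-derives the producer state from the segments.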
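Changes (3) and (4) both rest on the asymmetry in consistency expectation No.3: losing a *truncation* to a power failure only leaves surplus entries on disk (which log loading truncates anyway), while losing an *assign* would drop an entry the segments still reference. A simplified, hypothetical sketch of a leader-epoch checkpoint illustrating that asymmetry (Kafka's real truncation logic is more involved, e.g. it clips the boundary entry):

```python
import os

class LeaderEpochCheckpoint:
    """Toy checkpoint: assign() must be durable, truncation need not be."""

    def __init__(self, path: str):
        self.path = path
        self.entries = []  # list of (epoch, start_offset)

    def _write(self, sync: bool) -> None:
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            for epoch, offset in self.entries:
                f.write(f"{epoch} {offset}\n")
            if sync:
                f.flush()
                os.fsync(f.fileno())  # durable before the rename below
        os.replace(tmp, self.path)

    def assign(self, epoch: int, start_offset: int) -> None:
        self.entries.append((epoch, start_offset))
        self._write(sync=True)   # expectation No.3: no entry may be lost

    def truncate_from_start(self, start_offset: int) -> None:
        self.entries = [(e, o) for e, o in self.entries if o >= start_offset]
        self._write(sync=False)  # a surplus prefix on disk is acceptable
```

If the unsynced truncation is lost, the on-disk file merely contains extra leading entries, exactly the "surplus prefix" case the table marks as acceptable.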