ocadaruma opened a new pull request, #14242:
URL: https://github.com/apache/kafka/pull/14242

   JIRA ticket: https://issues.apache.org/jira/browse/KAFKA-15046
   
   - Any blocking operation performed while holding the UnifiedLog.lock can lead to serious performance (even availability) issues, yet there are currently several paths that call fsync(2) inside the lock
     * While the lock is held, all subsequent produces against the partition may block
     * On bad disk performance, this easily keeps all request handlers busy
     * Even worse, when a disk experiences a glitch of tens of seconds (not rare with spinning drives), the broker becomes unable to process any requests while remaining unfenced in the cluster (i.e. a "zombie"-like state)
   - This PR gets rid of 4 cases of essentially-unnecessary fsync(2) calls performed under the lock (see the sketches after this list):
     * (1) ProducerStateManager.takeSnapshot at UnifiedLog.roll
         - The fsync(2) call is moved to the scheduler thread as part of the existing "flush-log" job (before incrementing the recovery point)
         - Since the snapshot is still guaranteed to be flushed before the recovery point is incremented, this change shouldn't cause any problem
     * (2) ProducerStateManager.removeAndMarkSnapshotForDeletion as part of log segment deletion
         - This method internally calls `Utils.atomicMoveWithFallback` with `needFlushParentDir = true`, which calls fsync.
         - It is changed to call `Utils.atomicMoveWithFallback` with `needFlushParentDir = false`, which is consistent with index file deletion (index file deletion also doesn't flush the parent dir)
         - This change shouldn't cause problems either.
     * (3) LeaderEpochFileCache.truncateFromStart when incrementing the log-start-offset
         - This path is called from deleteRecords on request-handler threads.
         - We don't actually need fsync(2) here either.
         - On unclean shutdown, a few stale leader epochs might remain in the file, but they will be [handled by LogLoader](https://github.com/apache/kafka/blob/3f4816dd3eafaf1a0636d3ee689069f897c99e28/core/src/main/scala/kafka/log/LogLoader.scala#L185) on start-up, so this is not a problem
     * (4) LeaderEpochFileCache.truncateFromEnd as part of log truncation
         - Likewise, we don't need fsync(2) here, since any epochs left untruncated after an unclean shutdown will be handled by the log loading procedure
   - Please refer to the JIRA ticket for further details and the performance experiment results
   
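   To illustrate change (1), here is a minimal, hypothetical Java sketch; the class and method names are made up for illustration, not the actual Kafka code. The point is the ordering: the snapshot is written under the log lock without fsync, and the existing "flush-log" scheduler job forces it to disk before the recovery point is advanced, so everything below the recovery point stays durable.

   ```java
   import java.io.IOException;
   import java.nio.channels.FileChannel;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   // Hypothetical, simplified model of the deferred snapshot flush (not real Kafka classes).
   class DeferredSnapshotFlushSketch {
       private final Object logLock = new Object();
       private volatile long recoveryPoint = 0L;
       private volatile Path unflushedSnapshot;  // written on roll, not yet forced to disk

       /** UnifiedLog.roll path (request-handler thread, lock held): write the snapshot, no fsync. */
       void takeSnapshotOnRoll(Path snapshotFile) {
           synchronized (logLock) {
               // ... serialize the producer state into snapshotFile here (omitted), without fsync ...
               unflushedSnapshot = snapshotFile;
           }
       }

       /** "flush-log" scheduler job (background thread): fsync first, then advance the recovery point. */
       void flushLogJob(long newRecoveryPoint) throws IOException {
           Path snapshot = unflushedSnapshot;
           if (snapshot != null) {
               try (FileChannel ch = FileChannel.open(snapshot, StandardOpenOption.WRITE)) {
                   ch.force(true);  // snapshot is durable before the recovery point moves
               }
           }
           // ... fsync the log segments up to newRecoveryPoint here (omitted) ...
           recoveryPoint = newRecoveryPoint;  // incremented only after the fsync above
       }
   }
   ```
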
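   Change (2) boils down to passing `needFlushParentDir = false` to `Utils.atomicMoveWithFallback`. A hedged sketch of the idea follows; the helper class, method name, and snapshot path are illustrative, not the actual ProducerStateManager code:

   ```java
   import java.io.IOException;
   import java.nio.file.Path;
   import java.nio.file.Paths;

   import org.apache.kafka.common.utils.Utils;

   // Illustrative helper (not the real ProducerStateManager.removeAndMarkSnapshotForDeletion).
   class SnapshotDeletionSketch {
       /** Rename a snapshot file to mark it for deletion, without fsync-ing the parent directory. */
       static Path markSnapshotForDeletion(Path snapshotFile) throws IOException {
           Path renamed = snapshotFile.resolveSibling(snapshotFile.getFileName() + "-deleted");
           // needFlushParentDir = false: the rename is atomic where supported, but we no longer
           // pay for a directory fsync on the request path. If a power failure "reverts" the
           // rename, the log loading procedure reconciles the producer state anyway.
           Utils.atomicMoveWithFallback(snapshotFile, renamed, false);
           return renamed;
       }

       public static void main(String[] args) throws IOException {
           Path snapshot = Paths.get("/tmp/kafka-logs/topic-0/00000000000000001234.snapshot");
           System.out.println("Marked for deletion: " + markSnapshotForDeletion(snapshot));
       }
   }
   ```

   Skipping the parent-directory flush only means that, after a power failure, the rename may appear not to have happened; Scenarios A and B under change (2) in the verification section below explain why that is harmless.
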
   To check that these changes don't cause problems, the following consistency expectation table is helpful:
   
   | No | File | Consistency expectation | Description | Note |
   |:---|:----------------------|:----------------------------|:----------------------------|:----------------------------|
   | 1 | ProducerStateSnapshot | Snapshot files on the disk before the recovery point should be consistent with the log segments | - On restart after an unclean shutdown, Kafka skips the snapshot recovery procedure before the recovery point.<br>- If the snapshot content before the recovery point is inconsistent with the log, this causes problems such as idempotency violations due to missing producer state. | Inconsistency after the recovery point is acceptable, because it will be recovered to a consistent state during the log loading procedure |
   | 2 | ProducerStateSnapshot | Deleted snapshot files on the disk should be eventually consistent with the log segments | - On log segment deletion for any reason (e.g. retention, topic deletion), the corresponding snapshot files are deleted.<br>- Even if the broker crashes due to a power failure before the files are removed from the actual disk, they should eventually be deleted from the disk. | |
   | 3 | LeaderEpochCheckpoint | Every leader epoch entry (i.e. epoch and its start offset) in the log segments must also exist in the leader-epoch checkpoint file on the disk | - If some epoch entries are missing from the checkpoint file, Kafka may restore a stale leader epoch cache upon restart after a power failure.<br>- It would then return wrong entries when the leader epoch cache is read (e.g. in list-offsets request handling) | On the other hand, surplus entries (prefixes or suffixes) in the checkpoint file on the disk are acceptable, because even on restart after a power failure they will be truncated during the log loading procedure. |
   
   Based on the table above, we can confirm that the changes are valid as follows:
   
   - Change (1): fsync for ProducerStateManager.takeSnapshot is moved to the async scheduler thread
     * This preserves consistency expectation No.1, since we still increment the recovery point only after the fsync is performed
   - Change (2): ProducerStateManager.removeAndMarkSnapshotForDeletion no longer flushes the parent dir
     * This preserves consistency expectation No.2, since snapshot files will eventually be deleted even after a power failure.
         - Scenario A: Snapshot files are renamed to `-deleted` for segment deletion by log retention, but the broker crashes due to a power failure before the rename is persisted to the disk
             * In this case, some snapshot file names would be reverted (the `-deleted` suffix stripped), so ProducerStateManager would load these snapshot files unnecessarily.
             * However, since the producer state is truncated based on the log segments during the log loading procedure, this won't be a problem.
         - Scenario B: Snapshot files are renamed to `-deleted` for topic deletion, but the broker crashes due to a power failure before the rename is persisted to the disk
             * In this case as well, snapshot file names would be reverted (the `-deleted` suffix stripped).
             * However, on topic deletion the parent log dir has already been renamed to `-delete` and fsynced, so the reverted snapshot files aren't a problem: the parent log dirs will be deleted anyway once the topic deletion procedure resumes.
   - Change (3): LeaderEpochFileCache.truncateFromStart doesn't call fsync
     * This preserves consistency expectation No.3, since we still call fsync on LeaderEpochFileCache.assign (see the sketch after this list).
   - Change (4): LeaderEpochFileCache.truncateFromEnd doesn't call fsync
     * Same explanation as Change (3)
   
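   A minimal, hypothetical model of changes (3) and (4) — this is not the real LeaderEpochFileCache API, and the checkpoint format and truncation semantics are simplified for illustration. `assign` still rewrites and fsyncs the checkpoint file, so every epoch in the log is also on disk (expectation No.3), while `truncateFromStart`/`truncateFromEnd` only rewrite the file without forcing it; surplus entries that survive a power failure are truncated again during log loading.

   ```java
   import java.io.IOException;
   import java.nio.channels.FileChannel;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;
   import java.util.TreeMap;

   // Simplified, hypothetical leader-epoch checkpoint cache (not the real Kafka class).
   class LeaderEpochCheckpointSketch {
       private final Path checkpointFile;
       private final TreeMap<Integer, Long> epochToStartOffset = new TreeMap<>();

       LeaderEpochCheckpointSketch(Path checkpointFile) {
           this.checkpointFile = checkpointFile;
       }

       /** New epoch entries must reach the disk: rewrite the checkpoint file and fsync it. */
       synchronized void assign(int epoch, long startOffset) throws IOException {
           epochToStartOffset.put(epoch, startOffset);
           writeCheckpoint(true);
       }

       /** Log-start-offset moved forward (e.g. deleteRecords): rewrite without fsync. */
       synchronized void truncateFromStart(long newStartOffset) throws IOException {
           // Simplified: drop epochs that start before the new log-start-offset.
           epochToStartOffset.entrySet().removeIf(e -> e.getValue() < newStartOffset);
           writeCheckpoint(false);
       }

       /** Log truncated from the end: rewrite without fsync. */
       synchronized void truncateFromEnd(long endOffset) throws IOException {
           epochToStartOffset.entrySet().removeIf(e -> e.getValue() >= endOffset);
           writeCheckpoint(false);
       }

       private void writeCheckpoint(boolean sync) throws IOException {
           StringBuilder sb = new StringBuilder();
           epochToStartOffset.forEach((epoch, offset) ->
                   sb.append(epoch).append(' ').append(offset).append('\n'));
           Files.write(checkpointFile, sb.toString().getBytes(StandardCharsets.UTF_8));
           if (sync) {
               // Force the file contents to disk; only assign() pays this cost now.
               try (FileChannel ch = FileChannel.open(checkpointFile, StandardOpenOption.WRITE)) {
                   ch.force(true);
               }
           }
       }
   }
   ```
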
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   

