[GitHub] [kafka] junrao commented on a change in pull request #8936: KAFKA-10207: Fixed padded timeindex causing premature data deletion

GitBox Tue, 07 Jul 2020 10:26:23 -0700


junrao commented on a change in pull request #8936:
URL: https://github.com/apache/kafka/pull/8936#discussion_r451024109




##########
File path: core/src/main/scala/kafka/log/LogSegment.scala
##########
@@ -87,6 +87,12 @@ class LogSegment private[log] (val log: FileRecords,
       // we will recover the segments above the recovery point in recoverLog()
       // in any case so sanity checking them here is redundant.
       txnIndex.sanityCheck()
+      // Failing to sanity check the timeIndex can result in a scenario where 
log segments are
+      // prematurely deleted (before breaching retention periods) if the index 
file was not resized
+      // to disk successfully.
+      // KAFKA-10207
+      timeIndex.sanityCheck()
+      offsetIndex.sanityCheck()

Review comment:
       @Johnny-Malizia : Thanks for reporting this. A couple of comments on 
this.
   
   (1) It seems that there are cases where the index file is not trimmed 
properly. Every rolled segment calls LogSegment.onBecomeInactiveSegment(), 
which does trimming. So, we probably should just fix the trim logic there.
   
   (2) The sanity checking of index may have merit itself if it's cheap. The 
intention of KIP-263 is to avoid unnecessary loading (and the sanity check) of 
indexes never used. This is mostly achieved through lazy indexes (actual index 
file only opened lazily on first use). Since the biggest cost of loading an 
index file is probably the I/O part, and the sanity check is computational only 
and relatively cheap, we can probably just do the index sanity check on index 
opening. This way, in the common case, we only pay the index sanity checking 
cost on active segments.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] junrao commented on a change in pull request #8936: KAFKA-10207: Fixed padded timeindex causing premature data deletion

Reply via email to