chia7712 commented on PR #15634: URL: https://github.com/apache/kafka/pull/15634#issuecomment-2063435567
> HWM is set to localLogStartOffset in [UnifiedLog#updateLocalLogStartOffset](https://sourcegraph.com/github.com/apache/kafka@f895ab5145077c5efa10a4a898628d901b01e2c2/-/blob/core/src/main/scala/kafka/log/UnifiedLog.scala?L162), then we load the HWM from the checkpoint file in [Partition#createLog](https://sourcegraph.com/github.com/apache/kafka@f895ab5145077c5efa10a4a898628d901b01e2c2/-/blob/core/src/main/scala/kafka/cluster/Partition.scala?L495). If the HWM checkpoint file is missing / does not contain the entry for partition, then the default value of 0 is taken. If 0 < LogStartOffset (LSO), then LSO is assumed as HWM. Thus, the non-monotonic update of highwatermark from LLSO to LSO can happen.

Pardon me, I'm a bit confused about this. Please feel free to correct me to help me catch up :smile:

### case 0: the checkpoint file is missing and the remote storage is **disabled**

The LSO is initialized to LLSO

https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/core/src/main/scala/kafka/log/LogLoader.scala#L180

so I can't see why a non-monotonic update would happen. After all, LLSO and LSO are the same in this scenario.

### case 1: the checkpoint file is missing and the remote storage is **enabled**

The LSO is initialized to `logStartOffsetCheckpoint`, which is 0 since there are no checkpoint files.

https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/core/src/main/scala/kafka/log/LogLoader.scala#L178

The HWM will then be updated to LLSO, which is larger than zero.

https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/core/src/main/scala/kafka/log/UnifiedLog.scala#L172

This could be a problem when [Partition#createLog](https://sourcegraph.com/github.com/apache/kafka@f895ab5145077c5efa10a4a898628d901b01e2c2/-/blob/core/src/main/scala/kafka/cluster/Partition.scala?L495) gets called, since the HWM is changed from LLSO (non-zero) to LSO (zero).
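
To make the two cases easier to compare, here is a toy model of the sequence as I read it. It is only a sketch, not Kafka code: the class `HwmRecoverySketch`, its methods, and the offset value 100 are hypothetical, a missing checkpoint entry is modeled as `null`, and the `Partition#createLog` step is assumed to be "take the checkpointed HWM (default 0), but not less than LSO", following the quoted description.

```java
/**
 * Toy model of the restart sequence discussed in this comment.
 * All names here are hypothetical; this is NOT the Kafka implementation.
 */
class HwmRecoverySketch {

    /**
     * Simplified LogLoader step: choose the initial log start offset (LSO).
     * Assumption: with remote storage enabled the LSO comes from the checkpoint
     * (0 when the entry is missing); otherwise it equals the local log start offset (LLSO).
     */
    static long initialLogStartOffset(boolean remoteEnabled, Long lsoCheckpoint, long localLogStartOffset) {
        long checkpointed = (lsoCheckpoint == null) ? 0L : lsoCheckpoint;
        return remoteEnabled ? checkpointed : localLogStartOffset;
    }

    /**
     * Returns the two HWM values observed during recovery:
     * [0] after the updateLocalLogStartOffset step, [1] after the createLog step.
     */
    static long[] hwmTrace(boolean remoteEnabled, Long lsoCheckpoint, Long hwmCheckpoint, long localLogStartOffset) {
        long lso = initialLogStartOffset(remoteEnabled, lsoCheckpoint, localLogStartOffset);

        // Step 1 (modeled on UnifiedLog#updateLocalLogStartOffset): HWM is set to LLSO.
        long hwmAfterLoad = localLogStartOffset;

        // Step 2 (modeled on Partition#createLog): load the checkpointed HWM, defaulting to 0,
        // and fall back to LSO when the loaded value is below it.
        long checkpointedHwm = (hwmCheckpoint == null) ? 0L : hwmCheckpoint;
        long hwmAfterCreateLog = Math.max(checkpointedHwm, lso);

        return new long[] {hwmAfterLoad, hwmAfterCreateLog};
    }

    public static void main(String[] args) {
        long llso = 100L; // hypothetical base offset of the first local segment

        long[] case0 = hwmTrace(false, null, null, llso); // remote storage disabled
        long[] case1 = hwmTrace(true, null, null, llso);  // remote storage enabled

        System.out.println("case 0: HWM " + case0[0] + " -> " + case0[1]);
        System.out.println("case 1: HWM " + case1[0] + " -> " + case1[1]);
    }
}
```

Under these assumptions, case 0 keeps the HWM at 100 across both steps, while case 1 moves it from 100 back to 0, which is the non-monotonic update in question.
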
Also, the incorrect HWM causes an error in `convertToOffsetMetadataOrThrow`.

If I understand correctly, the root cause seems to be that "when the checkpoint files are not working, we will initialize a `UnifiedLog` with an incorrect LSO". Could we fix that by rebuilding `logStartOffsets` from remote storage when the checkpoint is not working (https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/core/src/main/scala/kafka/log/LogManager.scala#L459)?