Ashwani Raina has posted comments on this change. ( http://gerrit.cloudera.org:8080/24092 )
Change subject: [fs] add metrics for untracked orphaned blocks ...................................................................... Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/fs/log_block_manager.cc@2529 PS2, Line 2529: help > nit: held ? Done http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/fs/log_block_manager.cc@2530 PS2, Line 2530: lyring > nit: lying ? Done http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/fs/log_block_manager.cc@3363 PS2, Line 3363: // RemoveBlockIdsFromMetadata writes records in the same order as clbs, > Does this assumption hold for LogBlockContainerRdbMeta? Yes, the assumption should hold for RockDB variant as well, given the logic in there is similar to native container. However, I found one bug in RocksDB function that could result in it send back incorrect deleted_block_ids.size() if there was some intermediate error in writing to persistent storage that resulted in partial deletion. It seems to be missing cleanup routine that does a resize of out parameter if it runs into an error i.e., a partial deletion case scenario. I have raised a separate gerrit patch to address this issue: https://gerrit.cloudera.org/#/c/24134/ Please take a look. http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata-test.cc File src/kudu/tablet/tablet_metadata-test.cc: http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata-test.cc@441 PS2, Line 441: TEST_F(TestTabletMetadata, TestOrphanBlockDeletionFailed) { > Can we also assert orphaned_block_cleanup_failures_bytes in this test? I wanted to do that as well but the problem with that is the way we fetch size of block. The block size is determined by calling OpenBlock on each block for which delete failed. OpenBlock looks up into a in-memory shard of blocks (managed_block_shards_) indexed with block_id. Since we have already erased block entry from managed_block_shards_ before hitting EIO error which happens at later stage, OpenBlock fails and we end up with 0 bytes. Come to think of it, it seems to be miss from my side because orphaned_block_cleanup_failures_bytes is always going to be 0 because removal of block from in-memory shard of blocks happens way before actual on-disk deletion attempt is made. It is incorrect to rely on OpenBlock to get size for those blocks for which delete failed. Anyway, this is a cosmetic change to get more insight into bytes and overhead of making it work doesn't seem cost-effective. I am inclined toward removing this metric altogether. http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata.cc@601 PS2, Line 601: auto bm_size = fs_manager()->block_manager(); > do we need this declaration here? can't we reuse the above declared bm? We can reuse. Since, I have removed the use of failed bytes stat altogether so this code block won't be there now. http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metadata.cc@623 PS2, Line 623: // Regardless of whether we deleted all the blocks or not, remove them from : // the orphaned blocks list. If we failed to delete the blocks due to : // hardware issues, there's not much we can do and we assume the disk isn't : // coming back. At worst, this leaves some untracked orphaned blocks. > Maybe I dont have the complete picture yet, but doesnt this contradict that You are right. I missed out to make correction in above LOG message in PS2. In PS1, I had tried to incorporate the full fix by keeping track of un-deleted blocks in the superblock but it turned out to be a bit more complex. I chose to keep that change separate: https://gerrit.cloudera.org/#/c/24125/ Since this patch only takes care of collecting metrics, LOG_WITH_PREFIX comment is no more applicable. Removed. http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metrics.cc File src/kudu/tablet/tablet_metrics.cc: http://gerrit.cloudera.org:8080/#/c/24092/2/src/kudu/tablet/tablet_metrics.cc@247 PS2, Line 247: > nit: extra new line? Done -- To view, visit http://gerrit.cloudera.org:8080/24092 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id386d9fc8d0900839e229e66772f35299b3ef2e9 Gerrit-Change-Number: 24092 Gerrit-PatchSet: 2 Gerrit-Owner: Ashwani Raina <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Zoltan Martonka <[email protected]> Gerrit-Comment-Date: Tue, 24 Mar 2026 17:08:40 +0000 Gerrit-HasComments: Yes
