Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek Chennaka, Wang Xixu,
I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#11).

Change subject: KUDU-3527 Fix block manager test when using 64k container block alignment
......................................................................

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where we use 64k alignment for data blocks.

Root cause:
Currently tablets fail to load if the ".metadata" file is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to a value greater than zero, there is a chance that, when we delete a container, we delete only the ".metadata" file and leave the ".data" file behind. So deleting containers with injected IO errors is expected to sometimes prevent the block manager from restarting properly. However, container deletion almost never occurred in this test until we ran it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 bytes of data into each block, which is padded to 8k. So with the current test settings a container holds 32 blocks when it becomes full (FLAGS_log_container_max_size = 256k). Later we delete exactly half of the 500 blocks we wrote. The chance of deleting all 32 blocks in a container is very small, and even if it happens, there is only around a 0.09 chance that the container becomes corrupted. So the test is slightly flaky, but it would fail less than once in a billion runs. If you dramatically decrease the FLAGS_log_container_max_size flag, the test starts to fail occasionally on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), a full container file holds only 4 blocks. We write 500 blocks, so we expect to have nearly 125 full files. If we delete exactly half of the blocks, many (full) container files will become empty.
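The block counts above can be double-checked with a small standalone sketch. It is not code from the patch or from Kudu: the helper names (PadToAlignment, BlocksPerFullContainer, ProbContainerEmptied) are hypothetical, the "container is full" rule is simplified to plain integer division, and the deleted half of the blocks is assumed to be chosen uniformly at random.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `size` up to the next multiple of `alignment`.
// Illustrative helper only, not Kudu's actual padding code.
int64_t PadToAlignment(int64_t size, int64_t alignment) {
  assert(alignment > 0);
  return (size + alignment - 1) / alignment * alignment;
}

// Number of padded blocks that fit into a container of `container_max_size`
// bytes. Simplification: the real block manager decides fullness differently.
int64_t BlocksPerFullContainer(int64_t block_size, int64_t alignment,
                               int64_t container_max_size) {
  return container_max_size / PadToAlignment(block_size, alignment);
}

// Probability that all `k` blocks of one container are among the `deleted`
// blocks, assuming the deleted set is a uniformly random subset of `total`
// blocks (hypergeometric: product of (deleted - i) / (total - i)).
double ProbContainerEmptied(int64_t k, int64_t total, int64_t deleted) {
  if (k > deleted) return 0.0;
  double p = 1.0;
  for (int64_t i = 0; i < k; ++i) {
    p *= static_cast<double>(deleted - i) / static_cast<double>(total - i);
  }
  return p;
}
```

Under these assumptions, BlocksPerFullContainer(6080, 4096, 256 * 1024) gives 32 and BlocksPerFullContainer(6080, 65536, 256 * 1024) gives 4, matching the numbers in the commit message. ProbContainerEmptied(4, 500, 250) is about 0.06 per container, so with roughly 125 full containers several of them are expected to become empty in every run, while ProbContainerEmptied(32, 500, 250) is on the order of 1e-10.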
Some of these emptied containers might fail to be deleted properly, leaving behind a lone non-empty ".data" file without its ".metadata" file. On my RHEL machine the test fails 97-98% of the time for this exact reason.

Solution:
TestMetadataOkayDespiteFailure was supposed to test reloading the block manager with containers that have deleted blocks, after some previously failed deletes. It (probably) never tested the case where container deletion occurs.
+ Disabled container deletion, so the test scope remains the same as it was with smaller block alignments.
+ Added a new (currently disabled) test to see how the block manager handles the situation described above. Filed a JIRA to find a solution (KUDU-3528).

The original issue is not ARM-specific, is far from trivial to solve, and has always been in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 90 insertions(+), 0 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/11
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Martonka <zmarto...@cloudera.com>
Gerrit-Reviewer: Abhishek Chennaka <achenn...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ara...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mre...@cloudera.com>
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka <zmarto...@cloudera.com>