Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#15).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
......................................................................

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems
where we use 64k alignment for data blocks.

Root cause:
    Currently, tablets fail to load if the ".metadata" file is missing but
    there is still a non-empty ".data" file. If FLAGS_env_inject_eio is set
    to a value greater than zero, there is a chance that, when we delete a
    container, we delete only the ".metadata" file and leave the ".data" file
    behind. So deleting containers with injected IO errors is expected to
    sometimes prevent the block manager from restarting properly.

    However, container deletion almost never occurred in this test until we
    ran it on the new RHEL 8.8 ARM with 64k page size.
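
    The failure mode described above can be sketched with a toy in-memory
    model. This is only an illustration: ContainerFiles, DeleteContainer and
    CanLoad are hypothetical names, not Kudu APIs, and the real block manager
    code is considerably more involved. The model assumes the metadata file is
    removed before the data file, as the description above implies.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the two on-disk files of one container.
struct ContainerFiles {
  bool has_metadata = true;
  bool has_data = true;
  int64_t data_size = 0;
};

// Removes the metadata file first, then the data file. `should_fail` plays
// the role of an injected IO error and is consulted before each removal.
// Returns true only if both files were removed.
bool DeleteContainer(ContainerFiles* c,
                     const std::function<bool()>& should_fail) {
  if (should_fail()) return false;  // Nothing was removed.
  c->has_metadata = false;
  if (should_fail()) return false;  // Metadata is gone, data is left behind.
  c->has_data = false;
  c->data_size = 0;
  return true;
}

// Models the load rule from the description: a non-empty data file without
// its metadata file cannot be loaded.
bool CanLoad(const ContainerFiles& c) {
  const bool orphaned_data = !c.has_metadata && c.has_data && c.data_size > 0;
  return !orphaned_data;
}
```

    If the injected error hits the second removal, DeleteContainer returns
    false and CanLoad reports the container as unloadable, which corresponds
    to the lone ".data" file case.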

Why is it stable on x86_64:
    On x86_64 we usually use 4k block alignment. We write 6080 bytes of data
    into each block, which is padded to 8k. So with the current test settings
    a container holds 32 blocks when it becomes full
    (FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
    500 blocks we wrote. The chance of deleting all 32 blocks in a container
    is very small, and even if that happens, the container has only about a
    0.09 chance of becoming corrupted. So the test is technically flaky, but
    it would fail less than once in a billion runs.
    If you dramatically decrease FLAGS_log_container_max_size, the test
    starts to fail occasionally on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
    With 64k alignment (currently used on ARM RHEL 8.8 with 64k page size),
    a full container file holds 4 blocks. We write 500 blocks, so we expect
    about 125 full files. If we delete exactly half of the blocks, many
    (full) container files become empty. Some of them might fail to be
    deleted properly, leaving a lone non-empty .data file without its
    .metadata file. On my RHEL machine the test fails 97-98% of the time for
    this exact reason.
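
    The block counts above, and the chance that one container loses all of
    its blocks, can be reproduced with a short calculation. This is a sketch
    under simplifying assumptions: a container is treated as full once
    container_max / padded_block_size blocks fit into it, and the deleted half
    is treated as a uniformly random subset of the 500 blocks. The helper
    names are made up for this illustration and are not part of the test.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `size` up to the next multiple of `alignment` (alignment > 0).
constexpr int64_t RoundUp(int64_t size, int64_t alignment) {
  return (size + alignment - 1) / alignment * alignment;
}

// Number of padded blocks that fit into a container of `container_max` bytes.
constexpr int64_t BlocksPerContainer(int64_t block_size, int64_t alignment,
                                     int64_t container_max) {
  return container_max / RoundUp(block_size, alignment);
}

// Probability that all `k` blocks of one given container are among the
// `deleted` blocks, when the deleted set is chosen uniformly at random
// from `total` blocks (hypergeometric, computed as a running product).
double ProbContainerEmptied(int64_t k, int64_t total, int64_t deleted) {
  assert(k >= 0 && k <= deleted && deleted <= total);
  double p = 1.0;
  for (int64_t i = 0; i < k; ++i) {
    p *= static_cast<double>(deleted - i) / static_cast<double>(total - i);
  }
  return p;
}
```

    With these assumptions, BlocksPerContainer(6080, 4096, 256 * 1024) is 32
    and BlocksPerContainer(6080, 65536, 256 * 1024) is 4.
    ProbContainerEmptied(4, 500, 250) is roughly 0.062, so about 7 or 8 of
    the 125 full containers are expected to become empty in each run, while
    ProbContainerEmptied(32, 500, 250) is below 2^-32.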

Solution:
    TestMetadataOkayDespiteFailure was supposed to test reloading the block
    manager with containers that have deleted blocks, after some earlier
    deletes failed. It (probably) never tested the case where container
    deletion occurs.
    + Disable container deletion, so the test scope remains the same as it
      was with smaller block alignments.
    + Add a new (currently disabled) test to see how the block manager
      handles the situation described above. Filed KUDU-3528 to track the
      issue.

    The original issue is not ARM-specific, is far from trivial to solve, and
    has always been in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 97 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/15
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Martonka <zmarto...@cloudera.com>
Gerrit-Reviewer: Abhishek Chennaka <achenn...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ara...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mre...@cloudera.com>
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka <zmarto...@cloudera.com>
