[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-29 Thread Alexey Serbin (Code Review)
Alexey Serbin has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM with 64K page size.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 with 64k page size),
there are 4 blocks in a full container file. We write 500 blocks, so we
expect to have nearly 125 full files. If we delete exactly half of the
blocks, we will make many (full) container file empty. Some of them might
fail to be deleted properly leaving a lonely non-empty .data file without
.metadata. On my RHEL machine the test fails 97-98% of the time for this
exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Filed a JIRA issue to track the issue: KUDU-3528.

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Reviewed-on: http://gerrit.cloudera.org:8080/20725
Reviewed-by: Ashwani Raina 
Reviewed-by: Alexey Serbin 
Tested-by: Alexey Serbin 
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 97 insertions(+), 16 deletions(-)

Approvals:
  Ashwani Raina: Looks good to me, but someone else must approve
  Alexey Serbin: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 16
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-29 Thread Alexey Serbin (Code Review)
Alexey Serbin has removed a vote on this change.

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Removed Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-29 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 15: Verified+1

unrelated test failures (ASAN, TSAN)


--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Tue, 30 Jan 2024 05:33:26 +
Gerrit-HasComments: No


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-29 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 15: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc
File src/kudu/fs/block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@981
PS14, Line 981: // The chance that we delete the
  : // .metadata but fail to delete the .data file is (1-x) * x
> Look at the last 2 lines of LogBlockContainerNativeMeta::~LogBlockContainer
I see -- thank you for the clarification.  Indeed, in the destructor the only 
error handling is adding a warning log message if an operation returns non-OK 
status, that's it.  For some reason, I was under impression the removal of 
those files is done in a different place, but I should have double-checked.

Then it's a good news: this estimate is a precise one.



--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Tue, 30 Jan 2024 01:34:18 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-29 Thread Ashwani Raina (Code Review)
Ashwani Raina has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 15: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Mon, 29 Jan 2024 16:30:28 +
Gerrit-HasComments: No


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-29 Thread Zoltan Martonka (Code Review)
Zoltan Martonka has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 14:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@21
PS14, Line 21: RHEL 8.8 ARM 64k core
> Did you mean "RHEL 8.8 ARM with 64K page size"?
Done


http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@36
PS14, Line 36: 64k core
> nit: 64K page size ?
Done


http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@51
PS14, Line 51: Added a jira to find a solution
> nit: Filed a JIRA issue to track the issue
Done


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc
File src/kudu/fs/block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@161
PS14, Line 161: CreateABlock
> style nit: why not just CreateBlock() ?  If you want to emphasize it's a si
Thank you.
I just used the original lambda name (create_a_block), did not gave it much 
thought. I will rename it to CreateBlock()


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@942
PS14, Line 942: is
> are
Done


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@949
PS14, Line 949: resistent
> resistant
Done


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@981
PS14, Line 981: // The chance that we delete the
  : // .metadata but fail to delete the .data file is (1-x) * x
> I guess that's a bit simplistic given that the deletion of these two files
Look at the last 2 lines of 
LogBlockContainerNativeMeta::~LogBlockContainerNativeMeta().

There is no error handling there. You can verify that the delete indeed happens 
when this smart pointers are reset:

+ Put a breakpoint to line 995 (CommitDeletedBlocks) in block_manager-test.cc 
and run the new test (use --gtest_also_run_disabled_tests)
+ When the breakpoint is hit, put a breakpoint into "Status DeleteFile(...)" in 
env_posix.cc. Resume the debugging.

You will see a .data then a .metadata delete. you can check the debug callstack 
to see they are indeed called from ~LogBlockContainerNativeMeta()

So the deletion error chances are independent as long as it is our artificial 
error and not a real disk error.


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1022
PS14, Line 1022: ,
> nit: remove the comma?
Done


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1028
PS14, Line 1028: "
> nit: remove the stray quote?
Done



--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Mon, 29 Jan 2024 16:26:25 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-29 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#15).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM with 64K page size.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 with 64k page size),
there are 4 blocks in a full container file. We write 500 blocks, so we
expect to have nearly 125 full files. If we delete exactly half of the
blocks, we will make many (full) container file empty. Some of them might
fail to be deleted properly leaving a lonely non-empty .data file without
.metadata. On my RHEL machine the test fails 97-98% of the time for this
exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Filed a JIRA issue to track the issue: KUDU-3528.

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 97 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/15
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-26 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 14: Code-Review+1

(9 comments)

Thanks a lot for revving the patch.

Almost there, just a few nits and typos to fix.

http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@21
PS14, Line 21: RHEL 8.8 ARM 64k core
Did you mean "RHEL 8.8 ARM with 64K page size"?


http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@36
PS14, Line 36: 64k core
nit: 64K page size ?


http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@51
PS14, Line 51: Added a jira to find a solution
nit: Filed a JIRA issue to track the issue


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc
File src/kudu/fs/block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@161
PS14, Line 161: CreateABlock
style nit: why not just CreateBlock() ?  If you want to emphasize it's a single 
block, consider naming this method CreateOneBlock()  By the code style used in 
the project, 'a' and 'the' articles aren't used in the method names.

You could find the corresponding style guide's section (it calls for 
consistency in naming) at 
https://google.github.io/styleguide/cppguide.html#Naming


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@942
PS14, Line 942: is
are


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@949
PS14, Line 949: resistent
resistant


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@981
PS14, Line 981: // The chance that we delete the
  : // .metadata but fail to delete the .data file is (1-x) * x
I guess that's a bit simplistic given that the deletion of these two files 
aren't exactly independent (i.e. if there is a failure to delete the first 
file, we do not even try to delete the second, right?), but it's good enough as 
an approximation to show what to expect once the test is enabled.

Thank you for adding this comment!


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1022
PS14, Line 1022: ,
nit: remove the comma?


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1028
PS14, Line 1028: "
nit: remove the stray quote?



--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Fri, 26 Jan 2024 23:21:41 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-26 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#14).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 95 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/14
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-26 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#13).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 95 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/13
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-25 Thread Zoltan Martonka (Code Review)
Zoltan Martonka has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/20725/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20725/9//COMMIT_MSG@53
PS9, Line 53: ARM
> Not related to this change. Just for my understanding.
On rhel system, where I am testing default is xfs. (ext4 on ubuntu)
The problem only occurs when using LogBlockManagerNativeMeta, both the original 
unmodified test and the newly added passes when using FileBlockManager.

If you compile and run Kudu on the arm cloudcat images , it will use 
LogBlockManagerNativeMeta without problems (probably the same in public aws).


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc
File src/kudu/fs/block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@933
PS9, Line 933: Bolck
> nit: Block
thx.


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@935
PS9, Line 935: because it can't handle a .data file
> Maybe we can write:
thx


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@939
PS9, Line 939: smarted delete
> Not sure what "smarted delete" means. If you are pointing to approach of "f
I was pointing to that, but I will just write your second suggestion.


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@984
PS9, Line 984: 0.9994174
> Add a one-liner to show how you arrived with this calculation. Haven't veri
ok



--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Thu, 25 Jan 2024 17:40:57 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-25 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#12).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 95 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/12
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-25 Thread Ashwani Raina (Code Review)
Ashwani Raina has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 9:

(10 comments)

Overall looks good to me. Just a few nits.

http://gerrit.cloudera.org:8080/#/c/20725/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20725/9//COMMIT_MSG@53
PS9, Line 53: ARM
Not related to this change. Just for my understanding.
On ARM, what is the out-of-the-box underlying filesystem type?
Do we end up using LBM or File block manager on ARM?


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc
File src/kudu/fs/block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@933
PS9, Line 933: Bolck
nit: Block


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@935
PS9, Line 935: because it can't handle a .data file
Maybe we can write:
because it can't handle a .data file whose corresponding .metadata file is not 
present.


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@939
PS9, Line 939: smarted delete
Not sure what "smarted delete" means. If you are pointing to approach of "first 
moving file to an intermediate/hidden state and then delete", then you may want 
to mention that in detailed words here, for better understanding. Or you can 
simply mention that comment needs to be rephrased/revisited after KUDU-3528 is 
fixed.


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@959
PS9, Line 959: auto create_a_block = [&](BlockId* out, const string& test_data) 
{
 : unique_ptr block;
 : RETURN_NOT_OK(this->bm_->CreateBlock(this->test_block_opts_, 
&block));
 : RETURN_NOT_OK(block->Append(test_data));
 : RETURN_NOT_OK(block->Finalize());
 : RETURN_NOT_OK(block->Close());
 : *out = block->id();
 : return Status::OK();
 :   };
Do you think it makes sense to move this into a class method that can be called 
from different tests?


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@982
PS9, Line 982: container
nit: containers


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@982
PS9, Line 982: of
remove


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@983
PS9, Line 983: with
is


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@984
PS9, Line 984: chance
remove


http://gerrit.cloudera.org:8080/#/c/20725/9/src/kudu/fs/block_manager-test.cc@984
PS9, Line 984: 0.9994174
Add a one-liner to show how you arrived with this calculation. Haven't verified 
it, but maybe something like this:
[1 - (0.75)^250] ~ 0.9994174



--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Thu, 25 Jan 2024 13:26:54 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-25 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#11).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 90 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/11
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-25 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#10).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 91 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/10
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-25 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#9).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 91 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/9
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-24 Thread Zoltan Martonka (Code Review)
Zoltan Martonka has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 7:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/20725/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20725/7//COMMIT_MSG@10
PS7, Line 10: fs_block_size
> What's "fs_block_size"?  Is it indeed the filesystem's block size (i.e. wha
I will use "container block alignment". (it does not matter if it's st_blksize 
or f_bsize or some other value).


http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc
File src/kudu/fs/block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc@928
PS7, Line 928:   const int kNumTries = 3;
 :   const int kNumBlockTries = 500;
 :   const int kNumAppends = 4;
> nit: could these be constexpr?
thanks.


http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc@939
PS7, Line 939: It is not the
 :   // scope of this test to test this case.
> This is looks quite confusing given what this test scenario is doing.  Inst
I changed/simplified the test.


http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc@942
PS7, Line 942: like the 64k option of rhel 8.8
> What are those exactly?
I will change it to "If we use 64k block alignment (e.g RHEL ARM 8.8 64k core)"


http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc@951
PS7, Line 951: If you set a lower value, the test becomes flaky. With 0.2 it 
almost always fails.
> This might become confusing later on when the root cause of KUDU-3528 is ad
I try to give a better explanation. Also the root cause of the error is not 
that a container becomes full, but that we also delete all blocks.


http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc@982
PS7, Line 982:   if (s.ok()) {
 : InsertOrDie(&ids, id);
 : num_created++;
 :   }
> If this is just a way to overcome injected IO errors, why not to set FLAGS_
It was important in the original test. It is not important here (but also not a 
problem).
I change the test to only have errors at deletion.
Thank you for the suggestion.


http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc@999
PS7, Line 999: if (it != ids.end()) {
 :   it++;
 : }
> nit for here and elsewhere: it would be great if you preserved more of the
Nothing. The original test tested what happens of we delete every 2nd block. So 
I copied it. It is better to delete all blocks. The problematic behaviour will 
occur more often. I change it.


http://gerrit.cloudera.org:8080/#/c/20725/7/src/kudu/fs/block_manager-test.cc@1033
PS7, Line 1033:   FLAGS_log_container_max_size = 1024 * 1024;
This is not a good fix.I think there still is a chance around 1e-5 that we will 
be left with a .data file without .metadata with this setting. I obviously 
can't test it in a reasonable time, but I will revert to the flag solution. 
Please read the new commit message for more info.



--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Wed, 24 Jan 2024 18:47:29 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-24 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#8).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 91 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/8
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka