[kudu-CR] [rpc] validate security-related parameters earlier

2024-01-26 Thread Alexey Serbin (Code Review)
Alexey Serbin has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/20961


Change subject: [rpc] validate security-related parameters earlier
..

[rpc] validate security-related parameters earlier

This patch moves the verification of a few security-related RPC flags
from the connection negotiation time (i.e. DoServerNegotiation()) into
the server bootstrap time (i.e. RpcServer::Init()).

This is more optimal because:
  * misconfiguration are spot earlier: if this type of misconfiguration
is detected, the outcome is visible and actionable outright since
the server simply refuses to start, reporting on the issue
  * there isn't a way to have a Kudu server running with such an issue
  * some amount of CPU resources is spared since a server no longer
performs the same verification over and over again when establishing
a new RPC connection, but performs it once during the start-up time

Change-Id: Ie16537c440041c9ee74276da35f2599592ef7e04
---
M src/kudu/rpc/negotiation.cc
M src/kudu/server/rpc_server.cc
2 files changed, 14 insertions(+), 7 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/20961/1
--
To view, visit http://gerrit.cloudera.org:8080/20961
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie16537c440041c9ee74276da35f2599592ef7e04
Gerrit-Change-Number: 20961
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-26 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20725 )

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..


Patch Set 14: Code-Review+1

(9 comments)

Thanks a lot for revving the patch.

Almost there, just a few nits and typos to fix.

http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@21
PS14, Line 21: RHEL 8.8 ARM 64k core
Did you mean "RHEL 8.8 ARM with 64K page size"?


http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@36
PS14, Line 36: 64k core
nit: 64K page size ?


http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@51
PS14, Line 51: Added a jira to find a solution
nit: Filed a JIRA issue to track the issue


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc
File src/kudu/fs/block_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@161
PS14, Line 161: CreateABlock
style nit: why not just CreateBlock() ?  If you want to emphasize it's a single 
block, consider naming this method CreateOneBlock()  By the code style used in 
the project, 'a' and 'the' articles aren't used in the method names.

You could find the corresponding style guide's section (it calls for 
consistency in naming) at 
https://google.github.io/styleguide/cppguide.html#Naming


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@942
PS14, Line 942: is
are


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@949
PS14, Line 949: resistent
resistant


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@981
PS14, Line 981: // The chance that we delete the
  : // .metadata but fail to delete the .data file is (1-x) * x
I guess that's a bit simplistic given that the deletion of these two files 
aren't exactly independent (i.e. if there is a failure to delete the first 
file, we do not even try to delete the second, right?), but it's good enough as 
an approximation to show what to expect once the test is enabled.

Thank you for adding this comment!


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1022
PS14, Line 1022: ,
nit: remove the comma?


http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1028
PS14, Line 1028: "
nit: remove the stray quote?



--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 
Gerrit-Comment-Date: Fri, 26 Jan 2024 23:21:41 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-26 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#14).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 95 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/14
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] [WIP][catalog manager] Skip eviction during bootstrapping

2024-01-26 Thread Code Review
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20958

to look at the new patch set (#2).

Change subject: [WIP][catalog_manager] Skip eviction during bootstrapping
..

[WIP][catalog_manager] Skip eviction during bootstrapping

If the leader replica shuts down and returns to the quorum, it will take
some time to bootstrap. During this time, it is possible to send a full
tablet report it still reckons itself as the leader and since it is
bootstrapping, the overall health is unknown. This can cause trouble as
it there is a check that the leader must be always healthy.
This issue is solved by checking if the tablet is bootstrapping and if
it is then eviction check is skipped. There is no need for new test
because this change fixes ToolTest::TestRecreateCMeta flakiness and
verifies the correct behaviour indirectly.

Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf
---
M src/kudu/master/catalog_manager.cc
1 file changed, 4 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/58/20958/2
--
To view, visit http://gerrit.cloudera.org:8080/20958
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf
Gerrit-Change-Number: 20958
Gerrit-PatchSet: 2
Gerrit-Owner: Ádám Bakai 
Gerrit-Reviewer: Kudu Jenkins (120)


[kudu-CR] [WIP][catalog manager] Skip eviction during bootstrapping

2024-01-26 Thread Code Review
Ádám Bakai has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/20958


Change subject: [WIP][catalog_manager] Skip eviction during bootstrapping
..

[WIP][catalog_manager] Skip eviction during bootstrapping

If the leader replica shuts down and returns to the quorum, it will take
some time to bootstrap. During this time, it is possible to send a full
tablet report it still reckons itself as the leader and since it is
bootstrapping, the overall health is unknown. This can cause trouble as
it there is a check that the leader must be always healthy.
This issue is solved by checking if the tablet is bootstrapping and if
it is then eviction check is skipped. There is no need for new test
because this change fixes ToolTest::TestRecreateCMeta flakiness and
verifies the correct behaviour indirectly.

Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf
---
M src/kudu/master/catalog_manager.cc
1 file changed, 4 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/58/20958/1
--
To view, visit http://gerrit.cloudera.org:8080/20958
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf
Gerrit-Change-Number: 20958
Gerrit-PatchSet: 1
Gerrit-Owner: Ádám Bakai 


[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment

2024-01-26 Thread Zoltan Martonka (Code Review)
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek 
Chennaka, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20725

to look at the new patch set (#13).

Change subject: KUDU-3527 Fix block manager test when using 64k container block 
alignment
..

KUDU-3527 Fix block manager test when using 64k container block alignment

BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system
where we use 64k alignment for data blocks.

Root cause:
Currently  tablets fail to load if .metadata is missing but there
is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater
than zero, then there is a chance that, when we delete the container file,
we only delete the ".meta", but leave the ".data" file.
So deleting containers with injected io errors is expected to
sometime prevent the block manager from restarting properly.

However container deletion almost never occurred in this test until we run
it on the new RHEL 8.8 ARM 64k core.

Why is it stable on x86_64:
On x86_64 we usually use 4k block alignment. We write 6080 byte data into a
block, which is padded to 8k. So in the current test settings we have 32
blocks in a container when it becomes full
(FLAGS_log_container_max_size = 256k). Later we delete exactly half of the
500 blocks we wrote. The chance of deleting all 32 blocks in a container
is very small, and even if it happens, it still has around 0.09 chance to
become corrupted. It is a bit flaky, but it would fail less than once in a
billion run.
If you dramatically decrease the FLAGS_log_container_max_size flag, the test
starts to occasionally fail on a traditional x86_64 machine too.

Why is it unstable with 64k alignment:
With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4
blocks in a full container file. We write 500 blocks, so we expect to have
nearly 125 full files. If we delete exactly half of the blocks, we will
make many (full) container file empty. Some of them might fail to be deleted
properly leaving a lonely non-empty .data file without .metadata. On my RHEL
machine the test fails 97-98% of the time for this exact reason.

Solution:
The test TestMetadataOkayDespiteFailure was supposed to test reloading the
block manager with containers having deleted blocks, with some previous
failed delete. It (probably) never tested the case when container deletion
occurs.
+ Disabled container deletion, so the test scope remains the same as it was
  with smaller block alignments.
+ Add a new (currently disabled) test, to see how block manager handles the
above described situation. Added a jira to find a solution (KUDU-3528).

The original issue is not ARM specific, and far from trivial to solve, and
was always in the system.

Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
---
M src/kudu/fs/block_manager-test.cc
1 file changed, 95 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/13
--
To view, visit http://gerrit.cloudera.org:8080/20725
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20
Gerrit-Change-Number: 20725
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Martonka 
Gerrit-Reviewer: Abhishek Chennaka 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Ashwani Raina 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Reviewer: Zoltan Martonka 


[kudu-CR] [log block manager] Write lock for deletion

2024-01-26 Thread Code Review
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20901

to look at the new patch set (#5).

Change subject: [log_block_manager] Write lock for deletion
..

[log_block_manager] Write lock for deletion

Both removing blocks and compacting metadata manipulates the
live_blocks_ value and file content. This way it can happen that on
thread A live_blocks_ is updated, on thread B metadata is
compacted and live_blocks_ value is overwritten, then thread A writes
deletion records in the metadata file. This way it creates inconsistency
between file content and live_blocks_ value, which leads to incorrect
behaviour to skip compaction later.
This fix makes sure that CompactMetadata and deletion can not overlap
each other. No new test is needed because it fixes LogBlockManagerTest::
TestContainerBlockLimitingByMetadataSizeWithCompaction flakiness.

Change-Id: I73712a5fd8eccf23621f8abda3e99ebdf10a769f
---
M src/kudu/fs/log_block_manager.cc
1 file changed, 2 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/01/20901/5
--
To view, visit http://gerrit.cloudera.org:8080/20901
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I73712a5fd8eccf23621f8abda3e99ebdf10a769f
Gerrit-Change-Number: 20901
Gerrit-PatchSet: 5
Gerrit-Owner: Ádám Bakai 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Attila Bukor 
Gerrit-Reviewer: Kudu Jenkins (120)


[kudu-CR] KUDU-3535 Should clear log cache while tombstoning a replica.

2024-01-26 Thread Song Jiacheng (Code Review)
Hello Kudu Jenkins, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20822

to look at the new patch set (#5).

Change subject: KUDU-3535 Should clear log cache while tombstoning a replica.
..

KUDU-3535 Should clear log cache while tombstoning a replica.

The log cache of a replica still exists even if the replica has been
already tombstoned. This problem might take place if we decrease
the replication factor of a table with high throughput.

So we should clear the log cache while deleting the replica with
delete type TABLET_DATA_TOMBSTONED.

Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9
---
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache.cc
M src/kudu/consensus/log_cache.h
5 files changed, 29 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/20822/5
--
To view, visit http://gerrit.cloudera.org:8080/20822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9
Gerrit-Change-Number: 20822
Gerrit-PatchSet: 5
Gerrit-Owner: Song Jiacheng 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Song Jiacheng 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>


[kudu-CR] RKUDU-3535 Should clear log cache while tombstoning a replica.

2024-01-26 Thread Song Jiacheng (Code Review)
Song Jiacheng has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20822 )

Change subject: RKUDU-3535 Should clear log cache while tombstoning a replica.
..


Patch Set 4:

> Patch Set 3:
>
> (1 comment)

Thanks for the review!
I added a test to confirm the log cache is cleared.


--
To view, visit http://gerrit.cloudera.org:8080/20822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9
Gerrit-Change-Number: 20822
Gerrit-PatchSet: 4
Gerrit-Owner: Song Jiacheng 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Song Jiacheng 
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Comment-Date: Fri, 26 Jan 2024 09:59:03 +
Gerrit-HasComments: No


[kudu-CR] RKUDU-3535 Should clear log cache while tombstoning a replica.

2024-01-26 Thread Song Jiacheng (Code Review)
Hello Kudu Jenkins, Wang Xixu,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20822

to look at the new patch set (#4).

Change subject: RKUDU-3535 Should clear log cache while tombstoning a replica.
..

RKUDU-3535 Should clear log cache while tombstoning a replica.

The log cache of a replica still exists even if the replica has been
already tombstoned. This problem might take place if we decrease
the replication factor of a table with high throughput.

So we should clear the log cache while deleting the replica with
delete type TABLET_DATA_TOMBSTONED.

Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9
---
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache.cc
M src/kudu/consensus/log_cache.h
5 files changed, 29 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/20822/4
--
To view, visit http://gerrit.cloudera.org:8080/20822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9
Gerrit-Change-Number: 20822
Gerrit-PatchSet: 4
Gerrit-Owner: Song Jiacheng 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>