[kudu-CR] [rpc] validate security-related parameters earlier
Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/20961 Change subject: [rpc] validate security-related parameters earlier .. [rpc] validate security-related parameters earlier This patch moves the verification of a few security-related RPC flags from the connection negotiation time (i.e. DoServerNegotiation()) into the server bootstrap time (i.e. RpcServer::Init()). This is more optimal because: * misconfiguration are spot earlier: if this type of misconfiguration is detected, the outcome is visible and actionable outright since the server simply refuses to start, reporting on the issue * there isn't a way to have a Kudu server running with such an issue * some amount of CPU resources is spared since a server no longer performs the same verification over and over again when establishing a new RPC connection, but performs it once during the start-up time Change-Id: Ie16537c440041c9ee74276da35f2599592ef7e04 --- M src/kudu/rpc/negotiation.cc M src/kudu/server/rpc_server.cc 2 files changed, 14 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/20961/1 -- To view, visit http://gerrit.cloudera.org:8080/20961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ie16537c440041c9ee74276da35f2599592ef7e04 Gerrit-Change-Number: 20961 Gerrit-PatchSet: 1 Gerrit-Owner: Alexey Serbin
[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/20725 ) Change subject: KUDU-3527 Fix block manager test when using 64k container block alignment .. Patch Set 14: Code-Review+1 (9 comments) Thanks a lot for revving the patch. Almost there, just a few nits and typos to fix. http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@21 PS14, Line 21: RHEL 8.8 ARM 64k core Did you mean "RHEL 8.8 ARM with 64K page size"? http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@36 PS14, Line 36: 64k core nit: 64K page size ? http://gerrit.cloudera.org:8080/#/c/20725/14//COMMIT_MSG@51 PS14, Line 51: Added a jira to find a solution nit: Filed a JIRA issue to track the issue http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc File src/kudu/fs/block_manager-test.cc: http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@161 PS14, Line 161: CreateABlock style nit: why not just CreateBlock() ? If you want to emphasize it's a single block, consider naming this method CreateOneBlock() By the code style used in the project, 'a' and 'the' articles aren't used in the method names. You could find the corresponding style guide's section (it calls for consistency in naming) at https://google.github.io/styleguide/cppguide.html#Naming http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@942 PS14, Line 942: is are http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@949 PS14, Line 949: resistent resistant http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@981 PS14, Line 981: // The chance that we delete the : // .metadata but fail to delete the .data file is (1-x) * x I guess that's a bit simplistic given that the deletion of these two files aren't exactly independent (i.e. if there is a failure to delete the first file, we do not even try to delete the second, right?), but it's good enough as an approximation to show what to expect once the test is enabled. Thank you for adding this comment! http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1022 PS14, Line 1022: , nit: remove the comma? http://gerrit.cloudera.org:8080/#/c/20725/14/src/kudu/fs/block_manager-test.cc@1028 PS14, Line 1028: " nit: remove the stray quote? -- To view, visit http://gerrit.cloudera.org:8080/20725 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20 Gerrit-Change-Number: 20725 Gerrit-PatchSet: 14 Gerrit-Owner: Zoltan Martonka Gerrit-Reviewer: Abhishek Chennaka Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Ashwani Raina Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy Gerrit-Reviewer: Wang Xixu <1450306...@qq.com> Gerrit-Reviewer: Zoltan Martonka Gerrit-Comment-Date: Fri, 26 Jan 2024 23:21:41 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek Chennaka, Wang Xixu, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20725 to look at the new patch set (#14). Change subject: KUDU-3527 Fix block manager test when using 64k container block alignment .. KUDU-3527 Fix block manager test when using 64k container block alignment BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system where we use 64k alignment for data blocks. Root cause: Currently tablets fail to load if .metadata is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater than zero, then there is a chance that, when we delete the container file, we only delete the ".meta", but leave the ".data" file. So deleting containers with injected io errors is expected to sometime prevent the block manager from restarting properly. However container deletion almost never occurred in this test until we run it on the new RHEL 8.8 ARM 64k core. Why is it stable on x86_64: On x86_64 we usually use 4k block alignment. We write 6080 byte data into a block, which is padded to 8k. So in the current test settings we have 32 blocks in a container when it becomes full (FLAGS_log_container_max_size = 256k). Later we delete exactly half of the 500 blocks we wrote. The chance of deleting all 32 blocks in a container is very small, and even if it happens, it still has around 0.09 chance to become corrupted. It is a bit flaky, but it would fail less than once in a billion run. If you dramatically decrease the FLAGS_log_container_max_size flag, the test starts to occasionally fail on a traditional x86_64 machine too. Why is it unstable with 64k alignment: With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4 blocks in a full container file. We write 500 blocks, so we expect to have nearly 125 full files. If we delete exactly half of the blocks, we will make many (full) container file empty. Some of them might fail to be deleted properly leaving a lonely non-empty .data file without .metadata. On my RHEL machine the test fails 97-98% of the time for this exact reason. Solution: The test TestMetadataOkayDespiteFailure was supposed to test reloading the block manager with containers having deleted blocks, with some previous failed delete. It (probably) never tested the case when container deletion occurs. + Disabled container deletion, so the test scope remains the same as it was with smaller block alignments. + Add a new (currently disabled) test, to see how block manager handles the above described situation. Added a jira to find a solution (KUDU-3528). The original issue is not ARM specific, and far from trivial to solve, and was always in the system. Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20 --- M src/kudu/fs/block_manager-test.cc 1 file changed, 95 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/14 -- To view, visit http://gerrit.cloudera.org:8080/20725 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20 Gerrit-Change-Number: 20725 Gerrit-PatchSet: 14 Gerrit-Owner: Zoltan Martonka Gerrit-Reviewer: Abhishek Chennaka Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Ashwani Raina Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy Gerrit-Reviewer: Wang Xixu <1450306...@qq.com> Gerrit-Reviewer: Zoltan Martonka
[kudu-CR] [WIP][catalog manager] Skip eviction during bootstrapping
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20958 to look at the new patch set (#2). Change subject: [WIP][catalog_manager] Skip eviction during bootstrapping .. [WIP][catalog_manager] Skip eviction during bootstrapping If the leader replica shuts down and returns to the quorum, it will take some time to bootstrap. During this time, it is possible to send a full tablet report it still reckons itself as the leader and since it is bootstrapping, the overall health is unknown. This can cause trouble as it there is a check that the leader must be always healthy. This issue is solved by checking if the tablet is bootstrapping and if it is then eviction check is skipped. There is no need for new test because this change fixes ToolTest::TestRecreateCMeta flakiness and verifies the correct behaviour indirectly. Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf --- M src/kudu/master/catalog_manager.cc 1 file changed, 4 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/58/20958/2 -- To view, visit http://gerrit.cloudera.org:8080/20958 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf Gerrit-Change-Number: 20958 Gerrit-PatchSet: 2 Gerrit-Owner: Ádám Bakai Gerrit-Reviewer: Kudu Jenkins (120)
[kudu-CR] [WIP][catalog manager] Skip eviction during bootstrapping
Ádám Bakai has uploaded this change for review. ( http://gerrit.cloudera.org:8080/20958 Change subject: [WIP][catalog_manager] Skip eviction during bootstrapping .. [WIP][catalog_manager] Skip eviction during bootstrapping If the leader replica shuts down and returns to the quorum, it will take some time to bootstrap. During this time, it is possible to send a full tablet report it still reckons itself as the leader and since it is bootstrapping, the overall health is unknown. This can cause trouble as it there is a check that the leader must be always healthy. This issue is solved by checking if the tablet is bootstrapping and if it is then eviction check is skipped. There is no need for new test because this change fixes ToolTest::TestRecreateCMeta flakiness and verifies the correct behaviour indirectly. Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf --- M src/kudu/master/catalog_manager.cc 1 file changed, 4 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/58/20958/1 -- To view, visit http://gerrit.cloudera.org:8080/20958 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I0cbbe6b77d99d167fc58692047072f86d3d963cf Gerrit-Change-Number: 20958 Gerrit-PatchSet: 1 Gerrit-Owner: Ádám Bakai
[kudu-CR] KUDU-3527 Fix block manager test when using 64k container block alignment
Hello Mahesh Reddy, Alexey Serbin, Ashwani Raina, Kudu Jenkins, Abhishek Chennaka, Wang Xixu, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20725 to look at the new patch set (#13). Change subject: KUDU-3527 Fix block manager test when using 64k container block alignment .. KUDU-3527 Fix block manager test when using 64k container block alignment BlockManagerTest.TestMetadataOkayDespiteFailure might fail on system where we use 64k alignment for data blocks. Root cause: Currently tablets fail to load if .metadata is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is set to greater than zero, then there is a chance that, when we delete the container file, we only delete the ".meta", but leave the ".data" file. So deleting containers with injected io errors is expected to sometime prevent the block manager from restarting properly. However container deletion almost never occurred in this test until we run it on the new RHEL 8.8 ARM 64k core. Why is it stable on x86_64: On x86_64 we usually use 4k block alignment. We write 6080 byte data into a block, which is padded to 8k. So in the current test settings we have 32 blocks in a container when it becomes full (FLAGS_log_container_max_size = 256k). Later we delete exactly half of the 500 blocks we wrote. The chance of deleting all 32 blocks in a container is very small, and even if it happens, it still has around 0.09 chance to become corrupted. It is a bit flaky, but it would fail less than once in a billion run. If you dramatically decrease the FLAGS_log_container_max_size flag, the test starts to occasionally fail on a traditional x86_64 machine too. Why is it unstable with 64k alignment: With 64k alignment (currently used on ARM RHEL 8.8 64k core), there are 4 blocks in a full container file. We write 500 blocks, so we expect to have nearly 125 full files. If we delete exactly half of the blocks, we will make many (full) container file empty. Some of them might fail to be deleted properly leaving a lonely non-empty .data file without .metadata. On my RHEL machine the test fails 97-98% of the time for this exact reason. Solution: The test TestMetadataOkayDespiteFailure was supposed to test reloading the block manager with containers having deleted blocks, with some previous failed delete. It (probably) never tested the case when container deletion occurs. + Disabled container deletion, so the test scope remains the same as it was with smaller block alignments. + Add a new (currently disabled) test, to see how block manager handles the above described situation. Added a jira to find a solution (KUDU-3528). The original issue is not ARM specific, and far from trivial to solve, and was always in the system. Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20 --- M src/kudu/fs/block_manager-test.cc 1 file changed, 95 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/20725/13 -- To view, visit http://gerrit.cloudera.org:8080/20725 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I7e325bde502b7d7f39dd17fa84cb7eb42a3d7c20 Gerrit-Change-Number: 20725 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Martonka Gerrit-Reviewer: Abhishek Chennaka Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Ashwani Raina Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy Gerrit-Reviewer: Wang Xixu <1450306...@qq.com> Gerrit-Reviewer: Zoltan Martonka
[kudu-CR] [log block manager] Write lock for deletion
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20901 to look at the new patch set (#5). Change subject: [log_block_manager] Write lock for deletion .. [log_block_manager] Write lock for deletion Both removing blocks and compacting metadata manipulates the live_blocks_ value and file content. This way it can happen that on thread A live_blocks_ is updated, on thread B metadata is compacted and live_blocks_ value is overwritten, then thread A writes deletion records in the metadata file. This way it creates inconsistency between file content and live_blocks_ value, which leads to incorrect behaviour to skip compaction later. This fix makes sure that CompactMetadata and deletion can not overlap each other. No new test is needed because it fixes LogBlockManagerTest:: TestContainerBlockLimitingByMetadataSizeWithCompaction flakiness. Change-Id: I73712a5fd8eccf23621f8abda3e99ebdf10a769f --- M src/kudu/fs/log_block_manager.cc 1 file changed, 2 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/01/20901/5 -- To view, visit http://gerrit.cloudera.org:8080/20901 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I73712a5fd8eccf23621f8abda3e99ebdf10a769f Gerrit-Change-Number: 20901 Gerrit-PatchSet: 5 Gerrit-Owner: Ádám Bakai Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins (120)
[kudu-CR] KUDU-3535 Should clear log cache while tombstoning a replica.
Hello Kudu Jenkins, Wang Xixu, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20822 to look at the new patch set (#5). Change subject: KUDU-3535 Should clear log cache while tombstoning a replica. .. KUDU-3535 Should clear log cache while tombstoning a replica. The log cache of a replica still exists even if the replica has been already tombstoned. This problem might take place if we decrease the replication factor of a table with high throughput. So we should clear the log cache while deleting the replica with delete type TABLET_DATA_TOMBSTONED. Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9 --- M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache.cc M src/kudu/consensus/log_cache.h 5 files changed, 29 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/20822/5 -- To view, visit http://gerrit.cloudera.org:8080/20822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9 Gerrit-Change-Number: 20822 Gerrit-PatchSet: 5 Gerrit-Owner: Song Jiacheng Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Song Jiacheng Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
[kudu-CR] RKUDU-3535 Should clear log cache while tombstoning a replica.
Song Jiacheng has posted comments on this change. ( http://gerrit.cloudera.org:8080/20822 ) Change subject: RKUDU-3535 Should clear log cache while tombstoning a replica. .. Patch Set 4: > Patch Set 3: > > (1 comment) Thanks for the review! I added a test to confirm the log cache is cleared. -- To view, visit http://gerrit.cloudera.org:8080/20822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9 Gerrit-Change-Number: 20822 Gerrit-PatchSet: 4 Gerrit-Owner: Song Jiacheng Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Song Jiacheng Gerrit-Reviewer: Wang Xixu <1450306...@qq.com> Gerrit-Comment-Date: Fri, 26 Jan 2024 09:59:03 + Gerrit-HasComments: No
[kudu-CR] RKUDU-3535 Should clear log cache while tombstoning a replica.
Hello Kudu Jenkins, Wang Xixu, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/20822 to look at the new patch set (#4). Change subject: RKUDU-3535 Should clear log cache while tombstoning a replica. .. RKUDU-3535 Should clear log cache while tombstoning a replica. The log cache of a replica still exists even if the replica has been already tombstoned. This problem might take place if we decrease the replication factor of a table with high throughput. So we should clear the log cache while deleting the replica with delete type TABLET_DATA_TOMBSTONED. Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9 --- M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache.cc M src/kudu/consensus/log_cache.h 5 files changed, 29 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/20822/4 -- To view, visit http://gerrit.cloudera.org:8080/20822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6cf545e604f80d41e7ebd9660acfd2e928cd27a9 Gerrit-Change-Number: 20822 Gerrit-PatchSet: 4 Gerrit-Owner: Song Jiacheng Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>