Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/20166 )
Change subject: KUDU-3407 Avoid unchecked scheduling of flush operations. ...................................................................... Patch Set 29: (6 comments) Overall looks good to me, just a few questions and nits. Thank you very much for working on this! http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@12 PS29, Line 12: eventually break : due to OOM > Yes, normally tablet servers reject write requests if their memory usage re Right -- overall performance is dropping, so every operation takes longer time to complete. However, longer running operations leads make the memory pressure building up faster, and the requests are rejected with a higher ratio, right? If so, then it's not clear how OOM could result out of such performance drop. Actually, the memory pressure mechanism is there to prevent OOM under such conditions. I guess the OOM could happen because of at least the following: 1. Some operations aren't accounted for and run despite of the memory pressure, and they might consume a lot of memory. 2. Operations are given wrong memory estimates when the process is on the threshold of an OOM condition, and allowed to run and they in fact consume much more memory they estimated, so it results in OOM condition. I know that item 1 is present for sure, and it's reported by KUDU-3406: running CompactRowSetsOp on a long history of deltas might consume a lot of memory. Do you think it could be the culprit behind the OOM in this case as well? Have you tried to check memory profile of the processes that were eventually killed by the OOM killer? http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@16 PS29, Line 16: server is under memory pressure > Not really actually, this patch has been applied in over 100 clusters for a Ah, that's good news. Thanks for the clarification. http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc File src/kudu/util/maintenance_manager-test.cc: http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@584 PS34, Line 584: nit: fix the alignment for this line http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@1022 PS34, Line 1022: What's "e"? This looks like an incomplete word or a typo. http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@1023 PS34, Line 1023: ntenan nit: greater http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@1026 PS34, Line 1026: op3.set_sleep_time(MonoDelta::FromMilliseconds(30)); : op3.set_register_self(true); These margins make sense from the theoretical perspective, but how did you verify that these margins are safe enough to avoid flakiness in this test? Just curious: did you try to run this test scenario a test with --stress_cpu_threads=16 flag in TSAN and ASAN build configurations many times (say, setting --gtest_repeat=200)? -- To view, visit http://gerrit.cloudera.org:8080/20166 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idc2fd3a850cf99d54ef2980211b712468440ed80 Gerrit-Change-Number: 20166 Gerrit-PatchSet: 29 Gerrit-Owner: Song Jiacheng <songjiach...@thinkingdata.cn> Gerrit-Reviewer: Abhishek Chennaka <achenn...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <ale...@apache.org> Gerrit-Reviewer: Ashwani Raina <ara...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Song Jiacheng <songjiach...@thinkingdata.cn> Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Wang Xixu <1450306...@qq.com> Gerrit-Comment-Date: Tue, 14 Nov 2023 03:59:43 +0000 Gerrit-HasComments: Yes