[kudu-CR] KUDU-3407 Avoid unchecked scheduling of flush operations.

Alexey Serbin (Code Review) Mon, 13 Nov 2023 20:04:23 -0800

Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20166 )


Change subject: KUDU-3407 Avoid unchecked scheduling of flush operations.
......................................................................


Patch Set 29:

(6 comments)

Overall looks good to me, just a few questions and nits.

Thank you very much for working on this!

http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@12
PS29, Line 12: eventually break
             : due to OOM
> Yes, normally tablet servers reject write requests if their memory usage re
Right -- overall performance is dropping, so every operation takes longer time 
to complete.  However, longer running operations leads make the memory pressure 
building up faster, and the requests are rejected with a higher ratio, right? 
If so, then it's not clear how OOM could result out of such performance drop.  
Actually, the memory pressure mechanism is there to prevent OOM under such 
conditions.

I guess the OOM could happen because of at least the following:
  1. Some operations aren't accounted for and run despite of the memory 
pressure, and they might consume a lot of memory.
  2. Operations are given wrong memory estimates when the process is on the 
threshold of an OOM condition, and allowed to run and they in fact consume much 
more memory they estimated, so it results in OOM condition.

I know that item 1 is present for sure, and it's reported by KUDU-3406: running 
CompactRowSetsOp on a long history of deltas might consume a lot of memory.

Do you think it could be the culprit behind the OOM in this case as well?  Have 
you tried to check memory profile of the processes that were eventually killed 
by the OOM killer?


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@16
PS29, Line 16: server is under memory pressure
> Not really actually, this patch has been applied in over 100 clusters for a
Ah, that's good news.  Thanks for the clarification.


http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc
File src/kudu/util/maintenance_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@584
PS34, Line 584:
nit: fix the alignment for this line


http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@1022
PS34, Line 1022:
What's "e"?  This looks like an incomplete word or a typo.


http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@1023
PS34, Line 1023: ntenan
nit: greater


http://gerrit.cloudera.org:8080/#/c/20166/34/src/kudu/util/maintenance_manager-test.cc@1026
PS34, Line 1026:   op3.set_sleep_time(MonoDelta::FromMilliseconds(30));
               :   op3.set_register_self(true);
These margins make sense from the theoretical perspective, but how did you 
verify that these margins are safe enough to avoid flakiness in this test?

Just curious: did you try to run this test scenario a test with 
--stress_cpu_threads=16 flag in TSAN and ASAN build configurations many times 
(say, setting --gtest_repeat=200)?



--
To view, visit http://gerrit.cloudera.org:8080/20166
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2fd3a850cf99d54ef2980211b712468440ed80
Gerrit-Change-Number: 20166
Gerrit-PatchSet: 29
Gerrit-Owner: Song Jiacheng <songjiach...@thinkingdata.cn>
Gerrit-Reviewer: Abhishek Chennaka <achenn...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ara...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Song Jiacheng <songjiach...@thinkingdata.cn>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Comment-Date: Tue, 14 Nov 2023 03:59:43 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-3407 Avoid unchecked scheduling of flush operations.

Reply via email to