Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20166 )

Change subject: KUDU-3407 Avoid unchecked scheduling of flush operations.
......................................................................


Patch Set 29:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@11
PS29, Line 11: wal
nit: WAL


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@12
PS29, Line 12: eventually break
             : due to OOM
I'm curious what do you think was the actual reason behind that OOM condition.  
If the tablet server flushes MRSs, and pushes back the incoming write 
operations according to the memory pressure, even if each read/write operation 
takes longer and longer, I'd assume that would just make the memory pressure 
higher (so less write operations per second are admitted), but why does it 
eventually goes OOM?


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@15
PS29, Line 15: add
nit: adds


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@16
PS29, Line 16: server is under memory pressure
By your experience with applying this functionality in a real cluster, wouldn't 
it increase the chances of an OOM condition?  IIUC, avoiding flushing MRSs and 
instead focusing on running compactions would be triggering an OOM under heavy 
load, no?


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@27
PS29, Line 27:
nit: remove the trailing space


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@28
PS29, Line 28: thememory
nit: the memory


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc
File src/kudu/util/maintenance_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@314
PS29, Line 314: that
nit: drop 'that'


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@316
PS29, Line 316: schedule_time_sum_
nit: usually it's called queue time or something, meaning how much time an 
entity has spent waiting in the queue after it had been added into the queue of 
runnable tasks


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@318
PS29, Line 318: count_
nit: maybe, name this 'run_count_'?


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@389
PS29, Line 389: indicate_memory_pressure_
I guess it makes sense to rename this field since its semantics is different 
now: 'memory_pressure_pct_' might be a good option.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@982
PS29, Line 982: Perf improvement op should be scheduled, not the memory op.
What is the probability of actually choosing a memory flushing op here?  In 
other words, how often do you expect this to fail?  Would be great to have a 
comment about that here.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@985
PS29, Line 985: op1.DurationHistogram()->TotalCount(), 1
nit for here and elsewhere for ASSERT_EQ(): the expected value comes first


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@998
PS29, Line 998: This test would not assert anything since it tests the 
probability flags
Even with probabilistic behavior, isn't it always possible at least to check 
against the expected number of operations with a wide, but still meaningful 
safety margin?  If running thousands of operations, I'd think that those 
numbers are within particular bounds that depend on the induced memory pressure 
level, with very high probability, and having 1 out of 1 trillion runs failed 
isn't a big deal -- the natural flakiness of the IC infrastructure is much 
higher than that :)


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@999
PS29, Line 999: ComprehensiveTest
Given this test scenario runs way over 3 seconds, please add  
SKIP_IF_SLOW_NOT_ALLOWED() as the very first line in this scope to skip running 
this if KUDU_ALLOW_SLOW_TESTS env variable isn't set to 1.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1006
PS29, Line 1006:       "run_non_memory_ops_prob = $1, 
data_gc_prioritization_prob = $2",
               :       indicate_memory_pressure_.load(), 
FLAGS_run_non_memory_ops_prob,
               :       FLAGS_data_gc_prioritization_prob);
Why to output this if all these variables aren't even changing at this point 
after they have been assigned in the code just above?  Consider removing this 
logging since it doesn't provide any useful information from the standpoint of 
an automated test.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1035
PS29, Line 1035: MaintenanceMgr num
What is 'MaintenanceMgr num'?


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1036
PS29, Line 1036: StartManager(1);
Wrap this with NO_FATALS() ?


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1044
PS29, Line 1044:   SleepFor(MonoDelta::FromMilliseconds(10000));
Why to wait so long?  Would be great to comment on this.  Alternatively, may be 
it's better to rather check for the number of operations run by the maintenance 
manager so far, and shoot for a fixed threshold there.  That would be better if 
thinking about variations across hardware capabilities of the node where this 
test is run.

You could use DurationHistogram()->TotalCount() for an operation to check how 
many times it has been run so far.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1046
PS29, Line 1046:   // Wait until all the operations are done.
               :   SleepFor(MonoDelta::FromMilliseconds(8000));
Is there a way to check for the number of operations still running or 
operations that are to be scheduled?  E.g., check against 
maintenance_ops_running_, etc.?



--
To view, visit http://gerrit.cloudera.org:8080/20166
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2fd3a850cf99d54ef2980211b712468440ed80
Gerrit-Change-Number: 20166
Gerrit-PatchSet: 29
Gerrit-Owner: Song Jiacheng <songjiach...@thinkingdata.cn>
Gerrit-Reviewer: Abhishek Chennaka <achenn...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ara...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Song Jiacheng <songjiach...@thinkingdata.cn>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wang Xixu <1450306...@qq.com>
Gerrit-Comment-Date: Mon, 16 Oct 2023 21:10:07 +0000
Gerrit-HasComments: Yes

Reply via email to