David Ribeiro Alves has posted comments on this change. Change subject: [catalog_manager] categorization of rw operation failures ......................................................................
Patch Set 26: (25 comments) http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/integration-tests/catalog_manager_tsk-itest.cc File src/kudu/integration-tests/catalog_manager_tsk-itest.cc: PS26, Line 82: We want this to happen as often as possible, but : // due to truncation issues (double --> int64_t) two TSK keys might be : // generated with interval less than 1 second, which is not desirable. not sure what's the intent of this sentence? do you do something to mitigate the "undesirability"? PS26, Line 88: master_tsk_propagated_by_non_leaders this flag is a bit cryptic, at first I thought this was referring to tablet servers. not sure it's defined in this patch but master_non_leader_masters_propagate_tsk would be clearer, IMO PS26, Line 114: builder : .default_admin_operation_timeout(timeout) no need to wrap, in fact could you do all of this in the decl line? Line 126: // signed by the key which hasn't yet reached the tserver. This can with a key which hasn't yet reached the tserver or non-leader master (or somesuch) PS26, Line 132: IPKI can you TODO yourself? PS26, Line 134: Ideally, (hb_interval_ms_ * 2) would be enough, but due : // to network latency and multi-process OS, the factor should be greater. how about: We allow for extra slop beyond the minimum (hb_interval_ms_ * 2) to account for network latency and thread preemption. PS26, Line 136: SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_ * 10)); these timeouts always come back to bite us in the form of flakyness. could we sleep the minimum and keep retrying for a while (always making sure that we get the expected error and none other) PS26, Line 137: // std::min<int64_t>(run_time_seconds_ * 1000 / 2, hb_interval_ms_ * 50))); did you mean to leave this? PS26, Line 185: //SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_)); left over garbage? http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS26, Line 471: const we never declare status const. I know you can here, but I haven't seen that anywhere, so I'd prefer to follow the same style PS26, Line 484: const same PS26, Line 492: LOG(WARNING) << err_msg << "will try next time"; this is not very informative PS26, Line 494: has left where did it go? :) PS26, Line 762: const same PS26, Line 777: leadership change in the middle: : // if the master server loses its leadership role, then there will be an : // error if trying to write into the system table. If that happens, we do : // not activate the newly generated CA information since it is no longer : // relevant. If it was leadership change indeed, the system table should : // contain the CA certificate information when this master is re-elected : // next. numbered bullets or some better way to lay this out. it's hard to parse as it is PS26, Line 794: a PS26, Line 857: LOG(INFO) << "Generated new certificate authority record"; can you provide more info? it'd be good to print a seq no or some identifier PS26, Line 914: using std::bind; using declarations go on the top of the file PS26, Line 919: auto wrapper = [this](std::function<Status()> func, : const Consensus& consensus, : int64_t start_term, : const char* op_description) { : leader_lock_.AssertAcquiredForWriting(); : const Status s = func(); not really sure what's happening here. can you simplify? PS26, Line 961: static const char* const kLoadMetaOpDescription = : "Loading table and tablet metadata into memory"; can you declare this const on the top of the file? PS26, Line 3231: LOG_WITH_PREFIX(INFO) : << "Latest config for has opid_index of " << latest_index : << " while this task has opid_index of " : << cstate_.config().opid_index() << ". Aborting task."; spurious reformat PS26, Line 3285: LOG_WITH_PREFIX(WARNING) : << "ChangeConfig() failed with leader " : << target_ts_desc_->ToString() : << " due to CAS failure. No further retry: " : << status.ToString(); : MarkFailed(); : break; : default: : LOG_WITH_PREFIX(INFO) : << "ChangeConfig() failed with leader " : << target_ts_desc_->ToString() : << " due to error " : << TabletServerErrorPB::Code_Name(resp_.error().code()) : << ". This operation will be retried. Error detail: " : << status.ToString(); : break; spurious reformat http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: PS26, Line 358: changed from the initial. what's the initial PS26, Line 568: of the object which object? PS26, Line 607: load TSK records from the system table and import them into the : // TokenSigner. After doing that check whether it's time to generate new : // token signing key; if so, then generate and store the new one in the system : // catalog table. Besides, purge the expired TSKs from the system table. can you split this into numbered steps (1, 1.1, 1.2, 2) etc -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 26 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Dan Burkert <danburk...@apache.org> Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes