Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16952 )
Change subject: KUDU-2612: background task to commit transaction ...................................................................... Patch Set 7: Code-Review+1 (11 comments) http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc File src/kudu/integration-tests/txn_commit-itest.cc: http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@302 PS7, Line 302: DeleteTable Does it work as expected as well if dropping a tablet (just a part of a table), say, dropping one range of a range-partitioned table? Is it worth adding a test scenario for such case? http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@316 PS7, Line 316: TestRestartingWhileCommitting I think it would be great to add a scenario when TxnStatusManager instance is frozen/unfrozen (i.e. sent SIGSTOP/SIGCONT) during the commit phase. That should include a case when the leadership is transferred to another instance because of the unresponsive process hosting the former TxnStatusManager leader replica, but once the commit finalization is picked up by the new leader, the former leader is back and tries to continue its the commit phase finalization routine. It guess it might be a good test to verify the robustness of the commit phase and re-enterability of corresponding methods. What do you think? http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@565 PS7, Line 565: : // Wait a bit to allow would-be background aborts to happen. : SleepFor(MonoDelta::FromSeconds(1)); Is this going to put the txn keepalive thread to sleep as well? I'm not sure. Instead, why not to move the 'txn' handle into a sub-scope, and serialize it into a string which is stored in the top-level scope. Then, when the original 'txn' scope goes out of scope, it's possible to de-serialize the handle and use the de-serialized handle for calls like KuduTransaction::IsCommitComplete() and alike. What do you think? http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@690 PS7, Line 690: tsm_ts_->Shutdown(); I guess a real 'abrupt' way of terminating the process is sending SIGKILL :) Do you think it's worth adding a scenario like that in this test suite? Doing that in a separate change list might be the best option, I guess. http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/integration-tests/txn_commit-itest.cc@754 PS7, Line 754: } It would be great to add at least one failure scenario, when finalizing the commit phase fails to verify that TxnManager doesn't crash and the transaction fails to finalize: * one participant responses with an error to its BEGIN_COMMIT request (e.g., only one replica out of 3 is running -- BTW, is it going to turn into Status::TimedOut() seen at the TxnStatusManager's side or that would be something else?) * sending the commit timestamp from a participant fails (again, because of only one replica out of 3 is running) http://gerrit.cloudera.org:8080/#/c/16952/4/src/kudu/transactions/txn_status_manager.cc File src/kudu/transactions/txn_status_manager.cc: http://gerrit.cloudera.org:8080/#/c/16952/4/src/kudu/transactions/txn_status_manager.cc@152 PS4, Line 152: > I don't think so, since I'm pretty sure the client automatically retries Se Ah, indeed. It seems I misunderstood where the status comes from (even if you added a comment :) ). Thank you for the clarification! http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc File src/kudu/transactions/txn_status_manager.cc: http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@201 PS7, Line 201: if (PREDICT_FALSE(s.IsNotFound())) { : LOG(INFO) << Substitute("Participant $0 was not found: $1", : participant_ids_[participant_idx], s.ToString()); After some consideration, I'm not quite sure this 'good-riddance' semantic is a good fit for all cases. E.g., what if some batch of data is being inserted, and a tablet (or a whole table) is dropped unintentionally during that process? From the user perspective (who isn't aware of the dropped tablet/table) that should be certainly an error -- the data that they thought is reliably persisted after the commit has successfully finalized is actually gone (what a surprise, isn't it?). Should we instead fail a transaction is such a case? Maybe, at least add a flag to make this behavior configurable, where by default a transaction fails to commit if one of its actors with some added data disappeared during the commit phase? http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@286 PS7, Line 286: DCHECK_EQ(0, ops_in_flight_); : ops_in_flight_ = participant_ids_.size(); nit: just for the sake of working with std::atomic in a consistent way, maybe replace this with something like: auto old_val = ops_in_flight_.exchange(participant_ids_.size()); DCHECK_EQ(0, old_val); ? http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@463 PS7, Line 463: MonoDelta::FromSeconds(10) Should this be a parameter controlled by a run-time flag? http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@471 PS7, Line 471: txn_client nit: maybe, move this outside of the 'for ()' cycle? http://gerrit.cloudera.org:8080/#/c/16952/7/src/kudu/transactions/txn_status_manager.cc@492 PS7, Line 492: // Stop any previously on-going tasks. : for (auto& txn_id_and_tasks : commits_in_flight) { : txn_id_and_tasks.second->stop(); : } Why not to do so prior to starting new commit tasks? I guess it might be less threads spawn if doing it the other way, no? If not, I guess it would be great to document the rationale behind the current sequence. -- To view, visit http://gerrit.cloudera.org:8080/16952 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie2258dded3ab3d527cb5d0abdc7d5e7deb4da15e Gerrit-Change-Number: 16952 Gerrit-PatchSet: 7 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Hao Hao <hao....@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Tue, 02 Feb 2021 06:19:50 +0000 Gerrit-HasComments: Yes