Todd Lipcon has posted comments on this change. Change subject: KUDU-1863: improve overall safety of graceful server shutdown ......................................................................
Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/7183/2//COMMIT_MSG Commit Message: PS2, Line 15: we can guarantee that there : are no reactor threads running RPC callbacks within the subsystems that we : may be trying to shut down to guarantee lack of RPC callbacks, wouldn't step 1/2 also need to be ensuring that _outbound_ RPCs are all marked failed/aborted/whatever? Line 31: generic subsystems. does this have any issues with hitting the webserver after things like TSTabletManager are shut down, etc? PS2, Line 36: fixing KUDU-1863 in : the process isn't KUDU-1863 still an issue because the SyncWrite might be blocked waiting for an operation to be committed even when all peers are down? Or do we properly abort it? http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/kserver/kserver.cc File src/kudu/kserver/kserver.cc: PS2, Line 87: RPC callbacks can you be a bit more explicit about inbound vs outbound? i.e there might be RPC callbacks executing (due to the respnose of an outbound RPC) and there might be inbound RPCs executing (on the RPC worker threads or pending but not occupying an explicit RPC thread, eg waiting somewhere in consensus). Messenger::Shutdown() is enough to ensure that the RPC workers shut down and are joined, perhaps, but what about the case where the RPC worker has deferred an RpcContext off to some other thread? What will happen then when that other thread calls RpcContext::Respond? Is that OK? (I suppose it would be fine to just black-hole the response?) http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/rpc/messenger.cc File src/kudu/rpc/messenger.cc: Line 387: // Release the map outside of the lock. technically I dont think the extra scope is necessary, since 'guard' is defined after 'to_release' it will be destructed first, right? that said, maybe there's illustrative value in the extra scope, so if you think it's nice, feel free to leave it. Another option might be to use unique_lock and then add an explicit .unlock() Line 400: // Release the service outside of the lock. same http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/server/server_base.cc File src/kudu/server/server_base.cc: PS2, Line 475: // First, shut down incoming connections and wait for any outstanding RPCs to : // finish processing. similar to comment elsewhere, not sure this actually finishes all RPCs since they may be deferring responses to other worker pools -- To view, visit http://gerrit.cloudera.org:8080/7183 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes