[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-27 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..


KUDU-1863: improve overall safety of graceful server shutdown

What exactly is a "safe" graceful shutdown? I define it like this:
1. Stop handling network traffic (i.e. RPCs and HTTP requests):
   a. Future incoming RPCs are responded to with an error.
   b. Future outgoing RPCs are immediately rejected (i.e. their callbacks
      run on the issuing threads with an error status).
   c. Queued but not yet handled incoming RPCs are responded to with an
      error.
   d. A "black hole" is set up for anyone who tries to respond to a deferred
      incoming RPC (i.e. an RPC that was thunked from an RPC thread and may
      be waiting for external stimuli).
   e. Outgoing RPCs awaiting a response are marked as failed and their
      callbacks are run on RPC threads with an error status (see the sketch
      after this list).
2. Wait for any RPC/HTTP-related threads to finish. They may be processing
   an incoming RPC, processing the response from an outgoing RPC, or dealing
   with a failed RPC from step 1e.
3. Shut down RPC-using subsystems. Each one is responsible for cleaning up
   its deferred incoming RPCs. If any responses are sent, they will be black
   holed as per step 1d.
4. Shut down remaining server state.
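
For steps 1b and 1e, "run with an error status" means the caller's completion
callback fires exactly once, just with a failure instead of a response. A
minimal illustration (hypothetical names, not Kudu's actual RPC API):

    #include <functional>
    #include <iostream>
    #include <string>

    // Hypothetical sketch of steps 1b/1e: the issuing code's callback is
    // invoked with an error status rather than never being invoked at all.
    using ResponseCallback = std::function<void(const std::string& status)>;

    void FailOutboundCall(const ResponseCallback& cb) {
      // No response will ever arrive; fail the call eagerly.
      cb("Service unavailable: messenger is shutting down");
    }

    int main() {
      FailOutboundCall([](const std::string& s) { std::cout << s << "\n"; });
      return 0;
    }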

By ordering step 3 after steps 1 and 2, we guarantee that there are no
reactor threads modifying state belonging to the subsystems that we are
trying to shut down. Unfortunately, adhering to this order is very
difficult today: we must choose between doing only steps 1a and 1b, or
destroying the RPC subsystem altogether, which causes problems in step 3. Moreover, it's
not possible to stop handling HTTP requests without tearing down state,
which leads to failures in some subsystems.

So for now, we'll settle for a modified form of the above:
1. Stop accepting new RPCs.
2. Shut down RPC-using subsystems.
3. Destroy all RPC state, waiting for outstanding RPC threads to finish.
4. Shut down the rest of the server.
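
Concretely, the modified ordering amounts to something like the following
(a minimal sketch; the class and member names are illustrative stand-ins,
not Kudu's actual types):

    #include <memory>
    #include <vector>

    // Illustrative stand-in for anything with a Shutdown() method.
    struct Stoppable {
      virtual ~Stoppable() = default;
      virtual void Shutdown() = 0;
    };

    class ServerSketch {
     public:
      void Shutdown() {
        // 1. Stop accepting new RPCs (and stop the webserver).
        if (acceptor_) acceptor_->Shutdown();
        if (webserver_) webserver_->Shutdown();
        // 2. Shut down RPC-using subsystems (master-/tserver-specific ones
        //    before generic ones).
        for (auto& s : rpc_using_subsystems_) s->Shutdown();
        // 3. Destroy all RPC state; a synchronous messenger shutdown joins
        //    the reactor threads before returning.
        if (messenger_) messenger_->Shutdown();
        // 4. Shut down the rest of the server.
        rpc_using_subsystems_.clear();
      }

     private:
      std::unique_ptr<Stoppable> acceptor_;
      std::unique_ptr<Stoppable> webserver_;
      std::unique_ptr<Stoppable> messenger_;
      std::vector<std::unique_ptr<Stoppable>> rpc_using_subsystems_;
    };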

This patch implements this modified form of a safe shutdown:
- During shutdown, servers first disable the acceptance of new RPCs and stop
  the webserver.
- Servers now shut down master and tserver-specific subsystems before
  generic subsystems.
- Shutting down generic subsystems now explicitly waits for reactor
  threads to finish via a new synchronous Messenger::Shutdown.

The bulk of the patch is to Messenger::Shutdown and friends. While there, I
also reduced lock acquisition to just critical sections (fixing KUDU-1863 in
the process), extended the life of tls_context_ to Messenger destruction,
and changed Shutdown to be callable with registered services.
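
The shape of the reduced lock scope in the new synchronous Shutdown is
roughly as follows (a sketch under assumed, simplified member types; the
real Messenger holds considerably more state):

    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    class MessengerSketch {
     public:
      void Shutdown() {
        std::map<std::string, std::shared_ptr<int>> services_to_release;
        std::vector<std::thread> reactors_to_join;
        {
          // Critical section: only detach state while holding the lock.
          std::lock_guard<std::mutex> guard(lock_);
          if (closing_) return;
          closing_ = true;
          services_to_release = std::move(rpc_services_);
          reactors_to_join = std::move(reactor_threads_);
        }
        // The slow work happens outside the lock, so in-flight responses
        // that need 'lock_' cannot deadlock against Shutdown.
        for (auto& t : reactors_to_join) t.join();
        // 'services_to_release' is destroyed here, also outside the lock.
      }

     private:
      std::mutex lock_;
      bool closing_ = false;
      std::map<std::string, std::shared_ptr<int>> rpc_services_;
      std::vector<std::thread> reactor_threads_;
    };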

Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Reviewed-on: http://gerrit.cloudera.org:8080/7183
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon 
---
M src/kudu/master/master.cc
M src/kudu/rpc/messenger.cc
M src/kudu/rpc/messenger.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/server/rpc_server.cc
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
M src/kudu/tserver/tablet_server.cc
9 files changed, 176 insertions(+), 65 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-26 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7183

to look at the new patch set (#4).

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..

KUDU-1863: improve overall safety of graceful server shutdown

What exactly is a "safe" graceful shutdown? I define it like this:
1. Stop handling network traffic (i.e. RPCs and HTTP requests):
   a. Future incoming RPCs are responded to with an error.
   b. Future outgoing RPCs are immediately rejected (i.e. their callbacks
      run on the issuing threads with an error status).
   c. Queued but not yet handled incoming RPCs are responded to with an
      error.
   d. A "black hole" is set up for anyone who tries to respond to a deferred
      incoming RPC (i.e. an RPC that was thunked from an RPC thread and may
      be waiting for external stimuli).
   e. Outgoing RPCs awaiting a response are marked as failed and their
      callbacks are run on RPC threads with an error status.
2. Wait for any RPC/HTTP-related threads to finish. They may be processing
   an incoming RPC, processing the response from an outgoing RPC, or dealing
   with a failed RPC from step 1e.
3. Shut down RPC-using subsystems. Each one is responsible for cleaning up
   its deferred incoming RPCs. If any responses are sent, they will be black
   holed as per step 1d.
4. Shut down remaining server state.

By ordering step 3 after steps 1 and 2, we guarantee that there are no
reactor threads modifying state belonging to the subsystems that we are
trying to shut down. Unfortunately, adhering to this order is very
difficult today: we must choose between doing only steps 1a and 1b, or
destroying the RPC subsystem altogether, which causes problems in step 3. Moreover, it's
not possible to stop handling HTTP requests without tearing down state,
which leads to failures in some subsystems.

So for now, we'll settle for a modified form of the above:
1. Stop accepting new RPCs.
2. Shut down RPC-using subsystems.
3. Destroy all RPC state, waiting for outstanding RPC threads to finish.
4. Shut down the rest of the server.

This patch implements this modified form of a safe shutdown:
- During shutdown, servers first disable the acceptance of new RPCs and stop
  the webserver.
- Servers now shut down master and tserver-specific subsystems before
  generic subsystems.
- Shutting down generic subsystems now explicitly waits for reactor
  threads to finish via a new synchronous Messenger::Shutdown.

The bulk of the patch is to Messenger::Shutdown and friends. While there, I
also reduced lock acquisition to just critical sections (fixing KUDU-1863 in
the process), extended the life of tls_context_ to Messenger destruction,
and changed Shutdown to be callable with registered services.

Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
---
M src/kudu/master/master.cc
M src/kudu/rpc/messenger.cc
M src/kudu/rpc/messenger.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/server/rpc_server.cc
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
M src/kudu/tserver/tablet_server.cc
9 files changed, 176 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/83/7183/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-26 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7183/2//COMMIT_MSG
Commit Message:

Line 31:   generic subsystems.
> That's a good question. I didn't encounter any issues, but that's because a
It turns out that this isn't doable either, at least not right now.

The Webserver subsystem (and squeasel) doesn't provide a mechanism to stop 
receiving inbound HTTP requests without also destroying state (be it in-memory 
state or bound sockets). Once the state is gone, the tserver heartbeater 
threads fail in SetupRegistration().

So I'm going to roll this part of the change back (and update the commit 
message accordingly).


-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-26 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7183

to look at the new patch set (#3).

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..

KUDU-1863: improve overall safety of graceful server shutdown

What exactly is a "safe" graceful shutdown? I define it like this:
1. Stop handling network traffic (i.e. RPCs and HTTP requests):
   a. Future incoming RPCs are responded to with an error.
   b. Future outgoing RPCs are immediately rejected (i.e. their callbacks
      run on the issuing threads with an error status).
   c. Queued but not yet handled incoming RPCs are responded to with an
      error.
   d. A "black hole" is set up for anyone who tries to respond to a deferred
      incoming RPC (i.e. an RPC that was thunked from an RPC thread and may
      be waiting for external stimuli).
   e. Outgoing RPCs awaiting a response are marked as failed and their
      callbacks are run on RPC threads with an error status.
2. Wait for any RPC/HTTP-related threads to finish. They may be processing
   an incoming RPC, processing the response from an outgoing RPC, or dealing
   with a failed RPC from step 1e.
3. Shut down RPC-using subsystems. Each one is responsible for cleaning up
   its deferred incoming RPCs. If any responses are sent, they will be black
   holed as per step 1d.
4. Shut down remaining server state.

By ordering step 3 after steps 1 and 2, we guarantee that there are no
reactor threads modifying state belonging to the subsystems that we are
trying to shut down. Unfortunately, adhering to this order is very
difficult today: we only really have the capability to do steps 1a and 1b,
or to destroy the RPC subsystem altogether (causing problems during step 3).

So for now, we'll settle for a modified form of the above:
1. Stop accepting new RPCs and HTTP requests, waiting for any HTTP-related
   threads to finish processing.
2. Shut down RPC-using subsystems.
3. Destroy all RPC state, waiting for outstanding RPC threads to finish.
4. Shut down the rest of the server.

This patch implements this modified form of a safe shutdown:
- During shutdown, servers first disable the acceptance of new RPCs and stop
  the webserver.
- Servers now shut down master and tserver-specific subsystems before
  generic subsystems.
- Shutting down generic subsystems now explicitly waits for reactor
  threads to finish via a new synchronous Messenger::Shutdown.

The bulk of the patch is to Messenger::Shutdown and friends. While there, I
also reduced lock acquisition to just critical sections (fixing KUDU-1863 in
the process), extended the life of tls_context_ to Messenger destruction,
and changed Shutdown to be callable with registered services.

Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
---
M src/kudu/master/master.cc
M src/kudu/rpc/messenger.cc
M src/kudu/rpc/messenger.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/server/rpc_server.cc
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
M src/kudu/tserver/tablet_server.cc
9 files changed, 174 insertions(+), 62 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/83/7183/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-26 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..


Patch Set 2:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/7183/2//COMMIT_MSG
Commit Message:

PS2, Line 15: we can guarantee that there
: are no reactor threads running RPC callbacks within the subsystems that we
: may be trying to shut down
> to guarantee lack of RPC callbacks, wouldn't step 1/2 also need to be ensur
Ideally yes, but since we can't do that today, individual subsystems are 
responsible for safely canceling their outbound RPCs as part of shutting 
themselves down. Some do it well, others, not so well (I'm looking at you, 
CatalogManager).

I'll add that level of detail here and show how what we do is even more 
crippled than what I wrote originally.


Line 31:   generic subsystems.
> does this have any issues with hitting the webserver after things like TSTa
That's a good question. I didn't encounter any issues, but that's because all 
of our "hit the web UI periodically" coverage uses external mini clusters.

I'm not confident that all subsystems are protected against webserver handlers 
running concurrently with shutdown (or even running after shutdown), so I'll 
change "step 1" to also shut down the web UI.


PS2, Line 36: fixing KUDU-1863 in
: the process
> isn't KUDU-1863 still an issue because the SyncWrite might be blocked waiti
Hmm. There are two different kinds of deadlocks described in KUDU-1863.

The first (and the one that motivated filing the bug, AFAICT) has to do with 
the peers responding to a SyncWrite, but those responses are blocked on the 
messenger lock because there's a concurrent Shutdown() that's blocked on the 
SyncWrite while holding the messenger lock. I believe this patch addresses this.
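
Distilled, the first deadlock has this shape (hypothetical names; the real
code paths involve consensus peers and the reactor threads):

    #include <future>
    #include <mutex>

    std::mutex messenger_lock;
    std::promise<void> sync_write_done;

    // Thread A: shutdown waits on the SyncWrite *while holding* the lock.
    void BuggyShutdown() {
      std::lock_guard<std::mutex> l(messenger_lock);
      sync_write_done.get_future().wait();  // Never returns...
    }

    // Thread B: a peer's response needs the same lock before it can signal.
    void ResponsePath() {
      std::lock_guard<std::mutex> l(messenger_lock);  // ...because this blocks.
      sync_write_done.set_value();
    }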

The second is the one you described in the third comment of the bug report. 
This patch doesn't address that (though I didn't see a concrete repro for it 
either; did someone run into it?)

So, should I remove the references to KUDU-1863 from this commit? Or just from 
the summary? Or just explain the distinction in this part of the description?


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/kserver/kserver.cc
File src/kudu/kserver/kserver.cc:

PS2, Line 87: RPC callbacks
> can you be a bit more explicit about inbound vs outbound? i.e there might b
This is really tricky to document well, but I gave it another shot (also see 
the new commit message). More feedback would be appreciated.

Note that Messenger::Shutdown destroys too much state to effectively "black 
hole" deferred incoming RPCs. Notably, when the reactor threads exit they reset 
the Messenger ref belonging to each Reactor, and this causes 
RpcContext::Respond to crash (I allude to this in server_base.h). If you can 
think of an easy way around this, I'm all ears.


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/rpc/messenger.cc
File src/kudu/rpc/messenger.cc:

PS2, Line 302: ThreadRestrictions::AssertWaitAllowed();
> Should this be within the ShutdownInternal() instead?  I.e., add AssertWait
Done


PS2, Line 324: services_to_release = std::move(rpc_services_);
 : pools_to_shutdown = std::move(acceptor_pools_);
> Why std::move() instead of swap?  I read David's comment on PS1 and your re
It's a style I prefer, I guess. swap() isn't bad but it forces the reader to 
track one more piece of state (i.e. the contents of the local container) when 
they're evaluating the code; move() ensures that the existing contents are 
blown away.
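
For illustration, the two spellings side by side (an illustrative snippet,
not code from the patch; in practice both leave the member empty):

    #include <map>
    #include <string>
    #include <utility>

    std::map<std::string, int> rpc_services = {{"svc", 1}};

    void DetachForRelease() {
      std::map<std::string, int> to_release;
      // Style used in the patch: the reader needn't track what 'to_release'
      // held before, and the source is (in practice) left empty.
      to_release = std::move(rpc_services);
      // Equivalent alternative discussed above:
      // to_release.swap(rpc_services);
    }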


Line 387:   // Release the map outside of the lock.
> technically I dont think the extra scope is necessary, since 'guard' is def
Yeah, I left the extra scope in for its descriptive value. I don't always 
notice the order in which variables have been defined, so I thought being 
explicit would prevent someone from accidentally undoing the fix.


Line 400:   // Release the service outside of the lock.
> same
See above.
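
The interaction under discussion, in miniature (an illustrative snippet, not
the patch's code):

    #include <map>
    #include <mutex>

    std::mutex lock_;
    std::map<int, int> state_;

    void ReleaseOutsideLock() {
      // 'to_release' is declared before 'guard' below, so even without the
      // extra scope it would be destroyed after the lock is released; the
      // scope just makes that ordering explicit.
      std::map<int, int> to_release;
      {
        std::lock_guard<std::mutex> guard(lock_);
        to_release = std::move(state_);
      }
      // The map's contents die here, outside the lock.
    }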


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/server/server_base.cc
File src/kudu/server/server_base.cc:

PS2, Line 475:   // First, shut down incoming connections and wait for any outstanding RPCs to
 :   // finish processing.
> similar to comment elsewhere, not sure this actually finishes all RPCs sinc
Reworded.


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/server/server_base.h
File src/kudu/server/server_base.h:

PS2, Line 129: rejected
> Could you be more specific on what 'rejected' means?  I.e. what status woul
I think that's too much information for this code comment, especially given the 
client response error code churn that we've seen recently.


-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 2

[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-20 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7183/2//COMMIT_MSG
Commit Message:

PS2, Line 15: we can guarantee that there
: are no reactor threads running RPC callbacks within the subsystems that we
: may be trying to shut down
to guarantee lack of RPC callbacks, wouldn't step 1/2 also need to be ensuring 
that _outbound_ RPCs are all marked failed/aborted/whatever?


Line 31:   generic subsystems.
does this have any issues with hitting the webserver after things like 
TSTabletManager are shut down, etc?


PS2, Line 36: fixing KUDU-1863 in
: the process
isn't KUDU-1863 still an issue because the SyncWrite might be blocked waiting 
for an operation to be committed even when all peers are down? Or do we 
properly abort it?


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/kserver/kserver.cc
File src/kudu/kserver/kserver.cc:

PS2, Line 87: RPC callbacks
can you be a bit more explicit about inbound vs outbound? i.e. there might be
RPC callbacks executing (due to the response of an outbound RPC) and there
might be inbound RPCs executing (on the RPC worker threads, or pending but not
occupying an explicit RPC thread, e.g. waiting somewhere in consensus).

Messenger::Shutdown() is enough to ensure that the RPC workers shut down and 
are joined, perhaps, but what about the case where the RPC worker has deferred 
an RpcContext off to some other thread? What will happen then when that other 
thread calls RpcContext::Respond? Is that OK? (I suppose it would be fine to 
just black-hole the response?)
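
A black-holed Respond could look something like this (purely hypothetical;
elsewhere in this review it's noted that RpcContext::Respond would currently
crash after Messenger::Shutdown, so this is not how Kudu works today):

    #include <atomic>
    #include <memory>

    // Purely hypothetical sketch of "black-holing" a deferred response.
    struct RpcContextSketch {
      std::shared_ptr<std::atomic<bool>> messenger_closing;

      void Respond() {
        // If the messenger has shut down, silently drop the response instead
        // of touching destroyed reactor state.
        if (messenger_closing && messenger_closing->load()) return;
        // ... otherwise serialize and send the response as usual ...
      }
    };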


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/rpc/messenger.cc
File src/kudu/rpc/messenger.cc:

Line 387:   // Release the map outside of the lock.
technically I don't think the extra scope is necessary: since 'guard' is defined
after 'to_release', it will be destructed first, right?

that said, maybe there's illustrative value in the extra scope, so if you think 
it's nice, feel free to leave it.

Another option might be to use unique_lock and then add an explicit .unlock()


Line 400:   // Release the service outside of the lock.
same


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/server/server_base.cc
File src/kudu/server/server_base.cc:

PS2, Line 475:   // First, shut down incoming connections and wait for any outstanding RPCs to
 :   // finish processing.
similar to comment elsewhere, not sure this actually finishes all RPCs since 
they may be deferring responses to other worker pools


-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-20 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..


Patch Set 2:

(3 comments)

Is it possible to add a small test to make sure that KUDU-1863 is indeed fixed?

After looking at KUDU-1863 for some time, I think it should not be hard to
create a test which would reproduce it. Using the 'log_inject_latency_xxx'
flags for the two master followers would help, while keeping the leader master
free from artificially injected latency and calling Master::Shutdown() just
after creating a new table, or something like that.

http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/rpc/messenger.cc
File src/kudu/rpc/messenger.cc:

PS2, Line 302: ThreadRestrictions::AssertWaitAllowed();
Should this be within the ShutdownInternal() instead?  I.e., add 
AssertWaitAllowed() for the SYNC mode and keep the current mode for the ASYNC 
mode.
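
In sketch form, the suggested split (ShutdownInternal and the SYNC/ASYNC
modes are named in this review; the bodies here are assumed):

    enum class ShutdownMode { SYNC, ASYNC };

    struct ThreadRestrictions {
      // Stand-in; the real method asserts that blocking is allowed here.
      static void AssertWaitAllowed() {}
    };

    class MessengerShutdownSketch {
     public:
      void Shutdown() { ShutdownInternal(ShutdownMode::SYNC); }

     private:
      void ShutdownInternal(ShutdownMode mode) {
        if (mode == ShutdownMode::SYNC) {
          // Only the synchronous mode may block, so only it asserts.
          ThreadRestrictions::AssertWaitAllowed();
        }
        // ... detach state; in SYNC mode, join the reactor threads ...
      }
    };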


PS2, Line 324: services_to_release = std::move(rpc_services_);
 : pools_to_shutdown = std::move(acceptor_pools_);
Why std::move() instead of swap?  I read David's comment on PS1 and your
response, but why not use swap() here to make this code look less suspicious
to the reader?


http://gerrit.cloudera.org:8080/#/c/7183/2/src/kudu/server/server_base.h
File src/kudu/server/server_base.h:

PS2, Line 129: rejected
Could you be more specific about what 'rejected' means?  I.e., what status
would the client get?

I guess it's Status::ServiceUnavailable, right?


-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1863: improve overall safety of graceful server shutdown

2017-06-20 Thread Adar Dembo (Code Review)
Hello David Ribeiro Alves, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7183

to look at the new patch set (#2).

Change subject: KUDU-1863: improve overall safety of graceful server shutdown
..

KUDU-1863: improve overall safety of graceful server shutdown

What exactly is a "safe" graceful shutdown? I define it like this:
1. Stop accepting new RPCs.
2. Wait for any existing incoming RPCs to finish execution.
3. Shut down server-specific subsystems.
4. Shut down generic subsystems.

By ordering steps 3 and 4 after steps 1 and 2, we can guarantee that there
are no reactor threads running RPC callbacks within the subsystems that we
may be trying to shut down. Unfortunately, adhering to this order is very
difficult today. Step 2 in particular is thorny; we could implement it via
synchronous Messenger::Shutdown, but even that tears down just enough state
that active subsystems can run into trouble.

So for now, we'll settle for a modified form of the above:
1. Stop accepting new RPCs.
2. Shut down server-specific subsystems.
3. Shut down generic subsystems, including waiting for any existing incoming
   RPCs to finish execution.

This patch implements this modified form of a safe shutdown:
- During shutdown, servers disable the acceptance of new RPCs first.
- Servers now shut down master and tserver-specific subsystems before
  generic subsystems.
- Shutting down generic subsystems now explicitly waits for outstanding RPC
  callbacks to finish, via a new synchronous Messenger::Shutdown.

The bulk of the patch is to Messenger::Shutdown and friends. While there, I
also reduced lock acquisition to just critical sections (fixing KUDU-1863 in
the process), extended the life of tls_context_ to Messenger destruction,
and changed Shutdown to be callable with registered services.

Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
---
M src/kudu/kserver/kserver.cc
M src/kudu/master/master.cc
M src/kudu/rpc/messenger.cc
M src/kudu/rpc/messenger.h
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/server/rpc_server.cc
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
M src/kudu/tserver/tablet_server.cc
10 files changed, 166 insertions(+), 60 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/83/7183/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7183
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibbda29dd471e4dc1a8cb4f472cc456ccdb1e48cc
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon