On Tue, 12 Oct 2010 13:58:43 -0400 Jeffrey Hutzelman <[email protected]> wrote:
> However, for the "human admins want to see what's going on" problem,
> perhaps an RPC interface is better.  It should be a separate Rx
> service (though probably on the same port), and have at least one
> dedicated thread.  And for introspection, it may want to completely
> ignore locks and risk giving out bogus data rather than risking
> deadlock.

Well, that's the most reliable way to do this, sure. I'm just not sure
how much work we really want to expend on this / how many cases to
cover. Practically speaking, I think at least recently any deadlocks (or
just "takes a long time shutting down") on shutdown would be involving
VOL_LOCK or H_LOCK, so just avoiding those would be fine for almost all
people.

Maybe I'm just being lazy, though, and what you describe is the right
way it should be done. Personally, I'm looking more at fixing the causes
of such slow shutdowns, at least in the short term.

> You could have the fileserver send periodic signals to its parent
> while shutting down.  Or, provide for an environment variable
> containing the number of a file descriptor over which periodic
> heartbeats should be sent.

My worry about this kind of approach is when there's some kind of bug
that causes the "shutdown the volumes" loop to continue, but we don't
actually make any kind of progress. That is, we keep trying to offline
the same volume over and over again or something. I'd rather have
something that keeps a count of offlined volumes and total number of
volumes (or something like that), so invalid counts can be aborted
immediately. Of course, see above on dealing with unlikely corner cases
that aren't actually a problem...

What you describe sounds easy, but it's going to (potentially) screw up
launching the fileserver process outside of bosserver, which I like to
be able to do (easier to attach a debugger on startup that way).

> > Or, as I've mentioned before, if the timeout code is just added to
> > the fileserver itself, this isn't a problem.
>
> No; the idea is to KILL KILL KILL the fileserver (or any other server)
> if it doesn't shut down in a reasonable time.  That has to be done
> outside; a process that is hung isn't going to kill itself.

If the first thing we do on shutdown is spawn a thread that just
abort()s after N seconds, or N seconds after not receiving a signal or
whatever, it's hard for me to see what can go wrong with it. Of course
memory corruption and other "anything-can-happen" scenarios could screw
it up, but really...

I'm not saying to also remove an external timeout in bosserver, though.
Just that the fileserver itself could have a much finer-grained timeout
(adjusting for # of volumes, or the last internal heartbeat, etc) with
bosserver having a larger unconditional one.

-- 
Andrew Deason
[email protected]

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
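
For illustration only, here is a minimal sketch of the check such an
internal shutdown watchdog might run on each wakeup: abort on an invalid
volume count, on no progress for a while, or on exceeding a deadline
scaled by the number of volumes. Everything in it is an assumption made
for the example: the struct, the function names, and the 30-second and
2-seconds-per-volume figures are hypothetical and are not taken from the
OpenAFS source. Thread creation and synchronization of the shared
counters are deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical progress record; these names do not exist in OpenAFS. */
struct shutdown_progress {
    int total_volumes;     /* volumes attached when shutdown began */
    int offlined_volumes;  /* volumes taken offline so far */
    time_t started;        /* when shutdown began */
    time_t last_progress;  /* last time offlined_volumes increased */
};

enum shutdown_verdict {
    SHUTDOWN_OK = 0,
    SHUTDOWN_BAD_COUNT,    /* count is impossible: likely a looping bug */
    SHUTDOWN_STALLED,      /* no volume went offline for too long */
    SHUTDOWN_TOO_SLOW      /* overall deadline exceeded */
};

/* Called by the shutdown loop each time it takes a volume offline. */
void
shutdown_note_offline(struct shutdown_progress *p, time_t now)
{
    assert(p != NULL);
    p->offlined_volumes++;
    p->last_progress = now;
}

/*
 * Decide whether shutdown still looks healthy at time 'now'.
 * stall_secs:      longest tolerated gap between two offlined volumes
 * secs_per_volume: per-volume allowance added to the overall deadline
 */
enum shutdown_verdict
shutdown_check(const struct shutdown_progress *p, time_t now,
               int stall_secs, int secs_per_volume)
{
    time_t deadline;

    /* Offlining the same volume repeatedly pushes the count past the total. */
    if (p->offlined_volumes < 0 || p->offlined_volumes > p->total_volumes)
        return SHUTDOWN_BAD_COUNT;

    if (now - p->last_progress > stall_secs)
        return SHUTDOWN_STALLED;

    deadline = (time_t)stall_secs + (time_t)secs_per_volume * p->total_volumes;
    if (now - p->started > deadline)
        return SHUTDOWN_TOO_SLOW;

    return SHUTDOWN_OK;
}

/* What a dedicated watchdog thread might do once per wakeup. */
void
shutdown_watchdog_tick(const struct shutdown_progress *p)
{
    /* 30 and 2 are made-up tuning values for this sketch. */
    if (shutdown_check(p, time(NULL), 30, 2) != SHUTDOWN_OK)
        abort();
}
```

The decision is kept separate from the abort() call so it can be
exercised without killing the process. An external, unconditional
timeout in bosserver would still be needed for the case where the
watchdog itself never gets to run.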
