On Mon, Jun 15, 2015 at 7:08 AM Tim Bain <tb...@alumni.duke.edu> wrote:
>
> It seems pretty clear that the assumption that acquiring a single file lock
> without doing any further checks will provide thread-safety in all cases is
> not an accurate one.
>
> As I see it, here are the failures of the existing approach:
>
> - Failure #1 is that when an NFS failure occurs, the master broker never
> recognizes that an NFS failure has occurred and never recognizes that the
> slave broker has replaced it as the master. The broker needs to identify
> those things even when it has no other reason to write to NFS.
>
> - Failure #2 is that the slave broker believes that it can immediately
> become the master. This wouldn't be a problem if the master broker
> instantaneously recognized that it has been supplanted and immediately
> ceded control, but assuming anything happens instantaneously (especially
> across multiple machines) is pretty unrealistic. This means there will be
> a period of unresponsiveness when a failover occurs.
>
> - Failure #3 is that once the master recognizes that it no longer is the
> master, it needs to abort all pending writes (in a way that guarantees that
> the data files will not be corrupted if NFS returns when some have been
> cancelled and others have not).
I don't know if this has been called out already, but it will be important
for users to coordinate how their NFS is configured. For example, activating
asynchronous writes on the NFS side would probably make a mess of any
assumptions we can make from the client side. I also wonder how buffering
might affect how a heartbeat would have to work, e.g. whether mtime gets
propagated only once enough data has been written to cause a flush to disk.

> I've got a proposed solution that I think will address all of these
> failures, but hopefully others will see ways to improve it. Here's what
> I've got in mind:
>
> 1. It is no longer sufficient to hold an NFS lock on the DB lock file.
> In order to maintain master status, you must successfully write to the DB
> lock file within some time period. If you fail to do so within that time
> period, you must close all use of the DB files and relinquish master status
> to another broker.
>
> 2. When the master shuts down during normal NFS circumstances, it will
> delete the DB lock file. If at any point a slave broker sees that there is
> no DB lock file or that the DB lock file is so stale that the master must
> have shut down (more on that later), it may immediately attempt to create
> one and begin writing to it. If that write succeeds, it is the master.

Random curiosity: is this a network lock manager (NLM) based lock?

> 3. All brokers should determine whether the current master is still
> alive by checking the current content of the DB lock file against the
> content read the last time you checked, rather than simply locking the file
> and assuming that tells you who's got ownership. This means that the
> current master needs to update some content in the DB lock file to make it
> unique on each write; I propose that the content of the file be the
> broker's host, the broker's PID, the current local time on the broker at
> the time it did its write, and a UUID that will guarantee uniqueness of the
> content from write to write even in the face of time issues. Note that
> only the UUID is actually required for this algorithm to work, but I think
> that having the other information will make it easier to troubleshoot.

That sounds good.

> 4. Because time can drift between machines, it is not sufficient to
> compare the write date on the DB lock file with your host's current time
> when determining that a file is stale; you must successfully read the file
> repeatedly over a time period and receive the same value each time in order
> to decide that the DB lock file is stale.
>
> 5. The master should use the same approach as the slaves to determine if
> it's still in control, by checking for changes to the content of the DB
> lock file. This means the master needs to positively confirm that each
> periodic write to the DB lock file succeeded by reading it back (in a
> separate thread, using a timeout on the read operation to identify
> situations where NFS doesn't respond), rather than simply assuming that its
> call to write() worked successfully.
>
> 6. When a slave determines that the master has failed to write to the DB
> lock file for longer than the timeout, it attempts to acquire the write
> lock on the DB lock file and write to it to become the master. If it
> succeeds (because the master has lost NFS connectivity but the slave has
> not), it is the master. If it fails because another slave acquired the
> lock first or because the slave has lost NFS connectivity, it goes back to
> monitoring for the master to fail to write to the DB lock file.
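To make steps 3 and 5 concrete, here's a rough Java sketch of what the master's heartbeat might look like. All the names (DbLockHeartbeat, beat, etc.) are just illustrative, not anything in the codebase, and a real version would need the timeout thread you describe:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Sketch of steps 3 and 5: on every heartbeat the master writes a token
// that is unique per write (host, PID, local time, UUID) to the DB lock
// file, then reads it back to confirm the write actually reached NFS.
public class DbLockHeartbeat {
    private final Path lockFile;
    private final String host;

    public DbLockHeartbeat(Path lockFile, String host) {
        this.lockFile = lockFile;
        this.host = host;
    }

    /** host:pid:localtime:uuid -- only the UUID is strictly required. */
    private String newToken() {
        return host + ":" + ProcessHandle.current().pid() + ":"
                + System.currentTimeMillis() + ":" + UUID.randomUUID();
    }

    /**
     * Write a fresh token and read it back. Returns true only if the
     * read-back matches, meaning this broker is still the master. A real
     * implementation would do the read in a separate thread with a timeout
     * so an unresponsive NFS mount can't hang the heartbeat (step 5).
     */
    public boolean beat() {
        String token = newToken();
        try {
            Files.writeString(lockFile, token, StandardCharsets.UTF_8);
            return token.equals(Files.readString(lockFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            return false; // treat any I/O failure as loss of mastership
        }
    }
}
```

The read-back is the piece that catches the "write() returned but NFS silently dropped it" case; whether it actually does depends on the mount options, which loops back to my point about async writes above.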
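And the slave side of step 4 could be sketched along these lines (again, names and parameters are made up for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of step 4: a slave declares the lock file stale only after reading
// the SAME content several times in a row, spaced out over the heartbeat
// interval, so clock drift between machines can't cause a false takeover.
public class StaleLockDetector {
    private final Path lockFile;
    private final int requiredIdenticalReads;
    private final long readIntervalMillis;

    public StaleLockDetector(Path lockFile, int requiredIdenticalReads,
                             long readIntervalMillis) {
        this.lockFile = lockFile;
        this.requiredIdenticalReads = requiredIdenticalReads;
        this.readIntervalMillis = readIntervalMillis;
    }

    /** True if the master appears gone: no lock file, or unchanged content. */
    public boolean isStale() throws InterruptedException {
        String previous = null;
        for (int i = 0; i < requiredIdenticalReads; i++) {
            String current;
            try {
                current = Files.readString(lockFile, StandardCharsets.UTF_8);
            } catch (IOException e) {
                return true; // per step 2: no lock file means the master shut down
            }
            if (previous != null && !previous.equals(current)) {
                return false; // content changed: the master is still heartbeating
            }
            previous = current;
            if (i < requiredIdenticalReads - 1) {
                Thread.sleep(readIntervalMillis);
            }
        }
        return true; // identical on every read: master has stopped writing
    }
}
```

Note this only compares content, never timestamps, which is the whole point of step 4; the read interval would have to be tuned against the master's heartbeat period and the NFS attribute-cache settings.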
Would the master re-read the DB lock file again just before its next write, as a guard against some other process somehow snagging the lock out from under it?

Jim