On Mon, Jun 15, 2015 at 7:08 AM Tim Bain <tb...@alumni.duke.edu> wrote:
>
> It seems pretty clear that the assumption that acquiring a single file
> lock without doing any further checks will provide thread-safety in all
> cases is not an accurate one.
>
> As I see it, here are the failures of the existing approach:
>
>    - Failure #1 is that when an NFS failure occurs, the master broker
>    never recognizes that an NFS failure has occurred and never
>    recognizes that the slave broker has replaced it as the master.  The
>    broker needs to identify those things even when it has no other
>    reason to write to NFS.
>
>    - Failure #2 is that the slave broker believes that it can
>    immediately become the master.  This wouldn't be a problem if the
>    master broker instantaneously recognized that it has been supplanted
>    and immediately ceded control, but assuming anything happens
>    instantaneously (especially across multiple machines) is pretty
>    unrealistic.  This means there will be a period of unresponsiveness
>    when a failover occurs.
>
>    - Failure #3 is that once the master recognizes that it no longer is
>    the master, it needs to abort all pending writes (in a way that
>    guarantees that the data files will not be corrupted if NFS returns
>    when some have been cancelled and others have not).

I don't know if this has been called out already, but it will be
important for users to coordinate how their NFS is configured.  For
example, activating asynchronous writes on the NFS side would
probably make a mess of any assumptions we can make from the client
side.  I also wonder how buffering might affect how a heartbeat
would have to work, i.e. whether mtime gets propagated only once
enough data has been written to cause a flush to disk.
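
If the heartbeat ends up being a content write rather than an mtime
check, we could at least force our own buffers out on every beat so
that client-side buffering doesn't delay the write.  A minimal sketch
with plain NIO (the class and names below are mine, not anything in
the code base):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical heartbeat writer: force(true) asks the local NFS client
    // to flush both data and metadata, so our own buffering at least
    // doesn't delay the write becoming visible to the other brokers.
    final class HeartbeatWriter {
        private final FileChannel channel;

        HeartbeatWriter(Path lockFile) throws IOException {
            this.channel = FileChannel.open(lockFile,
                    StandardOpenOption.WRITE, StandardOpenOption.CREATE);
        }

        void beat(String payload) throws IOException {
            ByteBuffer buf = ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8));
            channel.position(0);
            channel.write(buf);
            channel.truncate(buf.limit());  // drop any stale tail from a longer previous payload
            channel.force(true);            // push data and metadata through the NFS client
        }
    }

That only covers our side of the buffering, of course; the readers'
attribute caching is a separate NFS-tuning question.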

> I've got a proposed solution that I think will address all of these
> failures, but hopefully others will see ways to improve it. Here's
> what I've got in mind:
>
>    1. It is no longer sufficient to hold an NFS lock on the DB lock
>    file.  In order to maintain master status, you must successfully
>    write to the DB lock file within some time period.  If you fail to
>    do so within that time period, you must close all use of the DB
>    files and relinquish master status to another broker.
>
>    2. When the master shuts down during normal NFS circumstances, it
>    will delete the DB lock file.  If at any point a slave broker sees
>    that there is no DB lock file or that the DB lock file is so stale
>    that the master must have shut down (more on that later), it may
>    immediately attempt to create one and begin writing to it.  If that
>    write succeeds, it is the master.

Random curiosity: is this a network lock manager (NLM) based lock?
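
Either way, here's roughly how I picture the renew/relinquish rules in
#1 and #2.  Just a sketch with made-up names and a made-up window, not
a patch against the existing Locker code:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.UUID;

    // Sketch of the lease rules in #1/#2.  The names and the window are
    // made up for illustration; this is not the existing Locker API.
    final class DbLockLease {
        private final Path lockFile;
        private final long renewWindowMillis;  // must write within this window to stay master
        private long lastRenewed = System.currentTimeMillis();

        DbLockLease(Path lockFile, long renewWindowMillis) {
            this.lockFile = lockFile;
            this.renewWindowMillis = renewWindowMillis;
        }

        /** #1: write a fresh payload; if the window closes without a successful write, give up. */
        boolean renewOrRelinquish() {
            String payload = UUID.randomUUID().toString();  // see #3 for the fuller payload
            try {
                Files.write(lockFile, payload.getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.SYNC);
                lastRenewed = System.currentTimeMillis();
                return true;                                // still master
            } catch (IOException writeFailed) {
                // NFS may just be hiccuping; we only stay master while the window is open.
                return System.currentTimeMillis() - lastRenewed < renewWindowMillis;
            }
        }

        /** #2: on clean shutdown, delete the lock file so a slave can take over immediately. */
        void release() throws IOException {
            Files.deleteIfExists(lockFile);
        }
    }

A write that simply hangs would never trip the IOException branch,
which I think is exactly why #5's read-back in a separate thread with
a timeout is needed.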

>    3. All brokers should determine whether the current master is still
>    alive by checking the current content of the DB lock file against
>    the content read the last time you checked, rather than simply
>    locking the file and assuming that tells you who's got ownership.
>    This means that the current master needs to update some content in
>    the DB lock file to make it unique on each write; I propose that the
>    content of the file be the broker's host, the broker's PID, the
>    current local time on the broker at the time it did its write, and a
>    UUID that will guarantee uniqueness of the content from write to
>    write even in the face of time issues.  Note that only the UUID is
>    actually required for this algorithm to work, but I think that
>    having the other information will make it easier to troubleshoot.

That sounds good.
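
Agreed that the extra fields are cheap insurance.  For the sake of
discussion, the payload I'd picture writing each time would look
something like this (illustrative only; the exact layout doesn't
matter as long as the UUID is there):

    import java.lang.management.ManagementFactory;
    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.UUID;

    // Illustrative payload for #3: host, PID, local time, and a UUID.
    // Only the UUID matters to the algorithm; the rest is there for a
    // human trying to work out what happened after the fact.
    final class LockPayload {
        static String next() {
            String host;
            try {
                host = InetAddress.getLocalHost().getHostName();
            } catch (UnknownHostException e) {
                host = "unknown-host";
            }
            // RuntimeMXBean.getName() is typically "pid@hostname"; good enough for eyeballing.
            String pid = ManagementFactory.getRuntimeMXBean().getName();
            return host + '|' + pid + '|' + System.currentTimeMillis() + '|' + UUID.randomUUID();
        }
    }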

>    4. Because time can drift between machines, it is not sufficient to
>    compare the write date on the DB lock file with your host's current
>    time when determining that a file is stale; you must successfully
>    read the file repeatedly over a time period and receive the same
>    value each time in order to decide that the DB lock file is stale.
>
>    5. The master should use the same approach as the slaves to
>    determine if it's still in control, by checking for changes to the
>    content of the DB lock file.  This means the master needs to
>    positively confirm that each periodic write to the DB lock file
>    succeeded by reading it back (in a separate thread, using a timeout
>    on the read operation to identify situations where NFS doesn't
>    respond), rather than simply assuming that its call to write()
>    worked successfully.
>
>    6. When a slave determines that the master has failed to write to
>    the DB lock file for longer than the timeout, it attempts to acquire
>    the write lock on the DB lock file and write to it to become the
>    master.  If it succeeds (because the master has lost NFS
>    connectivity but the slave has not), it is the master.  If it fails
>    because another slave acquired the lock first or because the slave
>    has lost NFS connectivity, it goes back to monitoring for the master
>    to fail to write to the DB lock file.


Would the master re-read the data file just before the next write as
a guard against some other process somehow snagging the lock out from
under it?
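
That kind of guard re-read is what I'd expect too; the master and the
slaves basically end up running the same "has the content changed?"
loop and just react differently.  Here's a rough sketch of the slave
side of #4 and #6 (hypothetical names, made-up intervals, and none of
the error handling a real locker would need):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Slave-side sketch of #4/#6: only declare the master dead after the
    // lock file content has stayed identical across the whole staleness
    // window, then try to grab the write lock and stamp our own payload.
    final class SlaveMonitor {
        private final Path lockFile;
        private final long staleAfterMillis;
        private final long pollMillis;
        private FileChannel masterChannel;  // kept open (and locked) if takeover succeeds

        SlaveMonitor(Path lockFile, long staleAfterMillis, long pollMillis) {
            this.lockFile = lockFile;
            this.staleAfterMillis = staleAfterMillis;
            this.pollMillis = pollMillis;
        }

        /** #4: the file is stale only if its content never changes over the whole window. */
        boolean masterLooksDead() throws InterruptedException {
            String first = readOrNull();
            if (first == null) {
                return true;                   // no lock file at all: clean shutdown per #2
            }
            long deadline = System.currentTimeMillis() + staleAfterMillis;
            while (System.currentTimeMillis() < deadline) {
                Thread.sleep(pollMillis);
                String current = readOrNull();
                if (current == null) {
                    return true;               // lock file deleted: master shut down cleanly
                }
                if (!current.equals(first)) {
                    return false;              // content changed, so the master is alive
                }
            }
            return true;                       // unchanged for the whole window: stale
        }

        /** #6: try to take the write lock and write our payload; false means keep monitoring. */
        boolean tryTakeOver(String ourPayload) {
            try {
                FileChannel ch = FileChannel.open(lockFile,
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                FileLock lock = ch.tryLock();
                if (lock == null) {            // another slave beat us to it
                    ch.close();
                    return false;
                }
                ch.truncate(0);
                ch.write(ByteBuffer.wrap(ourPayload.getBytes(StandardCharsets.UTF_8)));
                ch.force(true);
                masterChannel = ch;            // keep the channel open so the lock stays held
                return true;
            } catch (IOException e) {
                return false;                  // e.g. we lost NFS ourselves; back to monitoring
            }
        }

        private String readOrNull() {
            try {
                return Files.exists(lockFile)
                        ? new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8)
                        : null;
            } catch (IOException e) {
                return null;  // a real locker would distinguish "we lost NFS" from "file is gone"
            }
        }
    }

The caller would loop on masterLooksDead() and, when it returns true,
call tryTakeOver() with a payload from #3; when tryTakeOver() returns
false, it goes back to watching, per #6.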

Jim
