Let me introduce to you:

        Rasto Levrinc.

LINBIT hired another developer, Rasto, whose first task was to
implement the heartbeat plugin for resource-level fencing of DRBD
resources.

Find the initial (working) implementation in the drbd svn, in the
tools subdir. Checkout:
 svn co http://svn.drbd.org/drbd/trunk/ drbd-latest

Our intention is that this eventually becomes part of the
heartbeat CVS / release as soon as deemed appropriate.  (Alan?)

Please review and comment.

=====

What does it do, and why does it do it this way?

How DRBD behaves when we think that "someone" should do fencing:

for example:

 we are Connected Primary/Secondary
 we lose our replication channel

 "fencing = dont-care" --> we just keep going,
        (we are the primary after all!)
        This is basically how drbd 0.7 behaves.
        This risks diverging data sets, e.g. in case of split brain.
        (NOTE: since DRBD is a "shared-nothing shared disk", we do
        _NOT_ risk file system corruption; whether diverging data
        sets are better or worse is another story)

 "fencing = resource" --> we invoke the "outdate-peer-handler",
        which up to now had been a hackish script using ssh,
        but can now be configured to use the new drbd-outdate-peer
        heartbeat plugin.

        heartbeat should not have stonith configured here, or we
        risk that in the event of total communication loss -->
        stonith, the other node might win, and any transactions we
        acknowledged between "connection loss" and "being stonithed"
        would then be gone.

        This uses the heartbeat communication links, but
        completely bypasses the crm or any heartbeat authority.
        This is on purpose. It is only one of several possible
        implementations of the concept.
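
        For concreteness, this is roughly how I would expect such a
        setup to look in drbd.conf -- a sketch only; the exact option
        keyword and plugin install path are my assumptions and may
        differ from what the code actually ships:

        ```
        resource r0 {
          disk {
            # sketch: resource-level fencing as described above;
            # the shipped keyword may be spelled differently
            fencing resource;
          }
          handlers {
            # call the new heartbeat plugin instead of the old
            # hackish ssh script; the path here is hypothetical
            outdate-peer "/usr/lib/heartbeat/drbd-outdate-peer";
          }
        }
        ```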

 "fencing = resource-and-stonith"
        We expect heartbeat to have stonith configured, so we will
        freeze all io immediately, invoke the outdate-peer-handler,
        and will only unfreeze io when this handler returns success
        -- or some higher authority explicitly unfreezes us.
        The handler should attempt (or trigger) resource level
        fencing first (mark peer as "outdated"), and fall back to
        stonith if resource level fencing did not work out
        (peer unreachable).

        This could also be called the "oracle mode",
        though Oracle people probably want write-quorum >= 2
        (which will be implemented someday, too ....)

    !!we need some help here!!
        In this case the drbd outdate-peer-handler would need to
        communicate with the crm in the fallback case
        (peer unreachable, resource-fencing not possible),
        if only to ask whether the other node got stonithed,
        or to wait for the stonith operation to take place and complete,
        or even to trigger such a stonith operation, then wait for
        it to complete.
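
        To make the intended fallback concrete, here is a rough shell
        sketch of the handler logic described above. Everything in it
        -- the helper names, how we would ask the crm, the
        success/failure convention -- is an assumption for
        illustration, not the real plugin interface:

        ```shell
        #!/bin/sh
        # Sketch of the outdate-peer handler with stonith fallback.
        # All helper names and conventions below are illustrative
        # placeholders, not the actual plugin interface.

        # Placeholders for the real mechanisms:
        peer_reachable()                { return 1; }  # peer reachable at all?
        mark_peer_outdated()            { return 0; }  # resource-level fencing
        wait_for_stonith_confirmation() { return 0; }  # ask/wait via the crm

        try_outdate_peer() {
            # Resource-level fencing is only possible while the peer is
            # reachable (the real plugin talks over the heartbeat
            # communication links).
            peer_reachable || return 1
            mark_peer_outdated
        }

        handler() {
            if try_outdate_peer; then
                return 0    # peer marked outdated; io may be unfrozen
            fi
            # Fallback: peer unreachable. Ask the crm whether the peer
            # got stonithed (or trigger stonith), and wait for it to
            # complete.
            if wait_for_stonith_confirmation; then
                return 0    # peer is known dead; safe to unfreeze io
            fi
            return 1        # could not fence the peer; io stays frozen
        }
        ```

        The point is only the ordering: resource-level fencing first,
        stonith as fallback, and io stays frozen unless one of the two
        succeeds.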

 Does this make sense so far?

Some more notes:

 STONITH should, if configured, always be implemented as "switch off",
 not as "reset", to avoid them stonithing each other.
 (assume two-node cluster...
  there is some problem with quorum-based decisions here)

 If you configure drbd fencing=resource, but have stonith configured in
 heartbeat, that is a configuration error.

 If you configure drbd fencing=resource-and-stonith, but have no
 stonith configured in heartbeat, that will freeze io unnecessarily.

 If you have fencing = resource, and no stonith configured, we need not
 freeze io and can still avoid diverging data sets even during total
 communication loss: a secondary that has any doubt about the peer's disk
 state will refuse to become primary, whereas a primary that does not
 know about its peer's disk state will continue to be primary.
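
 The promotion rule from that paragraph, restated as a sketch (the
 state names and the helper itself are illustrative, this is not
 drbd code):

 ```shell
 #!/bin/sh
 # Illustrative restatement of the promotion rule above.
 may_become_primary() {
     role=$1      # our current role: primary or secondary
     peer_disk=$2 # what we know about the peer's disk state
     case $role in
     primary)
         # an established primary keeps going, even without peer info
         return 0 ;;
     secondary)
         # any doubt about the peer's disk -> refuse promotion
         case $peer_disk in
         outdated|inconsistent) return 0 ;;
         *)                     return 1 ;;
         esac ;;
     *)  return 1 ;;
     esac
 }
 ```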

 If after a cluster crash the cluster should come up without
 communication, one cannot promote drbd to primary until communication
 is restored or "some authority" explicitly assures one of the nodes
 that the other node has been fenced.

 If one node knows that the peer's disk is "bad" (has been marked "outdated",
 is "inconsistent"), this is stored in meta data, so that a degraded
 cluster may crash/reboot and become primary anyway.

 Obviously we do not store "peer's disk is good", that would be stupid.

Comments, Please...

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
