----- Original Message -----
> From: "Lars Ellenberg" <lars.ellenb...@linbit.com>
> To: linux-ha@lists.linux-ha.org
> Cc: "David Vossel" <dvos...@redhat.com>, "Fabio M. Di Nitto" 
> <fdini...@redhat.com>, "Andrew Beekhof"
> <and...@beekhof.net>, "Lars Marowsky-Bree" <l...@suse.com>, "Lon Hohberger" 
> <l...@redhat.com>, "Jonathan Brassow"
> <jbras...@redhat.com>, "Dejan Muhamedagic" <deja...@fastmail.fm>
> Sent: Tuesday, May 14, 2013 6:22:08 AM
> Subject: LVM Resource agent, "exclusive" activation
> 
> 
> This is about pull request
> https://github.com/ClusterLabs/resource-agents/pull/222
> "Merge redhat lvm.sh feature set into heartbeat LVM agent"
> 
> Apologies to the CC for list duplicates.  Cc list was made by looking at
> the comments in the pull request, and some previous off-list thread.
> 
> Even though this is about resource agent feature development,
> and thus actually a topic for the -dev list,
> I wanted to give this the maybe wider audience of the users list,
> to encourage feedback from people who actually *use* this feature
> with rgmanager, or intend to use it once it is in the pacemaker RA.
> 
> 
> 
> Here is my perception of this pull request; as such it is very
> subjective, and I may have gotten some intentions or facts wrong,
> so please correct me, or add whatever I may have missed.
> 
> 
> Apart from a larger restructuring of the code, this introduces the
> feature of "exclusive activation" of LVM volume groups.
> 
> From the commit message:
> 
>       This patch leaves the original LVM heartbeat functionality
>       intact while adding these additional features from the redhat agent.
> 
>       1. Exclusive activation using volume group tags. This feature
>       allows a volume group to live on shared storage within the cluster
>       without requiring the use of cLVM for metadata locking.
> 
>       2. individual logical volume activation for local and cluster
>       volume groups by using the new 'lvname' option.
> 
>       3. Better setup validation when the 'exclusive' option is enabled.
>       This patch validates that when exclusive activation is enabled, either
>       a cluster volume group is in use with cLVM, or the tags variant is
>       configured correctly. These new checks also make it impossible to
>       enable exclusive activation for cloned resources.
> 
> 
> That sounds great. Why even discuss it, of course we want that.
> 
> But I feel it does not do what it advertises.
> Rather I think it gives a false sense of "exclusivity"
> that is actually not met.
> 
> (point 2., individual LV activation is ok with me, I think;
>  my difficulties are with the "exclusive by tagging" thingy)
> 
> So, what does it do?
> 
> To activate a VG "exclusively", it uses "LVM tags" (see the LVM
> documentation about these).
> 
> Any VG or LV can be tagged with a number of tags.
> Here, only one tag is used (and any other tags will be stripped!).
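> 
> For illustration, the tag handling boils down to plain LVM commands
> like these (VG and tag names are only examples):
> 
>     vgchange --addtag node-a vg_example     # add a tag (the agent uses the node name)
>     vgs --noheadings -o vg_tags vg_example  # list the tags on a VG
>     vgchange --deltag node-a vg_example     # remove a tag again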
> 
> I try to contrast current behaviour and "exclusive" behaviour:
> 
> start:
>     non-exclusive:
>         just (try to) activate the VG
>     exclusive by tag:
>         check if the VG is currently tagged with my node name
>         if not, is it tagged at all?
>             if tagged, and that tag happens to be the name of a node
>             in the current corosync membership:
>                 FAIL activation
>             else (it is tagged, but not with the name of a node
>             currently in the membership):
>                 strip any and all tags, then proceed
>         if not FAILed because already tagged by another member,
>         re-tag with *my* node name,
>         activate it.
> 
> Also, it double checks the "ownership" in
> monitor:
>     non-exclusive:
>         I think due to the high timeout potential of any LVM command
>         under load, this nowadays just checks for the presence of the
>         /dev/$VGNAME directory, which is lightweight, and usually good
>         enough (as the services *using* the LVs are monitored anyways).
>     exclusive by tag:
>         it does the above, then, if active, double checks that the
>         current node name is also the current tag value, and if not,
>         (tries to) deactivate (which will usually fail, as that can
>         only succeed if the VG is unused), and returns failure to
>         Pacemaker, which will then do its recovery cycle.
> 
>         By default, Pacemaker would stop all dependent resources,
>         stop this one, and restart the whole stack.
> 
>         Which will, in a real split-brain situation, just make sure
>         that the nodes keep stealing it from each other; it does not
>         prevent corruption in any way.
> 
>         In a non-split-brain case, this situation "can not happen"
>         anyways.  Unless two nodes raced to activate it while it was
>         untagged.  Oops, so it does not prevent that either.
> 
> For completeness, on
> stop:
>     non-exclusive:
>         just deactivate the VG
>     exclusive by tag:
>         double check that I am the tag "owner",
>         then strip that tag (so no tag remains, the VG becomes untagged),
>         and deactivate.
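> 
> To make the above concrete, here is a rough shell sketch of the tagged
> variant (not the actual agent code; $VG, $NODE and the
> is_cluster_member helper are placeholders):
> 
>     VG=myvg                 # example VG name
>     NODE=$(uname -n)        # the agent uses the local node name as tag
> 
>     vg_tag() { vgs --noheadings -o vg_tags "$VG" | tr -d ' '; }
> 
>     start() {
>         tag=$(vg_tag)
>         if [ -n "$tag" ] && [ "$tag" != "$NODE" ] && is_cluster_member "$tag"; then
>             return 1        # "owned" by another live member: refuse to start
>         fi
>         # untagged, or a stale/foreign tag: strip it and claim the VG
>         [ -n "$tag" ] && vgchange --deltag "$tag" "$VG"
>         vgchange --addtag "$NODE" "$VG" && vgchange -a y "$VG"
>     }
> 
>     monitor() {
>         [ -d "/dev/$VG" ] || return 7           # OCF_NOT_RUNNING
>         [ "$(vg_tag)" = "$NODE" ] || return 1   # tag "stolen": (try to deactivate and) fail
>     }
> 
>     stop() {
>         # double check we are the "owner", then untag and deactivate
>         [ "$(vg_tag)" = "$NODE" ] && vgchange --deltag "$NODE" "$VG"
>         vgchange -a n "$VG"
>     }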
>         
> So the resource agent tries to double check membership information,
> as it seems to think it is smarter than Pacemaker.
> 
> So what does that gain us above just trusting pacemaker?
> 
> What does that gain us above
> start:
>     strip all current tags
>     tag with my node name
>     activate
> 
> (If we insist on using tags, for whatever other
> reason we may have to use them)?
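> 
> In shell terms, that simpler start would be little more than this
> (again only a sketch; $VG as in the sketch above):
> 
>     for t in $(vgs --noheadings -o vg_tags "$VG" | tr ',' ' '); do
>         vgchange --deltag "$t" "$VG"        # strip all current tags
>     done
>     vgchange --addtag "$(uname -n)" "$VG"   # tag with my node name
>     vgchange -a y "$VG"                     # activate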
> 
> and, for monitor, you could add a $role=Stopped monitor action, to
> double check that it is not started where it is supposed to be stopped.
> The normal monitoring will only check that it is started where it is
> supposed to be started, from Pacemaker's point of view.
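> 
> In crm shell syntax that could look something like this (resource and
> VG names are only examples):
> 
>     primitive p_vg_example ocf:heartbeat:LVM \
>         params volgrpname="vg_example" \
>         op monitor interval="30s" \
>         op monitor interval="60s" role="Stopped"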
> 
> 
> The thing is, Pacemaker primitives will only be started on one node.
> If that node leaves the membership, it will be stonithed, to make sure
> it is really gone, before starting the primitive somewhere else.
> 
> So why would the resource agent need to double check that pacemaker did
> the right thing? Why would the resource agent think it is in a better
> position to determine whether or not it is started somewhere else,
> if it relies on the exact same infrastructure that pacemaker relies on?
> 
> What about "split brain":
> exclusivity then can only be ensured by reliable stonith.
> 
> If that is in place, pacemaker has already made sure
> that this is started exclusively.
> 
> If that is not in place, you get data corruption
> whether you configured that primitive to be "exclusive" or not:
> the currently active node "owns" the VG, but is not in the membership
> of the node that is about to activate it, which will simply relabel the
> thing and activate it anyways.
> 
>    => setting the "exclusive=1" attribute makes you "feel" safer,
>    but you are not.
>    That is a Bad Thing.
> 
> In the comments on the github pull request,
> Lon wrote:
>   > I believe the tagging/stripping and the way it's implemented is designed
>   > to prevent a few things:
>   > 1) obvious administrative errors within a running cluster: - clone
>   >    resource that really must NEVER be cloned - executing agent directly
>   >    while it's active (and other things)
> 
>   So we don't trust the Admin.
>   What if the admin forgets to add "exclusive=1" to the primitive?
> 
>   If he does not forget that, it is sufficient to just fail all actions
>   (except stop) if $exclusive is set and we are operated as a clone.
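> 
>   Something along these lines in the agent would be enough (a sketch
>   only; the parameter name "exclusive" is whatever the agent ends up
>   calling it):
> 
>       # Pacemaker exports clone meta data into the agent's environment,
>       # so running as a clone can be detected via CRM_meta_clone_max.
>       if ocf_is_true "$OCF_RESKEY_exclusive" &&
>          [ -n "$OCF_RESKEY_CRM_meta_clone_max" ]; then
>           ocf_log err "exclusive activation must not be used with cloned resources"
>           exit $OCF_ERR_CONFIGURED
>       fi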
> 
>   > 2) bugs in the resource manager - betting your entire volume group
>   > on "no bugs"?
>   > "Don't do that" only goes so far, and is little comfort to an
>   > administrator who has corrupt LVM metadata.
>   
>   Uhm, ok.
>   Uhm, ok.
>   So the failure scenario is that a Pacemaker bug could lead to
>   the same primitive being started on more than one node,
>   and the resource agent is supposed to detect that and fail.
> 
>   So you don't trust Pacemaker, but you trust corosync and stonith,
>   and you trust that your resource agent gets it right by checking
>   tags against the membership before overriding them.
> 
>   Also, once a supposedly exclusive VG is activated concurrently,
>   chances are that potential LVM *meta* data corruption is less of a
>   concern: you already have *data* corruption due to concurrent
>   modifications, one node doing journal replays of stuff that is live on
>   the other.
> 
>   > There's probably some other bits and pieces I've forgotten;
>   > Jon Brassow would know.
> 
> Hey Jon ;-)
> 
> Anyone, any input?
> 
> Is there any real use case for this implementation other than
> "I don't trust Pacemaker to be able to count to 1,
> but I still rely on the rest of the infrastructure"?

Here's what it comes down to.  You aren't guaranteed exclusive activation just
because pacemaker is in control.  There are scenarios with SAN disks where a
node starts up and can potentially attempt to activate a volume before
pacemaker has initialized.  Pacemaker would then shut the volume down
immediately, but at that point it's too late: the volume got activated on
multiple nodes when we explicitly wanted exclusive activation.  The only way to
guarantee exclusive activation without cLVMd is to use this node tagging
feature to filter who is allowed to activate the volume outside of pacemaker.
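
For reference, the "filter who is allowed" part is done in lvm.conf, if I
remember the HA-LVM setup correctly; roughly like this (the root VG and
node name are placeholders):

    activation {
        # only the root VG and VGs tagged with this node's name may be
        # activated outside of the cluster manager
        volume_list = [ "rootvg", "@node-a" ]
    }

With that in place, a shared VG that is not tagged for this node simply
cannot be activated at boot, no matter what starts before pacemaker.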

We trust pacemaker, but pacemaker isn't always in control when it comes to 
exclusive activation.  This feature accounts for that.  Jon would be able to 
answer more specific questions about this.

-- Vossel

> 
> Maybe that is a valid scenario.
> I just feel this is a layer violation,
> and it results in a false sense of safety,
> which it actually does not provide.
> 
> It seems to try to "simulate" SCSI-3 persistent reservations
> with unsuitable means.
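> 
> (For comparison, a real SCSI-3 persistent reservation is taken on the
> storage itself, e.g. with sg_persist from sg3_utils; the device path
> and key below are only examples:
> 
>     sg_persist --out --register --param-sark=0x1 /dev/sdb              # register our key
>     sg_persist --out --reserve --param-rk=0x1 --prout-type=5 /dev/sdb  # take the reservation
>     sg_persist --in --read-reservation /dev/sdb                        # show who holds it
> 
> There the storage enforces the exclusivity; an LVM tag is only honoured
> by hosts that are configured to honour it.)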
> 
> What I'm suggesting is to clearly define
> what the goal of this "exclusive" feature is to be.
> 
> Then check again if we really want that,
> or if it is actually already covered
> by Pacemaker some way or other.
> 
> Maybe this has been originally implemented on request of some customer,
> who was happy once he was able to say "exclusive=1", without thinking
> about the technical details?
> 
> Or maybe I'm just missing the point completely.
> 
> Thanks,
>       Lars
> 
> 
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
