Hi,

On 01/13/11 17:14, Lars Marowsky-Bree wrote:
Hi all,

sorry for the delay in posting this.
And sorry for the delay in replying to this :-) I have some questions about this below.


Introduction: At LPC 2010, we discussed (once more) that a key feature
for pacemaker in 2011 would be improved support for multi-site clusters;
by multi-site, we mean two (or more) sites with a local cluster each,
Would the topology of such a multi-site deployment be indicated in the cib configuration? Or is it just something corosync would need to care about?

And would the cibs between the different sites still be synchronized? In other words, would there normally be only one DC among the sites?

and some higher level entity coordinating fail-over across these (as
opposed to "stretched" clusters, where a single cluster might span the
whole campus in the city).

Typically, such multi-site environments are also too far apart to
support synchronous communication/replication.

There are several aspects to this that we discussed; Andrew and I first
described and wrote this out a few years ago, so I hope he can remember
the rest ;-)

"Tokens" are, essentially, cluster-wide attributes (similar to node
attributes, just for the whole partition).
Specifically, a "<tokens>" section with an attribute set ("<token_set>" or something) under "/cib/configuration"?
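
Just to make that concrete, a hypothetical layout might be (all element and attribute names here are my own invention, not an agreed schema):

  <cib>
    <configuration>
      <!-- Hypothetical sketch only: "tokens", "token_set" and "token"
           are invented names, and "granted" is an assumed attribute. -->
      <tokens>
        <token_set id="site-tokens">
          <token id="tokenA" granted="true"/>
          <token id="tokenB" granted="false"/>
        </token_set>
      </tokens>
    </configuration>
  </cib>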

Should an admin grant a token to the cluster initially? Or grant it to several nodes which are supposed to be in the same site? Or grant it to a partition after a split-brain happens? A split-brain can happen between the sites or inside a site -- how would those cases be distinguished, and what policies would handle each scenario? And what if a partition splits further?

Additionally, when a split-brain happens, how does this interact with the existing stonith mechanism? Should the partition without quorum be stonithed? If it shouldn't, or couldn't, should that partition still elect a DC? And what about the no-quorum-policy?


Via dependencies (similar to
rsc_location), one can specify that certain resources require a specific
token to be set before being started
Which way do you prefer? I found you discussed this in another thread last year. The choices mentioned there were:
- A "<rsc_order>" with "Deadman" order-type specified:
<rsc_order id="order-tokenA-rscX" first-token="tokenA" then="rscX" kind="Deadman"/>

- A "<rsc_colocation>":
<rsc_colocation id="rscX-with-tokenA" rsc="rscX" with-token="tokenA" kind="Deadman"/>


Other choices I can imagine:

- There is already a "requires" field in an "op", which can be set to "quorum" or "fencing". Similarly, we could introduce a "requires-token" field:

<op id="rscX-start" name="start" interval="0" requires-token="tokenA"/>

The shortcoming is that a resource cannot depend on multiple tokens this way.


- A "<rsc_location>" with expressions:

  <rsc_location id="loc-rscX" rsc="rscX" kind="Deadman">
    <rule id="loc-rscX-rule-0">
      <expression id="expr-0" attribute="#tokenA" operation="eq" value="true"/>
    </rule>
  </rsc_location>

Via "boolean-op", a resource can depend on all of several specified tokens, or on any one of them.
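
For example, a sketch of a rule that requires two tokens at once (still assuming the unsettled "#tokenX" attribute convention from above):

  <rsc_location id="loc-rscX-both-tokens" rsc="rscX" kind="Deadman">
    <!-- boolean-op="and": rscX needs both tokens;
         boolean-op="or" would mean any one of them suffices. -->
    <rule id="loc-rscX-rule-and" boolean-op="and">
      <expression id="expr-tokenA" attribute="#tokenA" operation="eq" value="true"/>
      <expression id="expr-tokenB" attribute="#tokenB" operation="eq" value="true"/>
    </rule>
  </rsc_location>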

- A completely new type of constraint:
<rsc_token id="rscX-with-tokenA" rsc="rscX" token="tokenA" kind="Deadman"/>


(and, vice versa, need to be
stopped if the token is cleared). You could also think of our current
"quorum" as a special, cluster-wide token that is granted in case of
node majority.

The token thus would be similar to a "site quorum"; i.e., the permission
to manage/own resources associated with that site, which would be
recorded in a rsc dependency. (It'd probably make a lot of sense if this
would support resource sets,
If so, the "op" and the current "rsc_location" variants above would not be preferred.

so one can easily list all the resources;
also, some resources like m/s may tie their role to token ownership.)
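
If resource sets were supported, the new constraint type from above might look something like this (the syntax is pure speculation on my part):

  <rsc_token id="siteA-resources" token="tokenA" kind="Deadman">
    <!-- All resources in the sets depend on tokenA. -->
    <resource_set id="siteA-rscs">
      <resource_ref id="rscX"/>
      <resource_ref id="rscY"/>
    </resource_set>
    <!-- A m/s resource could tie its Master role to token ownership. -->
    <resource_set id="siteA-masters" role="Master">
      <resource_ref id="ms-rscZ"/>
    </resource_set>
  </rsc_token>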

These tokens can be granted/revoked either manually (which I actually
expect will be the default for the classic enterprise clusters), or via
an automated mechanism described further below.


Another aspect to site fail-over is recovery speed. A site can only
activate the resources safely if it can be sure that the other site has
deactivated them. Waiting for them to shutdown "cleanly" could incur
very high latency (think "cascaded stop delays"). So, it would be
desirable if this could be short-circuited. The idea between Andrew and
myself was to introduce the concept of a "dead man" dependency; if the
origin goes away, nodes which host dependent resources are fenced,
immensely speeding up recovery.
Does the "origin" mean the "token"? If so, isn't it supposed to be revoked manually by default? Wouldn't the short-circuited fail-over then need an admin to participate?

BTW, Xinwei once suggested treating "the token is not set" and "the token is set to no" differently. For the former, the behavior would be as if the token dependencies didn't exist. If the token is explicitly set, the appropriate policies would be invoked. Does that help to distinguish the scenarios?


It seems to make most sense to make this an attribute of some sort for
the various dependencies that we already have, possibly, to make this
generally available. (It may also be something admins want to
temporarily disable - i.e., for a graceful switch-over, they may not
want to trigger the dead man process always.)
Does it mean an option for users to choose between immediate fencing and stopping the resources normally? Would it be global, specific to a token, or even specific to a single dependency?
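
Just as a sketch of what I mean, reusing the "rsc_token" variant from above (the "deadman-enabled" attribute name is invented):

  <!-- Hypothetical: a per-dependency switch to suppress fencing during a
       graceful switch-over; resources would then just be stopped normally. -->
  <rsc_token id="rscX-with-tokenA" rsc="rscX" token="tokenA"
             kind="Deadman" deadman-enabled="false"/>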



The next bit is what we called the "Cluster Token Registry"; for those
scenarios where the site switch is supposed to be automatic (instead of
the admin revoking the token somewhere, waiting for everything to stop,
and then granting it on the desired site). The participating clusters
would run a daemon/service that would connect to each other, exchange
information on their connectivity details (though conceivably, not mere
majority is relevant, but also current ownership, admin weights, time
of day, capacity ...), and vote on which site gets which token(s); a
token would only be granted to a site once they can be sure that it has
been relinquished by the previous owner, which would need to be
implemented via a timer in most scenarios (see the dead man flag).

Further, sites which lose the vote (either explicitly or implicitly by
being disconnected from the voting body) would obviously need to perform
said release after a sane time-out (to protect against brief connection
issues).
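
Conceivably these timers could be per-token settings; the attribute names below are pure invention on my part:

  <!-- Hypothetical sketch: "release-timeout" = how long a site that lost
       the vote (or lost the connection) waits before relinquishing the
       token; "acquire-delay" = how long the winner waits before activating
       resources, so the previous owner is sure to have released them. -->
  <token id="tokenA" release-timeout="60s" acquire-delay="90s"/>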


A final component is an idea to ease administration and management of
such environments. The dependencies allow an automated tool to identify
which resources are affected by a given token, and this could be
automatically replicated (and possibly transformed) between sites, to
ensure that all sites have an up-to-date configuration of the relevant
resources. This would be handled by yet another extension, a CIB
replicator service (that would either run permanently or explicitly when
the admin calls it).

Conceivably, the "inactive" resources may not even be present in the
active CIB of sites which don't own the token (and be inserted once
token ownership is established). This may be an (optional) interesting
feature to keep CIB sizes under control.


Andrew, is that about what we discussed? Any comments from anyone else?
Did I capture what we spoke about at LPC?


Regards,
     Lars


Regards,
  Yan
--
Yan Gao <y...@novell.com>
Software Engineer
China Server Team, OPS Engineering, Novell, Inc.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
