Team

My first thought regarding this proposal: what mechanism is going to be in
place to ensure that there is no "split brain" scenario?

For smaller rings we can rely on the fallback of a "shared media/SBD
device" to ensure that there is consistency.

If there is a comms interruption between ring members, is there a danger
that each remaining half will then recruit new nodes from its "satellite
spares"?

Do we need to consider a mechanism to adapt the node configuration (e.g.
adding SBD devices), or is that just going to complicate things further?

Darren Thompson

Professional Services Engineer / Consultant


Level 3, 60 City Road

Southgate, VIC 3006

Mb: 0400 640 414

Mail: [email protected] <[email protected]>
Web: www.akurit.com.au

On 19 March 2015 at 23:00, <[email protected]> wrote:

>
> Today's Topics:
>
>    1. RFC: Extending corosync to high node counts (Christine Caulfield)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 Mar 2015 10:05:41 +0000
> From: Christine Caulfield <[email protected]>
> To: [email protected]
> Subject: [corosync] RFC: Extending corosync to high node counts
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=utf-8
>
> Extending corosync
> ------------------
>
> This is an idea that came out of several discussions at the cluster
> summit in February. Please comment!
>
> It is not meant to be a generalised solution to extending corosync for
> most users. For single- and double-digit cluster sizes the current ring
> protocols should be sufficient. This is intended to make corosync usable
> over much larger node counts.
>
> The problem
> -----------
> Corosync doesn't scale well to large numbers of nodes (60-100 to 1000s).
> This is mainly down to the requirements of virtual synchrony (VS) and the
> ring protocol.
>
> A proposed solution
> -------------------
> Have 'satellite' nodes that are not part of the ring (and do not
> participate in VS). They communicate via a single 'host' node over
> (possibly) TCP. The host sends messages to them in a 'send and forget'
> system - though TCP guarantees ordering and delivery. Host nodes can
> support many satellites. If a host goes down, the satellites can
> reconnect to another node and carry on.
>
> Satellites have no votes, and do not participate in Virtual Synchrony.
>
> Satellites can send/receive CPG messages and get quorum information but
> will not appear in
> the quorum nodes list.
>
> There must be a separate nodes list for satellites, probably maintained
> by a different subsystem. Satellites will have nodeIDs (required for CPG)
> that do not clash with the ring nodeIDs.
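>
> As a rough sketch of that uniqueness rule (in C, with made-up names -
> none of this is existing corosync code), the check across the two lists
> could look like:
>
>   #include <stdint.h>
>   #include <stddef.h>
>
>   struct nodelist {
>           const uint32_t *ids;
>           size_t count;
>   };
>
>   /* A nodeID is usable only if it appears in neither the ring list nor
>    * the satellite list; duplicates must be rejected at join time. */
>   int nodeid_is_free(uint32_t nodeid, const struct nodelist *ring,
>                      const struct nodelist *sats)
>   {
>           const struct nodelist *lists[2] = { ring, sats };
>
>           for (int l = 0; l < 2; l++)
>                   for (size_t i = 0; i < lists[l]->count; i++)
>                           if (lists[l]->ids[i] == nodeid)
>                                   return 0;
>           return 1;
>   }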
>
>
> Appearance to the user/admin
> ----------------------------
> corosync.conf defines which nodes are satellites and which nodes to
> connect to (initially). May
> want some utility to force satellites to migrate from a node if it gets
> full.
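>
> A very rough idea of how that might look in corosync.conf (the
> 'satellitelist' section and its options are hypothetical - nothing like
> this exists yet; only the nodelist part matches current syntax):
>
>   nodelist {
>           node {
>                   nodeid: 1
>                   ring0_addr: ringnode1
>           }
>           node {
>                   nodeid: 2
>                   ring0_addr: ringnode2
>           }
>   }
>
>   # hypothetical section for satellite nodes
>   satellitelist {
>           satellite {
>                   nodeid: 1001
>                   addr: satnode1
>                   # preferred hosts, tried in order
>                   hosts: ringnode1, ringnode2
>           }
>   }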
>
> Future: Automatic configuration of who is in the VS cluster and who is a
> satellite. Load balancing.
>         Maybe need 'preferred nodes' to avoid bad network topologies
>
>
> Potential problems
> ------------------
> - corosync uses a packet-based protocol, TCP is a stream (I don't see
>   this as a big problem, TBH - see the framing sketch after this list)
> - Where to hook the message transmission in the corosync networking stack?
>   - We don't need a lot of the totem messages
>   - maybe hook into group 'a' and/or 'sync' (do we need 'sync' in
>     satellites? [CPG, so probably yes])
> - Which is client/server? (if satellites are clients with an authkey we
>   get easy failover and config, but ... DoS potential??)
> - What if TCP buffers get full? Suggest just cutting off the node.
> - How to stop satellites from running totemsrp?
> - Fencing, do we need it? (pacemaker problem?)
> - GFS2? Is this needed/possible?
> - Keeping two node lists (totem/quorum and satellite) - duplicate node IDs
>   are not allowed and this will need to be enforced.
> - No real idea if this will scale as well as I hope it will!
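>
> To illustrate the packet-vs-stream point, a minimal framing sketch in C
> (illustrative only, not corosync code): each corosync packet is sent as a
> 4-byte length header followed by the body, so the satellite can recover
> packet boundaries from the TCP stream.
>
>   #include <stdint.h>
>   #include <sys/types.h>
>   #include <unistd.h>
>   #include <arpa/inet.h>
>
>   /* Write exactly len bytes, retrying on short writes. */
>   static int write_all(int fd, const void *buf, size_t len)
>   {
>           const char *p = buf;
>
>           while (len > 0) {
>                   ssize_t n = write(fd, p, len);
>                   if (n <= 0)
>                           return -1;      /* caller cuts the satellite off */
>                   p += n;
>                   len -= (size_t)n;
>           }
>           return 0;
>   }
>
>   /* Send one packet as [4-byte length, network order][body]. */
>   int satellite_send_frame(int fd, const void *pkt, uint32_t pkt_len)
>   {
>           uint32_t hdr = htonl(pkt_len);
>
>           if (write_all(fd, &hdr, sizeof(hdr)) < 0)
>                   return -1;
>           return write_all(fd, pkt, pkt_len);
>   }
>
> The receiving side just reads the 4-byte header, then reads exactly that
> many bytes before handing the packet to the normal message path.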
>
>
> How it will (possibly) work
> ---------------------------
> Totemsrp messages will be unaffected (in the 1st revision at least);
> satellites are not part of this protocol.
> Totempg messages are sent around the ring as usual. When one arrives at a
> node with satellites, it forwards it around the ring as usual, then sends
> that message to each of the satellites in turn. If a send fails then the
> satellite is cut off and removed from the configuration.
> When a message is received from a satellite it is repackaged as a totempg
> message and sent around the cluster as normal.
> Satellite nodes will be handled by another corosync service that is
> loaded: a new corosync service handler will maintain the extra nodes list
> and (maybe) do the satellite forwarding.
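>
> A sketch of the host-side fan-out (again illustrative C; the satellite
> list, sat_remove() and satellite_send_frame() names are invented here):
>
>   #include <stdint.h>
>   #include <stddef.h>
>
>   struct satellite {
>           uint32_t nodeid;
>           int fd;                         /* TCP socket to the satellite */
>           struct satellite *next;
>   };
>
>   /* Framing helper from the sketch above. */
>   extern int satellite_send_frame(int fd, const void *pkt, uint32_t len);
>   /* Drop the satellite and send a leave notification round the cluster. */
>   extern void sat_remove(struct satellite *sat);
>
>   /* Forward one totempg message to every connected satellite. */
>   void satellite_forward_all(struct satellite *sat_list,
>                              const void *msg, uint32_t msg_len)
>   {
>           struct satellite *sat = sat_list;
>
>           while (sat != NULL) {
>                   struct satellite *next = sat->next;
>
>                   /* 'Send and forget': a failed send means the satellite
>                    * is summarily cut off, as described above. */
>                   if (satellite_send_frame(sat->fd, msg, msg_len) < 0)
>                           sat_remove(sat);
>
>                   sat = next;
>           }
>   }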
>
> - Joining
>   A satellite sends a TCP connect and then a join request to its
> nominated (or fallback) host.
>   The host can accept or reject this for reasons of (at least):
>    - duplicated nodeid
>    - no capacity
>    - bad key
>    - bad config
>   The service then sends the new node information to the rest of the
>   cluster; quorum is not affected (a rough sketch of the accept/reject
>   check follows).
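>
> A rough sketch of the accept/reject decision (illustrative C; the request
> structure, the helper functions and the per-host limit are invented):
>
>   #include <stdint.h>
>
>   enum join_result {
>           JOIN_ACCEPT,
>           JOIN_REJECT_DUP_NODEID,
>           JOIN_REJECT_NO_CAPACITY,
>           JOIN_REJECT_BAD_KEY,
>           JOIN_REJECT_BAD_CONFIG
>   };
>
>   struct sat_join_req {
>           uint32_t nodeid;
>           uint8_t  authkey_digest[32];    /* derived from the shared authkey */
>   };
>
>   /* Assumed to exist elsewhere in this sketch. */
>   extern int nodeid_in_use(uint32_t nodeid);      /* ring or satellite list */
>   extern int satellite_count(void);
>   extern int authkey_digest_ok(const uint8_t *digest);
>   extern int config_compatible(const struct sat_join_req *req);
>
>   #define MAX_SATELLITES_PER_HOST 256             /* illustrative limit */
>
>   enum join_result satellite_join_check(const struct sat_join_req *req)
>   {
>           if (!authkey_digest_ok(req->authkey_digest))
>                   return JOIN_REJECT_BAD_KEY;
>           if (!config_compatible(req))
>                   return JOIN_REJECT_BAD_CONFIG;
>           if (nodeid_in_use(req->nodeid))
>                   return JOIN_REJECT_DUP_NODEID;
>           if (satellite_count() >= MAX_SATELLITES_PER_HOST)
>                   return JOIN_REJECT_NO_CAPACITY;
>           return JOIN_ACCEPT;
>   }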
>
>  - leaving
>    If a TCP send fails or a socket is disconnected then the node is
> summarily removed
>    - there will probably also be a 'leave' message for tidy removal
>    - leave notifications are sent around the cluster so that CPG and the
> secondary nodelist know.
>    - quorum does not need to know.
>
>  - failover
>    Satellites have a list of all nodes (quorum and satellite). If a TCP
>    connection is broken they can try to contact the next node in the
>    nodeid list of quorum nodes (see the sketch below).
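>
> A sketch of that failover walk (illustrative C; sat_connect() and the
> node list layout are invented for the example):
>
>   #include <stdint.h>
>   #include <stddef.h>
>
>   struct host_node {
>           uint32_t nodeid;
>           const char *address;
>   };
>
>   /* Opens a TCP connection; returns a socket fd, or -1 on failure. */
>   extern int sat_connect(const char *address);
>
>   /* Try the quorum nodes after the one that failed, wrapping around so
>    * each gets one attempt. Returns a connected fd or -1. */
>   int satellite_failover(const struct host_node *quorum_nodes,
>                          size_t n_nodes, uint32_t failed_nodeid)
>   {
>           size_t start = 0;
>
>           for (size_t i = 0; i < n_nodes; i++) {
>                   if (quorum_nodes[i].nodeid == failed_nodeid) {
>                           start = i + 1;
>                           break;
>                   }
>           }
>
>           for (size_t i = 0; i < n_nodes; i++) {
>                   const struct host_node *cand =
>                           &quorum_nodes[(start + i) % n_nodes];
>
>                   if (cand->nodeid == failed_nodeid)
>                           continue;
>                   int fd = sat_connect(cand->address);
>                   if (fd >= 0)
>                           return fd;      /* reconnected to a new host */
>           }
>           return -1;                      /* no host reachable */
>   }
>
> On success the satellite would then repeat the join sequence described
> above with its new host.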
>
> Timescales
> ----------
> Nothing decided at this stage; certainly Corosync 3.0 at the earliest,
> as it will break the on-wire protocol.
> Need to do a proof-of-concept, maybe using containers to get a high node
> count.
>
>
>
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss
