I am curious. What is driving the need for more than 32 nodes? Are many people doing that or planning on doing that?
In my experience, > 80% of the people just want 2 nodes to work reliably, and more than 4 nodes is just a marketing requirement to put on a glossy handout. Is that still the case, or am I off base?

Thanks,
Bob

----- Original Message ----
From: Lars Ellenberg <lars.ellenb...@linbit.com>
To: linux-ha-dev@lists.linux-ha.org
Sent: Wed, November 24, 2010 9:18:23 AM
Subject: Re: [Linux-ha-dev] Thinking about a new communications plugin

On Mon, Nov 22, 2010 at 02:18:27PM -0700, Alan Robertson wrote:
> Hi,
>
> I've been thinking about a new unicast communications plugin that would
> work slightly differently from the current ucast plugin.
>
> It would take a filename giving the hostnames or ipv4 or ipv6 unicast
> addresses that one wants to send heartbeats to.
>
> When heartbeat receives a SIGHUP, this plugin would reread this file and
> reconfigure the hosts to send heartbeats to.
>
> This would mean that there would be no reason to have to restart
> heartbeat just to add or delete a host from the list being sent heartbeats.
>
> Some environments (notably clouds) don't allow either broadcasts or
> multicasts. This would allow those environments to be able to add and
> delete hosts to the cluster without having to restart heartbeat - as
> occurs now... [and I'd like to support ipv6 for heartbeats].

mcast6 is already there. ucast6 would be a matter of an afternoon.

> Any thoughts about this?
>
> Would anyone else like such a plugin?

My direct answer to that question would be "Yes, I'd like that".

But it triggers a slightly longer answer, too: there is much more interesting work to do in the heartbeat comm layer than reconfiguring ucast on the fly. Like:

a) clearly separate control fields and payload fields -- for example, always put payload in its own "FT_UNCOMPRESS". That way, transparent compression could even compress very long FT_STRING payload fields, and we would no longer be confused by payload fields accidentally being named client_gen ...
b) support media payloads >= 64k (the hard UDP limit) by sending multiple UDP packets for one message. This limit, btw, may be even less, depending on the network setup and equipment involved, and is not even mentioned anywhere in doc or code -- sendto() will just fail with EMSGSIZE.

c) not sending node-directed messages via every unicast link. The problem is the global per-node sequence number space, which is currently shared between cluster-wide and directed-node messages: the next cluster message would generate rexmit requests. Possible solutions:
   - separate these sequence number spaces, or
   - append a new control field to cluster messages recording the sequence numbers used for node messages, so the receiving node of the cluster message knows which "missing" sequence numbers not to re-request.

Pacemaker 1.1 currently won't work on heartbeat even with a normal-sized cib, because it sends down FT_STRING fields with the full cib, up to about 128k. A workaround would be to enable "traditional" compression, or to do it differently in pacemaker. Or, see above -- I think it is actually a design bug in the heartbeat comm layer, and could be fixed by a) above.

Once you aim for more than a handful of nodes, heartbeat's media cluster communication will break horribly, because of the hard 64k UDP message size limit and the lack of any way to fragment a message across more than one UDP packet. Even with compression enabled, with 32 nodes and a few clones you will quickly get > 64k messages.

The rds plugin I wrote as a proof of concept can handle much bigger messages, and would greatly benefit from both c) above and a method to re-read a list of peers from some config file (what you proposed for ucast). It easily supports multi-megabyte message sizes, and even does away with re-ordering and rexmit requests on the receiving side.
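[Editor's note: point b) above -- splitting one logical message across several datagrams to get past the 64k UDP ceiling -- can be sketched roughly as below. This is an illustration only: the `HEADER` framing, field names, and sizes are invented for the example and do not correspond to heartbeat's actual wire format, and real code would also need loss handling and timeouts.]

```python
import struct

# Hypothetical framing, NOT heartbeat's real format:
# (message id, fragment index, fragment count)
HEADER = struct.Struct("!IHH")
MAX_DGRAM = 1400                 # conservative, well under the 64 KiB UDP limit
CHUNK = MAX_DGRAM - HEADER.size  # payload bytes per datagram

def fragment(msg_id: int, payload: bytes):
    """Split one logical message into datagram-sized fragments."""
    chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)] or [b""]
    return [HEADER.pack(msg_id, idx, len(chunks)) + c
            for idx, c in enumerate(chunks)]

def reassemble(datagrams):
    """Rebuild the message, tolerating out-of-order arrival.

    Returns None while fragments are still missing; real code would
    also track message ids and time incomplete messages out.
    """
    parts, total = {}, None
    for d in datagrams:
        _msg_id, idx, count = HEADER.unpack_from(d)
        total = count
        parts[idx] = d[HEADER.size:]
    if total is None or len(parts) != total:
        return None  # incomplete
    return b"".join(parts[i] for i in range(total))
```

With framing like this, a > 64k message simply becomes more fragments, and the receiver can detect gaps from (index, count) instead of failing at sendto() time.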
Only, the rds plugin is just a proof of concept: it does not do anything useful once things break, nodes vanish, or on congestion (no need for rexmit requests from the receiving side is traded against the need to retry sending on congestion on the sending side). So there is much work to do there, too, if someone wants to pick that up.

So the question of joining additional nodes is not a question of conveniently configuring it. It's a question of whether the communication layer can support the increased message size caused by one more node in the cib, as full cib updates including the status section must still be supported, even though they have become less frequent lately. Currently, the answer to that question is "No, one more node will break it", very quickly.

Once that basically works, then would be the time to think about convenience of configuration, IMHO. But that's obviously more work than re-reading a config file on a signal, so it will likely not be done too soon -- unless someone has a specific pressing need, is not willing to try an alternative messaging layer, and really wants it fixed in heartbeat.

Thanks for reading all of that. Thoughts?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/