I am curious. What is driving the need for more than 32 nodes? Are many people doing that or planning on doing that?
In my experience, > 80% of the people just want 2 nodes to work reliably, and more than 4 nodes is just a marketing requirement to put on a glossy handout. Is that still the case, or am I off base?

Thanks,
Bob

----- Original Message ----
From: Lars Ellenberg <lars.ellenb...@linbit.com>
To: linux-ha-dev@lists.linux-ha.org
Sent: Wed, November 24, 2010 9:18:23 AM
Subject: Re: [Linux-ha-dev] Thinking about a new communications plugin

On Mon, Nov 22, 2010 at 02:18:27PM -0700, Alan Robertson wrote:
> Hi,
>
> I've been thinking about a new unicast communications plugin that would
> work slightly differently from the current ucast plugin.
>
> It would take a filename giving the hostnames or ipv4 or ipv6 unicast
> addresses that one wants to send heartbeats to.
>
> When heartbeat receives a SIGHUP, this plugin would reread this file and
> reconfigure the hosts to send heartbeats to.
>
> This would mean that there would be no reason to have to restart
> heartbeat just to add or delete a host from the list being sent heartbeats.
>
> Some environments (notably clouds) don't allow either broadcasts or
> multicasts. This would allow those environments to be able to add and
> delete hosts to the cluster without having to restart heartbeat - as
> occurs now... [and I'd like to support ipv6 for heartbeats].

mcast6 is already there. ucast6 would be a matter of an afternoon.

> Any thoughts about this?
>
> Would anyone else like such a plugin?

My direct answer to that question would be "Yes, I'd like that".

But it triggers a slightly longer answer, too: there is much more interesting work to do in the heartbeat comm layer than reconfiguring ucast on the fly. Like:

a) clearly separate control fields and payload fields -- for example, always put payload in its own "FT_UNCOMPRESS". That way, transparent compression could even compress very long FT_STRING payload fields, and we would no longer be confused by payload fields accidentally being named client_gen ...
b) support media payloads >= 64k (the hard UDP limit) by sending multiple UDP packets for one message. This limit, btw, may be even less, depending on the network setup and equipment involved, and is not even mentioned anywhere in doc or code -- sendto() will just fail with EMSGSIZE.

c) not sending node-directed messages via every unicast link. The problem is the global per-node sequence number space, which is currently shared between cluster-wide and directed-node messages: the next cluster message would generate rexmit requests. Possible solutions:
   - separate these sequence number spaces, or
   - append a new control field to cluster messages recording the sequence numbers used for node messages, so the receiving node of the cluster message knows which "missing" sequence numbers not to re-request.

Pacemaker 1.1 currently won't work on heartbeat even with a normal-sized cib, because it sends down FT_STRING fields with the full cib, up to about 128k. A workaround would be to enable "traditional" compression, or to do it differently in pacemaker. Or, see above -- I think it is actually a design bug in the heartbeat comm layer, and could be fixed by a) above.

Once you aim for more than a handful of nodes, heartbeat's media cluster communication will break horribly, because of the hard 64k UDP message size limit and the lack of any way to fragment a message across more than one UDP packet. Even with compression enabled, with 32 nodes and a few clones you will quickly get > 64k messages.

The rds plugin I wrote as a proof of concept can handle much bigger messages, and would greatly benefit from both c) above and a method to re-read a list of peers from some config file (what you proposed for ucast). It easily supports multi-megabyte message sizes, and even does away with re-ordering and rexmit requests on the receiving side.
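[Editor's note: point b) above -- splitting one logical message across several datagrams to get past the 64k UDP ceiling -- can be sketched roughly as below. This is an illustration only: the `HEADER` framing, field names, and sizes are invented for the example and do not correspond to heartbeat's actual wire format, and real code would also need loss handling and timeouts.]

```python
import struct

# Hypothetical framing, NOT heartbeat's real format:
# (message id, fragment index, fragment count)
HEADER = struct.Struct("!IHH")
MAX_DGRAM = 1400                 # conservative, well under the 64 KiB UDP limit
CHUNK = MAX_DGRAM - HEADER.size  # payload bytes per datagram

def fragment(msg_id: int, payload: bytes):
    """Split one logical message into datagram-sized fragments."""
    chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)] or [b""]
    return [HEADER.pack(msg_id, idx, len(chunks)) + c
            for idx, c in enumerate(chunks)]

def reassemble(datagrams):
    """Rebuild the message, tolerating out-of-order arrival.

    Returns None while fragments are still missing; real code would
    also track message ids and time incomplete messages out.
    """
    parts, total = {}, None
    for d in datagrams:
        _msg_id, idx, count = HEADER.unpack_from(d)
        total = count
        parts[idx] = d[HEADER.size:]
    if total is None or len(parts) != total:
        return None  # incomplete
    return b"".join(parts[i] for i in range(total))
```

With framing like this, a > 64k message simply becomes more fragments, and the receiver can detect gaps from (index, count) instead of failing at sendto() time.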
Only, the rds plugin is just a proof of concept: it does not do anything useful once things break, nodes vanish, or on congestion (no need for rexmit requests from the receiving side is traded against the need to retry sending on congestion on the sending side). So there is much work to do there, too, if someone wants to pick that up.

So the question of joining additional nodes is not a question of conveniently configuring it. It's a question of whether the communication layer can support the increased message size caused by one more node in the cib, as full cib updates including the status section must still be supported, even though they have become less frequent lately. Currently, the answer to that question is "No, one more node will break it", very quickly.

Once that basically works, then would be the time to think about convenience of configuration, IMHO. But that's obviously more work than re-reading a config file on a signal, so it will likely not be done too soon -- unless someone has a specific pressing need, is not willing to try an alternative messaging layer, and really wants it fixed in heartbeat.

Thanks for reading all of that. Thoughts?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/