Re: [ClusterLabs] large cluster with corosync
On 04.01.2017 13:52, Jan Friesse wrote:
> Variables you can try to tweak:
> - Definitely start by increasing totem.config (default 1000, you can try 1)

What does that do? I haven't found it in corosync.conf(5).

> - If that doesn't help, try increasing totem.join (default is 50, 1000 may work) and consider increasing totem.send_join (default is 0, 100 may be a good idea).

The problem is indeed the flood of join messages overwhelming the receive socket. I increased the socket buffer size in the source code from 160k to 3MB, and forming the cluster now works fine for 80 nodes. send_join looks like a very useful variable for avoiding the need for excessively large buffers, so I'll test with join=120ms and send_join=60ms.

-Arne

> - As a last variable, increase totem.merge (default is 200, 2000 may do the job).
>
> And definitely let us know about the results. It's quite hard to test such a large number of nodes, so some of the variables may be sub-optimal. When we know which of the variables are the culprits, we can change their defaults.
>
> Regards,
>   Honza
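The buffer change Arne describes was made in the corosync C source, which the thread does not show. As a minimal sketch of the underlying mechanism only, the following Python snippet requests a larger receive buffer on a UDP socket through the standard SO_RCVBUF socket option and reads back what the kernel actually granted. The helper name `enlarge_receive_buffer` is hypothetical, and the 3 MB figure is simply the value mentioned in the message.

```python
import socket


def enlarge_receive_buffer(sock: socket.socket, requested_bytes: int) -> int:
    """Ask the kernel for a larger receive buffer and return the size it reports.

    The kernel may clamp the request to a system-wide maximum (on Linux,
    net.core.rmem_max), so the returned value can differ from the request.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # Hypothetical stand-in for the daemon's receive socket.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        after = enlarge_receive_buffer(sock, 3 * 1024 * 1024)
        print(f"receive buffer: {before} -> {after} bytes")
```

Reading the value back matters because a request above the system limit is not guaranteed to be honoured in full, so a larger number in the source code alone does not prove the socket really has that much room.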
Re: [ClusterLabs] large cluster with corosync
Hi Honza,

On 04.01.2017 13:52, Jan Friesse wrote:
>> At least those limits don't seem to get enforced, as a 64 node cluster seems to work, although a bit shaky.
>
> No, they are not enforced. 16/32 are the officially supported numbers of nodes. Basically, this is the number that was tested and is known to work reliably. This doesn't mean corosync doesn't work with a larger number of nodes. Even so, I'm quite surprised that 64 nodes really work.
>
> Variables you can try to tweak:
> - Definitely start by increasing totem.config (default 1000, you can try 1)
> - If that doesn't help, try increasing totem.join (default is 50, 1000 may work) and consider increasing totem.send_join (default is 0, 100 may be a good idea).
> - As a last variable, increase totem.merge (default is 200, 2000 may do the job).
>
> And definitely let us know about the results. It's quite hard to test such a large number of nodes, so some of the variables may be sub-optimal. When we know which of the variables are the culprits, we can change their defaults.

Thanks for the tuning hints. One thing I definitely need to do before the next test run is to build some kind of rate limiting into the daemon to prevent it from bringing our whole network down. I see sendmsg() calls in totemudp.c and totemudpu.c. Would it be sufficient to restrict those two calls?

Thanks,
Arne

> Regards,
>   Honza
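The thread does not say how the rate limiting would be built, nor whether guarding the two sendmsg() calls is enough. One common way to cap a send rate is a token bucket, sketched below in Python rather than in corosync's C. Everything here is an assumption for illustration: the `TokenBucket` class, the `rate_limited_send` wrapper, and the rate and burst values are hypothetical and are not part of corosync.

```python
import time


class TokenBucket:
    """Allow `rate` sends per second on average, with bursts of up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def try_acquire(self) -> bool:
        """Refill according to elapsed time, then take one token if available."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def rate_limited_send(bucket: TokenBucket, send, payload) -> bool:
    """Call send(payload) only if the bucket has a token; report whether it was sent."""
    if bucket.try_acquire():
        send(payload)
        return True
    return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=100.0, burst=10)  # hypothetical limits
    sent = sum(rate_limited_send(bucket, lambda p: None, b"join") for _ in range(50))
    print(f"{sent} of 50 back-to-back sends were allowed")
```

This sketch drops a message when no token is available. Queueing or delaying the message instead is an equally plausible design, and which of the two a membership protocol tolerates is a question the thread leaves open.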
Re: [ClusterLabs] large cluster with corosync
On 04.01.2017 11:25, Kristoffer Grönlund wrote:
> Arne Jansen writes:
>> Hi,
>>
>> I've built corosync for Solaris and am trying to build a largish cluster. I started corosync with the default configuration on an increasing number of nodes, one by one. At around 70 nodes the cluster breaks down. Below is an excerpt from the logfile on the first node. When the cluster breaks down, corosync seems to completely flood the network. It doesn't recover by itself; I have to stop all nodes.
>>
>> Things get worse if I start corosync on multiple nodes at once. In this case it already breaks down at around 40 nodes.
>>
>> Is it supposed to work with such a setup? Does it just need tuning? My goal is to have a cluster of several hundred nodes.
>
> Hi,
>
> No, corosync has a limit to how many nodes the cluster can contain and still function properly. According to SUSE, the limit is 32 nodes; according to Red Hat, the limit is 16 (those are the limits for enterprise support, as far as I know). Even with that number of nodes, you will probably need to tweak some timeouts in corosync.conf, depending on what kind of network you have.
>
> If I recall correctly, there are plans for overcoming this limit in the future, but right now that's the situation.
>
> There is the pacemaker_remote project, which allows adding additional, non-corosync nodes to a cluster. These nodes can run resources but don't participate in quorum, and they rely on at least some of the core cluster nodes still being available. Using pacemaker_remote could enable a cluster of several hundred nodes.
>
> My question would be, why do you need so many nodes in a high availability cluster?

At least those limits don't seem to get enforced, as a 64 node cluster seems to work, although a bit shaky.

I want to use corosync + dlm to get a distributed lock manager, to have a uniform locking service throughout the platform as a basis for many services.

Thanks,
Arne

> Cheers,
> Kristoffer
[ClusterLabs] large cluster with corosync
Hi,

I've built corosync for Solaris and am trying to build a largish cluster. I started corosync with the default configuration on an increasing number of nodes, one by one. At around 70 nodes the cluster breaks down. Below is an excerpt from the logfile on the first node. When the cluster breaks down, corosync seems to completely flood the network. It doesn't recover by itself; I have to stop all nodes.

Things get worse if I start corosync on multiple nodes at once. In this case it already breaks down at around 40 nodes.

Is it supposed to work with such a setup? Does it just need tuning? My goal is to have a cluster of several hundred nodes.

Thanks,
Arne

Jan 04 10:24:49 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:25:00 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:616) was formed. Members joined: 168101089
Jan 04 10:25:00 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:25:11 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:620) was formed. Members joined: 168101090
Jan 04 10:25:11 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:25:21 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:624) was formed. Members joined: 168101091
Jan 04 10:25:21 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:25:32 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:628) was formed. Members joined: 168101092
Jan 04 10:25:32 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:26:09 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:632) was formed. Members joined: 168101141
Jan 04 10:26:10 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:26:20 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:636) was formed. Members joined: 168101142
Jan 04 10:26:20 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:26:31 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:640) was formed. Members joined: 168101143
Jan 04 10:26:31 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:26:42 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:644) was formed. Members joined: 168101144
Jan 04 10:26:42 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:26:52 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:648) was formed. Members joined: 168101145
Jan 04 10:26:52 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:27:03 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:652) was formed. Members joined: 168101146
Jan 04 10:27:03 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:27:14 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:656) was formed. Members joined: 168101147
Jan 04 10:27:14 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:27:25 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:660) was formed. Members joined: 168101148
Jan 04 10:27:25 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:27:35 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:664) was formed. Members joined: 168101161
Jan 04 10:27:35 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:27:46 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:668) was formed. Members joined: 168101162
Jan 04 10:27:46 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:27:57 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:672) was formed. Members joined: 168101163
Jan 04 10:27:57 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:28:07 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:676) was formed. Members joined: 168101164
Jan 04 10:28:07 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:28:18 [4915] reniar corosync notice [TOTEM ] A new membership (10.5.4.101:680) was formed. Members joined: 168101165
Jan 04 10:28:18 [4915] reniar corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 04 10:28:28 [4915] r
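To see how quickly nodes were joining in the excerpt above, the spacing between consecutive "A new membership" entries can be computed with a short script. This is a minimal sketch that assumes one log entry per line in the format shown, and that the excerpt does not cross midnight, because only the time of day is parsed. The function name `join_intervals` is hypothetical.

```python
import re

# Matches the time of day and the joining node ID in a TOTEM membership entry.
MEMBERSHIP_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}) .*A new membership \(\S+\) was formed\. "
    r"Members joined: (\d+)"
)


def join_intervals(log_lines):
    """Return (node_id, seconds_since_previous_join) pairs from corosync log lines."""
    results = []
    previous = None
    for line in log_lines:
        match = MEMBERSHIP_RE.search(line)
        if not match:
            continue  # skip MAIN entries and anything else
        hours, minutes, seconds, node_id = match.groups()
        stamp = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if previous is not None:
            results.append((int(node_id), stamp - previous))
        previous = stamp
    return results


if __name__ == "__main__":
    import sys

    # Usage: python join_intervals.py < corosync.log
    for node_id, gap in join_intervals(sys.stdin):
        print(f"node {node_id} joined {gap}s after the previous node")
```

The first matching entry only sets the baseline, so it produces no interval of its own.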