Re: [ClusterLabs] large cluster with corosync

2017-01-10 Thread Arne Jansen



On 04.01.2017 13:52, Jan Friesse wrote:


Variables you can try tweaking:
- Definitely start by increasing totem.config (default 1000, you can
try 1)


What does that do? I haven't found it in corosync.conf(5).


- If that doesn't help, try increasing totem.join (default is 50, 1000 may
work) and consider increasing totem.send_join (default is 0, 100 may be
a good idea).


The problem is indeed the flood of join messages overwhelming the
receive socket. I increased the socket buffer size in the source code
from 160k to 3MB, and forming the cluster now works fine for 80 nodes.
send_join looks like a very useful variable for avoiding excessively
large buffers, so I'll test with join=120ms and send_join=60ms.
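
For readers unfamiliar with the knob involved: assuming the change was to the socket's receive buffer, that buffer is normally sized with the SO_RCVBUF socket option. The sketch below is Python, not the corosync source. It only illustrates requesting 3 MB on a UDP socket and reading back what the kernel actually granted, since the kernel may cap or adjust the request. open_udp_socket is a hypothetical helper name.

```python
import socket

# 3 MB, the size mentioned in the message above.
REQUESTED_RCVBUF = 3 * 1024 * 1024


def open_udp_socket(rcvbuf=REQUESTED_RCVBUF):
    """Create a UDP socket and ask the kernel for a larger receive buffer.

    Returns the socket and the buffer size the kernel reports afterwards,
    which can differ from the requested value.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


if __name__ == "__main__":
    sock, granted = open_udp_socket()
    print("requested", REQUESTED_RCVBUF, "bytes, kernel reports", granted)
    sock.close()
```

If the reported value is much smaller than requested, the system-wide socket buffer limit is the likely cause and would need raising as well.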

-Arne


- As the last variable, increase totem.merge (default is 200, 2000 may
do the job).

And definitely let us know about the results. It's quite hard to test with
such a large number of nodes, so some of the variables may be sub-optimal.
Once we know which variables are to blame, we can change their defaults.

Regards,
  Honza



Re: [ClusterLabs] large cluster with corosync

2017-01-04 Thread Arne Jansen

Hi Honza,

On 04.01.2017 13:52, Jan Friesse wrote:



At least those limits don't seem to be enforced, as a 64-node cluster
seems to work, although it is a bit shaky.


No, they are not enforced. 16 and 32 are the officially supported numbers
of nodes. Basically, this is the number that was tested and is known to
work reliably. This doesn't mean corosync doesn't work with a larger
number of nodes. Even so, I'm quite surprised that 64 nodes really work.

Variables you can try tweaking:
- Definitely start by increasing totem.config (default 1000, you can
try 1)
- If that doesn't help, try increasing totem.join (default is 50, 1000 may
work) and consider increasing totem.send_join (default is 0, 100 may be
a good idea).
- As the last variable, increase totem.merge (default is 200, 2000 may
do the job).
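
The dotted names in the list above correspond to keys inside the totem section of corosync.conf, with timeouts given in milliseconds. As a rough illustration, the Python sketch below renders the suggested join, send_join and merge values as such a section. render_totem_section is a hypothetical helper, the values are only the starting points quoted above, and totem.config is left out because it is questioned elsewhere in this thread as not being in corosync.conf(5).

```python
def render_totem_section(settings):
    """Format a dict of totem options as a corosync.conf 'totem { ... }' block.

    The keys are expected to be option names as they appear inside the
    totem section (for example 'join' for totem.join).
    """
    lines = ["totem {"]
    for key, value in settings.items():
        lines.append(f"    {key}: {value}")
    lines.append("}")
    return "\n".join(lines)


# Values suggested in the list above (milliseconds). They are starting
# points for experiments, not tested recommendations.
suggested = {"join": 1000, "send_join": 100, "merge": 2000}

if __name__ == "__main__":
    print(render_totem_section(suggested))
```

In practice these keys would be merged into the totem section already present in corosync.conf rather than added as a second section.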

And definitely let us know about the results. It's quite hard to test with
such a large number of nodes, so some of the variables may be sub-optimal.
Once we know which variables are to blame, we can change their defaults.



Thanks for the tuning hints. One thing I definitely need to do before
the next test run is to build some kind of rate limiting into the daemon
to prevent it from bringing our whole network down. I see sendmsg()
calls in totemudp.c and totemudpu.c. Would it be sufficient to restrict
those two calls?
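
One common way to do this kind of rate limiting is a token bucket placed in front of the send call. Below is a minimal sketch in Python rather than corosync's C, with hypothetical names (TokenBucket, try_send) and made-up rate and burst values. It says nothing about whether limiting those two call sites is sufficient.

```python
import time


class TokenBucket:
    """Allow on average `rate` sends per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum number of stored tokens
        self.clock = clock           # injectable so the logic can be tested
        self.tokens = float(burst)   # start with a full bucket
        self.last = clock()

    def try_send(self):
        """Return True if a message may be sent now, False otherwise."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=1000, burst=50)  # illustrative numbers only
    sent = sum(1 for _ in range(200) if bucket.try_send())
    print("messages allowed in one tight loop:", sent)
```

A caller that gets False would have to queue or drop the message. Which of the two is acceptable for a membership protocol is a separate question that this sketch does not address.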

Thanks, Arne


Regards,
  Honza



Re: [ClusterLabs] large cluster with corosync

2017-01-04 Thread Arne Jansen



On 04.01.2017 11:25, Kristoffer Grönlund wrote:

Arne Jansen  writes:


Hi,

I've built corosync for Solaris and am trying to build a largish
cluster. I started corosync with the default configuration on an
increasing number of nodes, one by one. At around 70 nodes the
cluster breaks down. Below is an excerpt from the logfile on the
first node.
When the cluster breaks down, corosync seems to completely flood
the network. It doesn't recover by itself; I have to stop all nodes.
Things get worse if I start corosync on multiple nodes at once.
In this case it already breaks down at around 40 nodes.

Is it supposed to work with such a setup? Does it just need tuning?
My goal is to have a cluster of several hundred nodes.


Hi,

No, corosync has a limit to how many nodes the cluster can
contain and still function properly. According to SUSE, the limit is 32
nodes, according to Red Hat the limit is 16 (those are the limits for
enterprise support as far as I know) - but even having that number
of nodes will probably require some tweaking of timeouts in
corosync.conf, depending on what kind of network you have.

If I recall correctly, there are plans for overcoming this limit in the
future but right now that's the situation.

There is the pacemaker_remote project which allows for adding
additional, non-corosync nodes to a cluster which can run resources but
don't participate in quorum and rely on at least some of the core
cluster nodes still being available. Using pacemaker_remote could enable
a cluster of several hundred nodes.

My question would be, why do you need so many nodes in a high
availability cluster?


At least those limits don't seem to be enforced, as a 64-node cluster
seems to work, although it is a bit shaky.
I want to use corosync + dlm to get a distributed lock manager to have
a uniform locking service throughout the platform as a basis for many
services.

Thanks,
Arne



Cheers,
Kristoffer




[ClusterLabs] large cluster with corosync

2017-01-04 Thread Arne Jansen

Hi,

I've built corosync for Solaris and am trying to build a largish
cluster. I started corosync with the default configuration on an
increasing number of nodes, one by one. At around 70 nodes the
cluster breaks down. Below is an excerpt from the logfile on the
first node.
When the cluster breaks down, corosync seems to completely flood
the network. It doesn't recover by itself; I have to stop all nodes.
Things get worse if I start corosync on multiple nodes at once.
In this case it already breaks down at around 40 nodes.

Is it supposed to work with such a setup? Does it just need tuning?
My goal is to have a cluster of several hundred nodes.

Thanks,
Arne

Jan 04 10:24:49 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:25:00 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:616) was formed. Members joined: 168101089
Jan 04 10:25:00 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:25:11 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:620) was formed. Members joined: 168101090
Jan 04 10:25:11 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:25:21 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:624) was formed. Members joined: 168101091
Jan 04 10:25:21 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:25:32 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:628) was formed. Members joined: 168101092
Jan 04 10:25:32 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:26:09 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:632) was formed. Members joined: 168101141
Jan 04 10:26:10 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:26:20 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:636) was formed. Members joined: 168101142
Jan 04 10:26:20 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:26:31 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:640) was formed. Members joined: 168101143
Jan 04 10:26:31 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:26:42 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:644) was formed. Members joined: 168101144
Jan 04 10:26:42 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:26:52 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:648) was formed. Members joined: 168101145
Jan 04 10:26:52 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:27:03 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:652) was formed. Members joined: 168101146
Jan 04 10:27:03 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:27:14 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:656) was formed. Members joined: 168101147
Jan 04 10:27:14 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:27:25 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:660) was formed. Members joined: 168101148
Jan 04 10:27:25 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:27:35 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:664) was formed. Members joined: 168101161
Jan 04 10:27:35 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:27:46 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:668) was formed. Members joined: 168101162
Jan 04 10:27:46 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:27:57 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:672) was formed. Members joined: 168101163
Jan 04 10:27:57 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:28:07 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:676) was formed. Members joined: 168101164
Jan 04 10:28:07 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:28:18 [4915] reniar corosync notice  [TOTEM ] A new membership 
(10.5.4.101:680) was formed. Members joined: 168101165
Jan 04 10:28:18 [4915] reniar corosync notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Jan 04 10:28:28 [4915] r
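
The values after "Members joined:" in the log above are numeric node IDs. corosync.conf(5) describes that with IPv4, when nodeid is not set, the node ID is determined from the 32-bit address of ring 0. Assuming that applies here (the cluster was started with the default configuration), the IDs can be turned back into addresses. A small Python sketch follows; nodeid_to_ipv4 is a name made up for illustration.

```python
import ipaddress


def nodeid_to_ipv4(nodeid):
    """Interpret an auto-generated node ID as a dotted-quad IPv4 address.

    Assumption: the ID is the plain 32-bit value of the node's ring 0
    address, with no bits cleared or remapped.
    """
    return str(ipaddress.IPv4Address(nodeid))


if __name__ == "__main__":
    # Three IDs taken from the log excerpt above.
    for nodeid in (168101089, 168101141, 168101165):
        print(nodeid, "->", nodeid_to_ipv4(nodeid))
```

Under that assumption, 168101089 maps to 10.5.4.225 and 168101141 to 10.5.5.21, which fits the 10.5.4.101 address shown in the membership lines.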