[ClusterLabs] node1 and node2 communication time question
Thank you for your reply. So, can I adjust the detection time by changing the token value in /etc/corosync/corosync.conf? Also, the sites I found while searching all explain how to disable fencing. Could you point me to a site or blog that explains how to set things up with fencing enabled? I am a college student studying HA. I have only just learned the concept of HA, and I don't yet know how to configure it or which options to change. I am using a translator because my English is not good, and I could not understand how to apply the ClusterLabs documentation. Please take a look. Thank you.
___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] node1 and node2 communication time question
Thank you for your reply. Then, could you explain how to enable and configure stonith? Or let me know a blog or site you recommend. I followed the setup from a site I found, and almost all the sites explain it with the same settings I already used.
Re: [ClusterLabs] node1 and node2 communication time question
On Tue, 2022-08-09 at 15:23 +0900, 권오성 wrote:
> Hello.
> I installed Linux HA on Raspberry Pi as below.
> 1) sudo apt-get install pacemaker pcs fence-agents resource-agents
> 2) Host settings
> 3) sudo reboot
> 4) sudo passwd hacluster
> 5) sudo systemctl enable pcsd, sudo systemctl start pcsd, sudo systemctl enable pacemaker
> 6) sudo pcs cluster destroy
> 7) sudo pcs cluster auth -u hacluster -p <password for hacluster>
> 8) sudo pcs cluster setup --name
> 9) sudo pcs cluster start --all, sudo pcs cluster enable --all
> 10) sudo pcs property set stonith-enabled=false
> 11) sudo pcs status
> 12) sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip= cidr_netmask=24 op monitor interval=30s
>
> So, I've set it up this way.
> By the way, is it correct that node1 and node2 communicate every 30 seconds, and that node2 will notice 30 seconds after node1 dies?
> Or do they communicate every few seconds?
> And can the communication time between node1 and node2 be reduced?
> What I want is for node1 and node2 to communicate every 10 ms and fail over as fast as possible.
> Please answer.
> Thank you.

Unfortunately, 10 ms is not a realistic goal with the current software.

Node loss is detected by Corosync, which passes a token around all nodes continuously. The token timeout is defined in /etc/corosync/corosync.conf and defaults to either 1 or 3 seconds. With 2 nodes and a dedicated network for Corosync traffic you can probably get below a second, but I'm not sure what the practical limit is.

Once node loss is detected, most of the switchover time is spent in fencing (which should always be configured, otherwise you risk data loss or service malfunctions) and in the stop/start time of your individual resources.

Resource loss is detected by recurring monitors. That's where the interval=30s comes in; the cluster will check the resource's status that often. You can reduce that; I would say 5 or 10 s would be fine, and even below that could be OK.
The cluster has to run the scheduler, invoke the resource agent, and record the result if it changed. When resource loss is detected, the stop/start time of the resource is the main factor.
-- 
Ken Gaillot
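The token timeout mentioned above lives in the totem section of /etc/corosync/corosync.conf. A minimal sketch of what lowering it might look like (the cluster name and the 1000 ms value are illustrative assumptions, not recommendations; consult the corosync.conf(5) man page before changing timeouts):

```
totem {
    version: 2
    cluster_name: mycluster    # hypothetical name
    # Token timeout in milliseconds. A node is declared lost roughly
    # this long (plus consensus time) after it stops circulating the token.
    token: 1000
}
```

After editing, the new value has to be made effective on all nodes, e.g. with corosync-cfgtool -R on recent Corosync versions, or by restarting the cluster.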
[ClusterLabs] Antw: Heads up for ldirectord in SLES12 SP5 "Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830"
Hi!

Digging further in ldirectord, I found that the utility functions do not make a difference between a name that is not known and a name that is (probably) known but cannot be resolved at the moment. I hacked the corresponding functions to observe and return the error code (errno) as a negative number. Short demo:

  DB<2> x ld_gethostbyname('x',AF_INET)
0  '-2'   # ENOENT
  DB<3> x ld_gethostbyname('localhost',AF_INET)
0  '127.0.0.1'
  DB<4> x ld_gethostbyname('localhost',AF_INET6)
0  '[::1]'
### hacked /etc/resolv.conf to make nameservers unreachable (add IPs that are no nameservers or don't exist), but the host exists
  DB<7> x ld_gethostbyname('mail-1',AF_INET)
0  '-3'   # ESRCH

Returning the error message string is a bit trickier, so I just used the error code. However, it's not clear what to do when the resolver fails (i.e. the name would be known if the resolver worked). In any case it takes quite a while until an error result is returned. For example (using the hacked functions):

if (($fallback->{port} = _getservbyname($fallback->{port}, $protocol)) =~ /^-/) {
    _error($line, "invalid port for fallback server");
}

One could check for "== '-2'" instead, but in the other case there is still no valid port value. Ideas?

Regards,
Ulrich

>>> Ulrich Windl wrote on 08.08.2022 at 11:19 in message <62F0D518.3F8 : 161 : 60728>:
> Hi!
>
> The bug is still under investigation, but digging in the ldirectord code I found this part, called when stopping:
>
>     } elsif ($CMD eq "stop") {
>         kill 15, $oldpid;
>         ld_exit(0, "Exiting from ldirectord $CMD");
>
> ldirectord uses a SIGTERM handler that only sets a flag, and then (at some later time) the termination code is run.
> Doesn't that mean the cluster will see a bad exit code (success while parts of ldirectord are still running)?
>
> Regards,
> Ulrich
>
> >>> Ulrich Windl wrote on 03.08.2022 at 11:13 in message <62EA3C2C.E8D : 161 : 60728>:
> > Hi!
> > I wanted to inform you of an unpleasant bug in ldirectord of SLES12 SP5:
> > We had a short network problem while some redundancy paths were reconfigured in the infrastructure, effectively causing some network services to be unreachable.
> > Unfortunately ldirectord, controlled by the cluster, reported a failure (the director, not the services being directed to):
> >
> > h11 crmd[28930]: notice: h11-prm_lvs_mail_monitor_30:369 [ Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830, line 21. Error [33159] reading file /etc/ldirectord/mail.conf at line 10: invalid address for virtual service\n ]
> > h11 ldirectord[33266]: Exiting with exit_status 2: config_error: Configuration Error
> >
> > You can guess what happened:
> > Pacemaker tried to recover (stop, then start), but the stop failed, too:
> > h11 lrmd[28927]: notice: prm_lvs_mail_stop_0:35047:stderr [ Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830, line 21. ]
> > h11 lrmd[28927]: notice: prm_lvs_mail_stop_0:35047:stderr [ Error [36293] reading file /etc/ldirectord/mail.conf at line 10: invalid address for virtual service ]
> > h11 crmd[28930]: notice: Result of stop operation for prm_lvs_mail on h11: 1 (unknown error)
> >
> > A stop failure meant that the node was fenced, interrupting all the other services.
> > Examining the logs I also found this interesting type of error:
> > h11 attrd[28928]: notice: Cannot update fail-count-prm_lvs_rksapds5#monitor_30[monitor]=(null) because peer UUID not known (will retry if learned)
> >
> > Eventually, here's the code that caused the error:
> >
> > sub _ld_read_config_virtual_resolve
> > {
> >     my($line, $vsrv, $ip_port, $af)=(@_);
> >
> >     if($ip_port){
> >         $ip_port=_gethostservbyname($ip_port, $vsrv->{protocol}, $af);
> >         if ($ip_port =~ /(\[[0-9A-Fa-f:]+\]):(\d+)/) {
> >             $vsrv->{server} = $1;
> >             $vsrv->{port} = $2;
> >         } elsif($ip_port){
> >             ($vsrv->{server}, $vsrv->{port}) = split /:/, $ip_port;
> >         }
> >         else {
> >             _error($line, "invalid address for virtual service");
> >         }
> > ...
> >
> > The value returned by ld_gethostservbyname is undefined. I also wonder what the program logic is:
> > If the host looks like a hex address in square brackets, host and port are split at the colon; otherwise host and port are split at the colon.
> > Why not simply split at the last colon if the value is defined, AND THEN check whether the components look OK?
> >
> > So the "invalid address
Re: [ClusterLabs] node1 and node2 communication time question
Hi,

It seems that you are using pcs 0.9.x. That is an old and unmaintained version; I really recommend updating it.

I can see that you disabled stonith. This is really bad practice: the cluster cannot and will not function properly without working stonith.

What makes you think the nodes are communicating only every 30 seconds and not more often? Setting 'monitor interval=30s' certainly doesn't do such a thing.

Regards,
Tomas

On 09. 08. 22 at 8:23, 권오성 wrote:

Hello.
I installed Linux HA on Raspberry Pi as below.
1) sudo apt-get install pacemaker pcs fence-agents resource-agents
2) Host settings
3) sudo reboot
4) sudo passwd hacluster
5) sudo systemctl enable pcsd, sudo systemctl start pcsd, sudo systemctl enable pacemaker
6) sudo pcs cluster destroy
7) sudo pcs cluster auth -u hacluster -p <password for hacluster>
8) sudo pcs cluster setup --name
9) sudo pcs cluster start --all, sudo pcs cluster enable --all
10) sudo pcs property set stonith-enabled=false
11) sudo pcs status
12) sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip= cidr_netmask=24 op monitor interval=30s

So, I've set it up this way.
By the way, is it correct that node1 and node2 communicate every 30 seconds, and that node2 will notice 30 seconds after node1 dies?
Or do they communicate every few seconds?
And can the communication time between node1 and node2 be reduced?
What I want is for node1 and node2 to communicate every 10 ms and fail over as fast as possible.
Please answer.
Thank you.
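Since the advice here is to keep stonith enabled, this is the general shape of configuring it with pcs. This is only a sketch: fence_ipmilan and all parameter values below are placeholders, and the right agent and options depend entirely on your hardware (a Raspberry Pi has no IPMI, so a different agent or an external power switch would be needed):

```
# See which fence agents are installed and what parameters one takes
pcs stonith list
pcs stonith describe fence_ipmilan

# Create a fencing device for node1 (all values are placeholders)
pcs stonith create fence-node1 fence_ipmilan \
    ip=192.168.1.100 username=admin password=secret \
    pcmk_host_list=node1

# Re-enable fencing cluster-wide
pcs property set stonith-enabled=true
```

Note that the exact parameter names differ between fence agents and fence-agents versions; `pcs stonith describe <agent>` shows what your installed version accepts.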
[ClusterLabs] node1 and node2 communication time question
Hello.
I installed Linux HA on Raspberry Pi as below.
1) sudo apt-get install pacemaker pcs fence-agents resource-agents
2) Host settings
3) sudo reboot
4) sudo passwd hacluster
5) sudo systemctl enable pcsd, sudo systemctl start pcsd, sudo systemctl enable pacemaker
6) sudo pcs cluster destroy
7) sudo pcs cluster auth -u hacluster -p
8) sudo pcs cluster setup --name
9) sudo pcs cluster start --all, sudo pcs cluster enable --all
10) sudo pcs property set stonith-enabled=false
11) sudo pcs status
12) sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip= cidr_netmask=24 op monitor interval=30s

So, I've set it up this way.
By the way, is it correct that node1 and node2 communicate every 30 seconds, and that node2 will notice 30 seconds after node1 dies?
Or do they communicate every few seconds?
And can the communication time between node1 and node2 be reduced?
What I want is for node1 and node2 to communicate every 10 ms and fail over as fast as possible.
Please answer.
Thank you.