[ClusterLabs] node1 and node2 communication time question

2022-08-09 Thread 권오성
Thank you for your reply.
Then, can I adjust that time by changing the token setting in
/etc/corosync/corosync.conf?
Also, the site I found explained how to disable fencing.
If so, could you point me to a site or blog that explains how to set things up
with fencing enabled?
I am a college student studying HA.
I have only just learned the concept of HA, and I don't know how to set it up
or which options to change.
I am using a translator because I am not good at English, and I cannot work out
how to apply the ClusterLabs documentation.
Please take a look.
Thank you.


[ClusterLabs] node1 and node2 communication time question

2022-08-09 Thread 권오성
Thank you for your reply.
Then, could you explain how to enable and configure stonith?
Or let me know of a blog or site that you know.
I followed a site I found when doing my setup, and almost all the sites I
came across explain it with the same settings I used.


Re: [ClusterLabs] node1 and node2 communication time question

2022-08-09 Thread Ken Gaillot
On Tue, 2022-08-09 at 15:23 +0900, 권오성 wrote:
> Hello.
> I installed linux ha on raspberry pi as below.
> 1) sudo apt-get install pacemaker pcs fence-agents resource-agents
> 2) Host Settings
> 3) sudo reboot
> 4) sudo passwd hacluster
> 5) sudo systemctl enable pcsd, sudo systemctl start pcsd, sudo systemctl enable pacemaker
> 6) sudo pcs cluster destroy
> 7) sudo pcs cluster auth <node1> <node2> -u hacluster -p <password for hacluster>
> 8) sudo pcs cluster setup --name <cluster name> <node1> <node2>
> 9) sudo pcs cluster start --all, sudo pcs cluster enable --all
> 10) sudo pcs property set stonith-enabled=false
> 11) sudo pcs status
> 12) sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=<virtual IP address> cidr_netmask=24 op monitor interval=30s
> 
> So, I've set it up this way.
> By the way, is it correct that node1 and node2 communicate every 30
> seconds and node2 will notice after 30 seconds when node1 dies?
> Or do we communicate every few seconds?
> And can node1 and node2 reduce communication time?
> What I want is node1 and node2 to communicate every 10 ms and switch
> as fast as possible.
> Please answer.
> Thank you.

Unfortunately 10ms is not a realistic goal with the current software.

Node loss is detected by Corosync, which passes a token around all
nodes continuously. The token timeout is defined in
/etc/corosync/corosync.conf and defaults to either 1 or 3 seconds. With
2 nodes and a dedicated network for corosync traffic you can probably
get sub-second detection, but I'm not sure what the practical limit is.
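
For example, a totem section along these lines would set a 500 ms token
timeout (just a sketch; the cluster name and any other settings have to
match your actual setup, and the file must be identical on all nodes):

totem {
    version: 2
    cluster_name: mycluster   # whatever name was given to "pcs cluster setup"
    token: 500                # token timeout in milliseconds
}

After changing it, corosync has to pick the file up, e.g. with
"pcs cluster sync" followed by a cluster restart (the exact commands depend
on your pcs and corosync versions).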

Once node loss is detected, most of the switchover time goes into
fencing (which should always be configured, otherwise you risk data
loss or service malfunctions) and into the stop/start time of your
individual resources.
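
As a rough illustration only (the right fence agent and its parameters
depend entirely on your hardware, and all the values below are placeholders),
setting up a fence device with pcs could look something like:

  sudo pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=node1 \
      ip=<BMC address of node1> username=<BMC user> password=<BMC password>
  sudo pcs property set stonith-enabled=true

On boards without a management controller, such as a Raspberry Pi, you would
need a different fence device, for example a switched PDU / smart plug agent
or watchdog-based sbd.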

Resource loss is detected by recurring monitors. That's where the
interval=30s comes in; the cluster will check the resource's status
that often. You can reduce that; I would say 5s or 10s would be fine,
and even below that could be OK. The cluster has to run the scheduler,
invoke the resource agent, and record the result if changed.
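
For example, assuming the IPaddr2 resource created above is named VirtualIP,
something like this should lower the interval to 10 seconds (pcs syntax
differs slightly between versions):

  sudo pcs resource update VirtualIP op monitor interval=10s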

When resource loss is detected, the stop/start time of the resource is
the main factor.
-- 
Ken Gaillot 



[ClusterLabs] Antw: Heads up for ldirectord in SLES12 SP5 "Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830"

2022-08-09 Thread Ulrich Windl
Hi!

Digging further into ldirectord, I found that the utility functions do not
distinguish between a name that is not known and a name that is (probably)
known but cannot be resolved at the moment.

I hacked the corresponding functions to observe and return the error code 
(errno) as a negative number.
Short demo:

  DB<2> x ld_gethostbyname('x',AF_INET)
0  '-2' # ENOENT
  DB<3> x ld_gethostbyname('localhost',AF_INET)
0  '127.0.0.1'
  DB<4> x ld_gethostbyname('localhost',AF_INET6)
0  '[::1]'
### hacked /etc/resolv.conf to make the nameservers unreachable (added IPs that
are not nameservers or don't exist), but the host exists
  DB<7> x ld_gethostbyname('mail-1',AF_INET)
0  '-3' # ESRCH

Returning the error message string is a bit trickier, so I just used the error 
code.

However, it's not clear what to do when the resolver fails (i.e. the name
would be known if the resolver worked). In any case it takes quite a while
until an error result is returned.

For example (using the hacked functions):
if (($fallback->{port} = _getservbyname($fallback->{port}, $protocol)) =~ /^-/) {
        _error($line, "invalid port for fallback server");
}

One could check for '-2' specifically instead, but in the other case there is
still no valid port value.
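
As a minimal sketch only (ld_getservbyname and _error stand in for whatever
the real helpers are called), one option would be to treat '-2' (name really
unknown) as a configuration error, and any other negative value (resolver
failure) as a transient problem that keeps the old value:

my $port = ld_getservbyname($fallback->{port}, $protocol);
if ($port eq '-2') {
        # name definitely unknown: report a configuration error
        _error($line, "invalid port for fallback server");
} elsif ($port =~ /^-/) {
        # temporary resolver failure: leave $fallback->{port} unchanged
} else {
        $fallback->{port} = $port;
}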

Ideas?

Regards,
Ulrich

>>> Ulrich Windl wrote on 08.08.2022 at 11:19 in message <62F0D518.3F8 : 161 : 60728>:
> Hi!
> 
> The bug is still under investigation, but digging in the ldirectord code I 
> found this part called when stopping:
> 
> } elsif ($CMD eq "stop") {
> kill 15, $oldpid;
> ld_exit(0, "Exiting from ldirectord $CMD");
> 
> As ldirectord uses a SIGTERM handler that only sets a flag, with the actual
> termination code running at some later time, doesn't that mean the cluster
> will see a bad exit code (success while parts of ldirectord are still
> running)?
> 
> Regards,
> Ulrich
> 
> 
> 
> >>> Ulrich Windl wrote on 03.08.2022 at 11:13 in message <62EA3C2C.E8D : 161 : 60728>:
> > Hi!
> > 
> > I wanted to inform you of an unpleasant bug in ldirectord of SLES12 SP5:
> > We had a short network problem while some redundancy paths in the
> > infrastructure were being reconfigured, with the effect that some network
> > services could not be reached.
> > Unfortunately the ldirectord controlled by the cluster reported a failure
> > (the director itself, not the services being directed to):
> > 
> > h11 crmd[28930]:   notice: h11-prm_lvs_mail_monitor_30:369 [ Use of
> > uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord
> > line 1830,  line 21. Error [33159] reading file
> > /etc/ldirectord/mail.conf at line 10: invalid address for virtual service\n ]
> > h11 ldirectord[33266]: Exiting with exit_status 2: config_error: 
> > Configuration Error
> > 
> > You can guess what happened:
> > Pacemaker tried to recover (stop, then start), but the stop failed, too:
> > h11 lrmd[28927]:   notice: prm_lvs_mail_stop_0:35047:stderr [ Use of
> > uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord
> > line 1830,  line 21. ]
> > h11 lrmd[28927]:   notice: prm_lvs_mail_stop_0:35047:stderr [ Error [36293]
> > reading file /etc/ldirectord/mail.conf at line 10: invalid address for
> > virtual service ]
> > h11 crmd[28930]:   notice: Result of stop operation for prm_lvs_mail on h11:
> > 1 (unknown error)
> > 
> > A stop failure meant that the node was fenced, interrupting all the other 
> > services.
> > 
> > Examining the logs I also found this interesting type of error:
> > h11 attrd[28928]:   notice: Cannot update
> > fail-count-prm_lvs_rksapds5#monitor_30[monitor]=(null) because peer UUID
> > not known (will retry if learned)
> > 
> > Eventually, here's the code that caused the error:
> > 
> > sub _ld_read_config_virtual_resolve
> > {
> >         my($line, $vsrv, $ip_port, $af)=(@_);
> > 
> >         if($ip_port){
> >                 $ip_port=ld_gethostservbyname($ip_port, $vsrv->{protocol}, $af);
> >                 if ($ip_port =~ /(\[[0-9A-Fa-f:]+\]):(\d+)/) {
> >                         $vsrv->{server} = $1;
> >                         $vsrv->{port} = $2;
> >                 } elsif($ip_port){
> >                         ($vsrv->{server}, $vsrv->{port}) = split /:/, $ip_port;
> >                 }
> >                 else {
> >                         _error($line,
> >                                 "invalid address for virtual service");
> >                 }
> > ...
> > 
> > The value returned by ld_gethostservbyname is undefined. I also wonder what
> > the program logic is:
> > If the host looks like a hex address in square brackets, host and port are
> > split at the colon; otherwise host and port are split at the colon.
> > Why not simply split at the last colon if the value is defined, AND THEN
> > check whether the components look OK?
> > 
> > So the "invalid address 

Re: [ClusterLabs] node1 and node2 communication time question

2022-08-09 Thread Tomas Jelinek

Hi,

It seems that you are using pcs 0.9.x. That is an old and unmaintained 
version. I really recommend updating it.


I can see that you disabled stonith. This is really a bad practice. A 
cluster cannot and will not function properly without working stonith.


What makes you think the nodes are communicating only every 30 seconds and 
not more often? Setting 'monitor interval=30s' certainly does no such thing.


Regards,
Tomas


On 09. 08. 22 at 8:23, 권오성 wrote:

Hello.
I installed linux ha on raspberry pi as below.
1) sudo apt-get install pacemaker pcs fence-agents resource-agents
2) Host Settings
3) sudo reboot
4) sudo passwd hacluster
5) sudo systemctl enable pcsd, sudo systemctl start pcsd, sudo systemctl enable pacemaker
6) sudo pcs cluster destroy
7) sudo pcs cluster auth <node1> <node2> -u hacluster -p <password for hacluster>
8) sudo pcs cluster setup --name <cluster name> <node1> <node2>
9) sudo pcs cluster start --all, sudo pcs cluster enable --all
10) sudo pcs property set stonith-enabled=false
11) sudo pcs status
12) sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=<virtual IP address> cidr_netmask=24 op monitor interval=30s


So, I've set it up this way.
By the way, is it correct that node1 and node2 communicate every 30 
seconds and node2 will notice after 30 seconds when node1 dies?

Or do we communicate every few seconds?
And can node1 and node2 reduce communication time?
What I want is node1 and node2 to communicate every 10 ms and switch as 
fast as possible.

Please answer.
Thank you.



[ClusterLabs] node1 and node2 communication time question

2022-08-09 Thread 권오성
Hello.
I installed linux ha on raspberry pi as below.
1) sudo apt-get install pacemaker pcs fence-agents resource-agents
2) Host Settings
3) sudo reboot
4) sudo passwd hacluster
5) sudo systemctl enable pcsd, sudo systemctl start pcsd, sudo systemctl enable pacemaker
6) sudo pcs cluster destroy
7) sudo pcs cluster auth <node1> <node2> -u hacluster -p <password for hacluster>
8) sudo pcs cluster setup --name <cluster name> <node1> <node2>
9) sudo pcs cluster start --all, sudo pcs cluster enable --all
10) sudo pcs property set stonith-enabled=false
11) sudo pcs status
12) sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=<virtual IP address> cidr_netmask=24 op monitor interval=30s

So, I've set it up this way.
By the way, is it correct that node1 and node2 communicate every 30 seconds
and node2 will notice after 30 seconds when node1 dies?
Or do we communicate every few seconds?
And can node1 and node2 reduce communication time?
What I want is node1 and node2 to communicate every 10 ms and switch as
fast as possible.
Please answer.
Thank you.