My rule number 1 is to put all DNS entries in /etc/hosts or to use dnsmasq for
local DNS caching.
Rule number 2: add the cluster nodes as ntp/chrony peers (with 'prefer' for the
NTP servers) to avoid node drift if the time source is down for a long time.
Should the cluster take care of unstable infrastructure? Not unless you
explicitly ask for that (and being capable of it).
Best Regards,
Strahil Nikolov
 
 
On Thu, Aug 4, 2022 at 9:38, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
Hi!

FYI, here is a copy of what I sent to SUSE support (stating "Because of the
very same DNS resolution problem, stopping also failed"; should a temporary DNS
resolution problem cause a resource stop to fail and cause node fencing in turn?
I don't think so!):
---
The problem is the Perl code that most likely was never tested to handle a
failure of or in ld_gethostservbyname():
FIRST it should be checked whether a value was returned at all; if not there
is a failure in resolution.
In turn, a failure in resolution could mean two things:
1) The names in the configuration are not correct and will never resolve.
2) A temporary failure of some kind caused a failure and the configuration IS
CORRECT.

Clearly the bad case here was 2).

Also looking at the code I wonder why it does not handle things like this:
                $ip_port=&ld_gethostservbyname($ip_port, $vsrv->{protocol}, $af);
                if ($ip_port) {
                    if ($ip_port =~ /^(.+):([^:]+)$/) { # replacing the split
                        ($vsrv->{server}, $vsrv->{port}) = ($1, $2);
                        # this should also handle the case
                        # "$ip_port =~ /(\[[0-9A-Fa-f:]+\]):(\d+)/"
                    } else {
                        # error "unexpected return from ld_gethostservbyname"
                    }
                } else {
                    # error "failed to resolve ..."
                    # here it's unfortunate that the original $ip_port is lost,
                    # so it cannot be part of the error message
                }
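
To try the "split at the last colon, then check the parts" idea in isolation, here is a minimal, self-contained sketch. The helper name split_host_port and the sample inputs are hypothetical and not part of ldirectord; the port range check and the requirement that an address containing colons be a bracketed IPv6 literal are my assumptions about what "looks OK" should mean.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of ldirectord): split "host:port" at the
# LAST colon, then validate both parts. Returns (host, port) on success
# and an empty list on any failure.
sub split_host_port {
    my ($ip_port) = @_;

    # The resolver returned nothing at all: treat as a resolution failure.
    return unless defined $ip_port && length $ip_port;

    # Greedy (.+) makes the match split at the last colon.
    return unless $ip_port =~ /^(.+):([^:]+)$/;
    my ($host, $port) = ($1, $2);

    # Assumption: the port is numeric and within 1..65535.
    return unless $port =~ /^\d+$/ && $port >= 1 && $port <= 65535;

    # Assumption: a host containing colons must be a bracketed IPv6 literal.
    return if $host =~ /:/ && $host !~ /^\[[0-9A-Fa-f:.]+\]$/;

    return ($host, $port);
}

# Sample inputs, including the undefined value seen in the logs.
for my $in ('192.0.2.10:25', '[2001:db8::1]:25', undef, 'mailhost') {
    my @r = split_host_port($in);
    printf "%-20s => %s\n", $in // '(undef)',
        @r ? "host=$r[0] port=$r[1]" : 'rejected';
}
```

The point of returning an empty list is that the caller can tell "nothing usable came back" apart from a parsed result without ever running a pattern match on an undefined value.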

Apart from that, the critical part was that the "stop" operation SEEMED to
have failed, causing fencing.
Regardless of whether the names can be resolved, ldirectord should be able to
stop!
---
Opinions?

Regards,
Ulrich

>>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote on 03.08.2022 at
11:13 in message <62ea3c2c020000a10004c...@gwsmtp.uni-regensburg.de>:
> Hi!
> 
> I wanted to inform you of an unpleasant bug in ldirectord of SLES12 SP5:
> We had a short network problem while some redundancy paths reconfigured in
> the infrastructure, with the effect that some network services could not
> be reached.
> Unfortunately ldirectord controlled by the cluster reported a failure (the 
> director, not the services being directed to):
> 
> h11 crmd[28930]:  notice: h11-prm_lvs_mail_monitor_300000:369 [ Use of
> uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord
> line 1830, <CFGFILE> line 21. Error [33159] reading file
> /etc/ldirectord/mail.conf at line 10: invalid address for virtual service\n ]
> h11 ldirectord[33266]: Exiting with exit_status 2: config_error:
> Configuration Error
> 
> You can guess what happened:
> Pacemaker tried to recover (stop, then start), but the stop failed, too:
> h11 lrmd[28927]:  notice: prm_lvs_mail_stop_0:35047:stderr [ Use of
> uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord
> line 1830, <CFGFILE> line 21. ]
> h11 lrmd[28927]:  notice: prm_lvs_mail_stop_0:35047:stderr [ Error [36293]
> reading file /etc/ldirectord/mail.conf at line 10: invalid address for
> virtual service ]
> h11 crmd[28930]:  notice: Result of stop operation for prm_lvs_mail on h11:
> 1 (unknown error)
> 
> A stop failure meant that the node was fenced, interrupting all the other 
> services.
> 
> Examining the logs I also found this interesting type of error:
> h11 attrd[28928]:  notice: Cannot update
> fail-count-prm_lvs_rksapds5#monitor_300000[monitor]=(null) because peer
> UUID not known (will retry if learned)
> 
> Finally, here's the code that caused the error:
> 
> sub _ld_read_config_virtual_resolve
> {
>        my($line, $vsrv, $ip_port, $af)=(@_);
> 
>        if($ip_port){
>                $ip_port=&ld_gethostservbyname($ip_port, $vsrv->{protocol}, $af);
>                if ($ip_port =~ /(\[[0-9A-Fa-f:]+\]):(\d+)/) {
>                        $vsrv->{server} = $1;
>                        $vsrv->{port} = $2;
>                } elsif($ip_port){
>                        ($vsrv->{server}, $vsrv->{port}) = split /:/, $ip_port;
>                }
>                else {
>                        &config_error($line,
>                                "invalid address for virtual service");
>                }
> ...
> 
> The value returned by ld_gethostservbyname is undefined. I also wonder what
> the program logic is:
> If the host looks like a hex address in square brackets, host and port are
> split at the colon; otherwise host and port are split at the colon.
> Why not simply split at the last colon if the value is defined, AND THEN
> check if the components look OK?
> 
> So the "invalid address for virtual service" is only invalid when the 
> resolver service (e.g. via LDAP) is unavailable.
> I used host and service names for readability.
> 
> (I reported the issue to SLES support)
> 
> Regards,
> Ulrich
> 
> 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 


