Re: [ClusterLabs] Corosync 3.1.5 Fails to Autostart

2023-05-25 Thread Tyler Phillippe via Users
Hey all,

Finally got some time to check this on our servers and build up a separate test 
cluster and I found the issue, no debugging required. Seems we are/were using 
IP addresses instead of names in the corosync.conf. I replicated that with the 
separate test cluster and noticed the exact same behaviour. Thanks for all the 
support!! I really appreciate it!

Respectfully,
 Tyler Phillippe



Apr 25, 2023, 3:23 AM by jfrie...@redhat.com:

> On 24/04/2023 22:16, Tyler Phillippe via Users wrote:
>
>> Hello all,
>>
>> We are currently using RHEL9 and have set up a PCS cluster. When restarting 
>> the servers, we noticed Corosync 3.1.5 doesn't start properly with the below 
>> error message:
>>
>> Parse error in config: No valid name found for local host
>> Corosync Cluster Engine exiting with status 8 at main.c:1445.
>> Corosync.service: Main process exited, code=exited, status=8/n/a
>>
>> These are physical, blade machines that are using a 2x Fibre Channel NIC in 
>> a Mode 6 bond as their networking interface for the cluster; other than 
>> that, there is really nothing special about these machines. We have ensured 
>> the names of the machines exist in /etc/hosts and that they can resolve 
>> those names via the hosts file first. The strange
>>
>
> This is really weird. All the described symptoms simply point to the name 
> service (DNS/NIS/...) not being available during bootup and only becoming 
> available later. But if /etc/hosts really contains static entries, it should 
> just work.
>
> Could you please try to set debug: trace in corosync.conf like
> ```
> ...
> logging {
>  to_syslog: yes
>  to_stderr: yes
>  timestamp: on
>  to_logfile: yes
>  logfile: /var/log/cluster/corosync.log
>
>  debug: trace
> }
> ...
> ```
>
> and observe the very beginning of the corosync output (either in syslog or in 
> /var/log/cluster/corosync.log)? There should be something like
>
> totemip_parse: IPv4 address of NAME resolved as IPADDR
>
> Also compare the difference between corosync started on boot and later after 
> multi-user.target.
>
>> thing is if we start Corosync manually after we can SSH into the machines, 
>> Corosync starts immediately and without issue. We did manage to get Corosync 
>> to autostart properly by modifying the service file and changing the 
>> After=network-online.target to After=multi-user.target. In doing this, at 
>> first, Pacemaker complains about mismatching dependencies in the service 
>> between Corosync and Pacemaker. Changing the Pacemaker service to 
>> After=multi-user.target fixes that self-caused issue. Any ideas on this one? 
>> Mostly checking to see if changing the After dependency will harm us in the 
>> future.
>
> That's questionable. It's always best if the resolver uses /etc/hosts 
> reliably, which is not the case now, so IMHO it is better to find the reason 
> why /etc/hosts doesn't work rather than to work around it.
>
> Regards,
>  Honza
>
>>
>> Thanks!
>>
>> Respectfully,
>>   Tyler Phillippe
>>

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Corosync 3.1.5 Fails to Autostart

2023-04-25 Thread Jan Friesse

On 24/04/2023 22:16, Tyler Phillippe via Users wrote:

> Hello all,
>
> We are currently using RHEL9 and have set up a PCS cluster. When restarting
> the servers, we noticed Corosync 3.1.5 doesn't start properly with the below
> error message:
>
> Parse error in config: No valid name found for local host
> Corosync Cluster Engine exiting with status 8 at main.c:1445.
> Corosync.service: Main process exited, code=exited, status=8/n/a
>
> These are physical, blade machines that are using a 2x Fibre Channel NIC in
> a Mode 6 bond as their networking interface for the cluster; other than
> that, there is really nothing special about these machines. We have ensured
> the names of the machines exist in /etc/hosts and that they can resolve
> those names via the hosts file first. The strange


This is really weird. All the described symptoms simply point to the name 
service (DNS/NIS/...) not being available during bootup and only becoming 
available later. But if /etc/hosts really contains static entries, it 
should just work.


Could you please try to set debug: trace in corosync.conf like
```
...
logging {
    to_syslog: yes
    to_stderr: yes
    timestamp: on
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log

    debug: trace
}
...
```

and observe the very beginning of the corosync output (either in syslog or in 
/var/log/cluster/corosync.log)? There should be something like


totemip_parse: IPv4 address of NAME resolved as IPADDR

Also compare the difference between corosync started on boot and later 
after multi-user.target.
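
One way to make that comparison less tedious is sketched below in Python. It is only an illustration, not something from corosync itself: it assumes the trace output contains the message text quoted above, and it only matches the IPv4 form. The function names and the sample log lines are hypothetical; real log lines will carry their own timestamps and prefixes, which the pattern ignores.

```python
import re

# Matches only the message text quoted in this thread; any prefix
# (timestamp, hostname, PID) in front of it is ignored.
PATTERN = re.compile(
    r"totemip_parse: IPv4 address of (\S+) resolved as (\S+)"
)


def resolved_names(log_text):
    """Map each resolved NAME to the IPADDR reported in the log text."""
    return {name: addr for name, addr in PATTERN.findall(log_text)}


def compare_runs(boot_log, manual_log):
    """Return names whose resolution differs between the two runs.

    The value is a (boot, manual) tuple; None means the name was
    never reported as resolved in that run.
    """
    boot = resolved_names(boot_log)
    manual = resolved_names(manual_log)
    return {
        name: (boot.get(name), manual.get(name))
        for name in sorted(set(boot) | set(manual))
        if boot.get(name) != manual.get(name)
    }


if __name__ == "__main__":
    # Hypothetical excerpts: one from the failed start at boot,
    # one from a manual start after login.
    boot_log = "Parse error in config: No valid name found for local host\n"
    manual_log = (
        "totemip_parse: IPv4 address of node1 resolved as 192.0.2.11\n"
    )
    print(compare_runs(boot_log, manual_log))
```

A name that shows up only in the manual run would suggest it could not be resolved at boot time.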


> thing is if we start Corosync manually after we can SSH into the 
> machines, Corosync starts immediately and without issue. We did manage 
> to get Corosync to autostart properly by modifying the service file and 
> changing the After=network-online.target to After=multi-user.target. In 
> doing this, at first, Pacemaker complains about mismatching dependencies 
> in the service between Corosync and Pacemaker. Changing the Pacemaker 
> service to After=multi-user.target fixes that self-caused issue. Any 
> ideas on this one? Mostly checking to see if changing the After 
> dependency will harm us in the future.


That's questionable. It's always best if the resolver uses /etc/hosts 
reliably, which is not the case now, so IMHO it is better to find the reason 
why /etc/hosts doesn't work rather than to work around it.


Regards,
  Honza



> Thanks!
>
> Respectfully,
>   Tyler Phillippe




Re: [ClusterLabs] Corosync 3.1.5 Fails to Autostart

2023-04-24 Thread Ken Gaillot
Hi,

With Corosync 3, node names must be specified in
/etc/corosync/corosync.conf like:

node {
    ring0_addr: node1
    name: node1
    nodeid: 1
}

(ring0_addr is a resolvable name used to identify the interface, and
name is the name that should be used in the cluster)
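
As a rough illustration of how one might audit an existing corosync.conf for this, here is a small Python sketch. It is not part of corosync or pcs; the function names are made up for the example, the parser only understands simple `node { ... }` blocks like the one above, and the sample addresses are placeholders. An IP address is a legal value for ring0_addr, so the second finding is a hint to review the entry rather than an error.

```python
import ipaddress
import re


def parse_nodes(conf_text):
    """Return one dict of key/value pairs per `node { ... }` block."""
    nodes = []
    # `nodelist {` is not matched because "node" must be followed by "{".
    for body in re.findall(r"\bnode\s*\{([^{}]*)\}", conf_text):
        entry = {}
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments
            if ":" in line:
                key, value = line.split(":", 1)
                entry[key.strip()] = value.strip()
        nodes.append(entry)
    return nodes


def is_ip_literal(value):
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def check_nodes(conf_text):
    """List findings: nodes without `name`, or with an IP in ring0_addr."""
    findings = []
    for node in parse_nodes(conf_text):
        label = node.get("nodeid", "?")
        if "name" not in node:
            findings.append(f"node {label}: no name set")
        addr = node.get("ring0_addr", "")
        if is_ip_literal(addr):
            findings.append(
                f"node {label}: ring0_addr {addr} is an IP literal"
            )
    return findings


if __name__ == "__main__":
    # Assumed default path; adjust if your distribution differs.
    with open("/etc/corosync/corosync.conf") as conf:
        for finding in check_nodes(conf.read()):
            print(finding)
```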

If you set up the cluster from scratch using pcs, it should do that for
you. I'm guessing you reused an older config, or manually set up
corosync.conf.

It shouldn't be necessary to change the After. If it still is an issue
after fixing the config, you might have some unusual dependency like a 
disk that gets mounted later, in which case it would be better to add
an After for the specific dependency.

On Mon, 2023-04-24 at 22:16 +0200, Tyler Phillippe via Users wrote:
> Hello all,
> 
> We are currently using RHEL9 and have set up a PCS cluster. When
> restarting the servers, we noticed Corosync 3.1.5 doesn't start
> properly with the below error message:
> 
> Parse error in config: No valid name found for local host
> Corosync Cluster Engine exiting with status 8 at main.c:1445.
> Corosync.service: Main process exited, code=exited, status=8/n/a
> 
> These are physical, blade machines that are using a 2x Fibre Channel
> NIC in a Mode 6 bond as their networking interface for the cluster;
> other than that, there is really nothing special about these
> machines. We have ensured the names of the machines exist in
> /etc/hosts and that they can resolve those names via the hosts file
> first. The strange thing is if we start Corosync manually after we
> can SSH into the machines, Corosync starts immediately and without
> issue. We did manage to get Corosync to autostart properly by
> modifying the service file and changing the
> After=network-online.target to After=multi-user.target. In doing this, at first,
> Pacemaker complains about mismatching dependencies in the service
> between Corosync and Pacemaker. Changing the Pacemaker service to
> After=multi-user.target fixes that self-caused issue. Any ideas on
> this one? Mostly checking to see if changing the After dependency
> will harm us in the future.
> 
> Thanks!
> 
> Respectfully,
>  Tyler Phillippe
-- 
Ken Gaillot 



[ClusterLabs] Corosync 3.1.5 Fails to Autostart

2023-04-24 Thread Tyler Phillippe via Users
Hello all,

We are currently using RHEL9 and have set up a PCS cluster. When restarting the 
servers, we noticed Corosync 3.1.5 doesn't start properly with the below error 
message:

Parse error in config: No valid name found for local host
Corosync Cluster Engine exiting with status 8 at main.c:1445.
Corosync.service: Main process exited, code=exited, status=8/n/a

These are physical, blade machines that are using a 2x Fibre Channel NIC in a 
Mode 6 bond as their networking interface for the cluster; other than that, 
there is really nothing special about these machines. We have ensured the names 
of the machines exist in /etc/hosts and that they can resolve those names via 
the hosts file first. The strange thing is if we start Corosync manually after 
we can SSH into the machines, Corosync starts immediately and without issue. We 
did manage to get Corosync to autostart properly by modifying the service file 
and changing the After=network-online.target to After=multi-user.target. In 
doing this, at first, Pacemaker complains about mismatching dependencies in the 
service between Corosync and Pacemaker. Changing the Pacemaker service to 
After=multi-user.target fixes that self-caused issue. Any ideas on this one? 
Mostly checking to see if changing the After dependency will harm us in the 
future.

Thanks!

Respectfully,
 Tyler Phillippe