[ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-10 Thread Vitaly Zolotusky
Hello everybody.
We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 2.99+. 
It looks like they are not compatible and we are getting messages like:
Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received from 
172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. Ignoring
on the upgraded node and 
Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet data
Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet has 
different crypto type. Rejecting
Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message has 
invalid digest... ignoring.
on the pre-upgrade node.

Is there a good way to do this upgrade? 
I would appreciate it very much if you could point me to any documentation or 
articles on this issue.
Thank you very much!
_Vitaly


Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-10 Thread Strahil Nikolov
What are your corosync.conf timeouts (especially token & consensus)?
The last time I did a live migration of a RHEL 7 node with the default values, the cluster fenced it, so I set the token to 10s and also raised the consensus (check 'man corosync.conf') above the default.
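
For illustration only, a totem section with relaxed timeouts could look roughly like the sketch below (the cluster name and the consensus value are placeholders; corosync.conf(5) documents that consensus defaults to 1.2 x token):

totem {
  version: 2
  cluster_name: mycluster   # placeholder name
  token: 10000              # 10s token timeout instead of the default
  consensus: 15000          # example value, kept above the 1.2 x token default
}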

Also, start your investigation from the virtualization layer, as a lot of backups run during the night. Last week I got a cluster node fenced because it failed to respond for 40s. Thankfully that was just a QA cluster, so it wasn't a big deal.

The most common reasons for a VM to fail to respond are:
- CPU starvation due to high CPU utilisation on the host
- I/O issues causing the VM to pause
- Lots of backups eating the bandwidth on any of the hypervisors or on a switch between them (if you have a single heartbeat network)

With RHEL 8, corosync allows using more than 2 heartbeat rings and new features like SCTP.
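
As a rough sketch (hypothetical addresses and node names; see corosync.conf(5) for the full syntax), a third link and an SCTP transport could be declared like this:

nodelist {
  node {
    ring0_addr: 192.168.1.11   # hypothetical addresses, one per link
    ring1_addr: 192.168.2.11
    ring2_addr: 192.168.3.11   # corosync 3 / knet supports up to 8 links per node
    name: node1
    nodeid: 1
  }
}

totem {
  interface {
    linknumber: 2
    knet_transport: sctp       # per-link transport; udp is the default
  }
}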

P.S.: You can use a second fencing mechanism like 'sbd', a.k.a. "poison pill"; just make the vmdk shared & independent. This way your cluster can operate even when vCenter is unreachable for any reason.
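
A rough sketch of preparing such a shared disk for sbd (the device path is a placeholder; the full setup - watchdog device, /etc/sysconfig/sbd, and the stonith side - is described in the sbd documentation for your distribution):

# initialize the sbd header on the shared, independent disk (destructive)
sbd -d /dev/disk/by-id/SHARED-DISK create
# verify the header and the per-node message slots
sbd -d /dev/disk/by-id/SHARED-DISK list
# then set SBD_DEVICE in /etc/sysconfig/sbd to the same path and enable the sbd service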

Best Regards,
Strahil Nikolov



Re: [ClusterLabs] New user needs some help stabilizing the cluster

2020-06-10 Thread Howard
Hi everyone.  As a follow-up, I found that the VMs were having snapshot backups at the time of the disconnects, which I think freezes I/O. We'll be addressing that.  Is there anything else in the log that can be improved?

Thanks,
Howard


[ClusterLabs] New user needs some help stabilizing the cluster

2020-06-10 Thread Howard
Good morning.  Thanks for reading.  We have a requirement to provide high
availability for PostgreSQL 10.  I have built a two node cluster with a
quorum device as the third vote, all running on RHEL 8.

Here are the versions installed:
[postgres@srv2 cluster]$ rpm -qa|grep
"pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
corosync-3.0.2-3.el8_1.1.x86_64
corosync-qdevice-3.0.0-2.el8.x86_64
corosync-qnetd-3.0.0-2.el8.x86_64
corosynclib-3.0.2-3.el8_1.1.x86_64
fence-agents-vmware-soap-4.2.1-41.el8.noarch
pacemaker-2.0.2-3.el8_1.2.x86_64
pacemaker-cli-2.0.2-3.el8_1.2.x86_64
pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-schemas-2.0.2-3.el8_1.2.noarch
pcs-0.10.2-4.el8.x86_64
resource-agents-paf-2.3.0-1.noarch

These are VMware VMs, so I configured the cluster to use the ESX host as the
fencing device using fence_vmware_soap.

Throughout each day things generally work very well.  The cluster remains
online and healthy. Unfortunately, when I check pcs status in the mornings,
I see that all kinds of things went wrong overnight.  It is hard to
pinpoint what the issue is because there is so much information being written
to the pacemaker.log, and I end up scrolling through pages and pages of
informational log entries trying to find the lines that pertain to the issue.
Is there a way to separate the logs out to make it easier to scroll through?
Or maybe a list of keywords to grep for?
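
For example, filtering on the severities and events that matter usually cuts the noise down a lot; something along these lines (the keyword list below is just a starting point, not an official set):

grep -E "error|warning|crit|fence|Quorum" /var/log/pacemaker/pacemaker.log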

It is clearly indicating that the server lost contact with the other node
and also the quorum device. Is there a way to make this configuration more
robust or able to recover from a connectivity blip?

Here are the pacemaker and corosync logs for this morning's failures:
pacemaker.log
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd
 [10573] (pcmk_quorum_notification)   warning: Quorum lost |
membership=952 members=1
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld
 [10579] (pcmk_quorum_notification)   warning: Quorum lost |
membership=952 members=1
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (pe_fence_node)  warning: Cluster node srv1
will be fenced: peer is no longer part of the cluster
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (determine_online_status)warning: Node
srv1 is unclean
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1
for STONITH
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (pcmk__log_transition_summary)   warning:
Calculated transition 2 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-34.bz2
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1
(op=join_offer)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (destroy_action) warning: Cancelling timer for action 3
(src=307)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (destroy_action) warning: Cancelling timer for action 2
(src=308)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (do_log) warning: Input I_RELEASE_DC received in state
S_RELEASE_DC from do_election_count_vote
/var/log/pacemaker/pacemaker.log:pgsqlms(pgsqld)[1164379]:  Jun 10
00:07:19  WARNING: No secondary connected to the master
/var/lo

Re: [ClusterLabs] Redundant Ring Network failure

2020-06-10 Thread ROHWEDER-NEUBECK, MICHAEL (EXTERN)
Hi,
yesterday we restarted all clusters and all rings were OK.
Now today there is one with a broken ring again.

ring 0 broken: 033

This is my config:

[root@lvm-nfscpdata-05ct::~]# less /etc/corosync/corosync.conf
totem {
  version: 2
  transport:   knet
  cluster_name:nfscpdata
  token:   2000
  token_retransmits_before_loss_const: 10
  max_messages:150
  window_size: 300
  crypto_cipher:   aes256
  crypto_hash: sha256
  interface {
ringnumber: 0
  }
  interface {
ringnumber: 1
  }
}

logging {
  fileline:off
  to_stderr:   yes
  to_logfile:  no
  to_syslog:   yes
  syslog_facility: daemon
  syslog_priority: info
  debug:   off
  timestamp:   on
  logger_subsys {
subsys: QUORUM
debug:  off
  }
}

quorum {
  # Enable and configure quorum subsystem (default: off)
  # see also corosync.conf.5 and votequorum.5
  provider: corosync_votequorum
}

nodelist {
  node {
ring0_addr: 10.28.63.138
ring1_addr: 10.28.98.138
name: lvm-nfscpdata-04ct
nodeid: 1688
  }
  node {
ring0_addr: 10.28.63.139
ring1_addr: 10.28.98.139
name: lvm-nfscpdata-05ct
nodeid: 1689
  }
  node {
ring0_addr: 10.28.63.140
ring1_addr: 10.28.98.140
name: lvm-nfscpdata-06ct
nodeid: 1690
  }
}

Ring 1 is managed by a host firewall, but the ports are opened.
Ring 0 has no firewall settings.






-----Original Message-----
From: Strahil Nikolov
Sent: Tuesday, 9 June 2020 21:34
To: ROHWEDER-NEUBECK, MICHAEL (EXTERN); Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Redundant Ring Network failure

It will be hard to guess if you are using sctp or udp/udpu.
If possible, share the corosync.conf (you can remove sensitive data, but make it meaningful).

Are you using a firewall? If yes, check:
1. The node firewall is not blocking the communication on the specific interfaces.
2. Verify with tcpdump that the heartbeats are received from the remote side (see the example below).
3. Check for retransmissions or packet loss.
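
For the tcpdump check in point 2, a minimal sketch (assuming knet's default UDP port 5405; adjust the interface name to your ring):

tcpdump -ni eth0 udp port 5405   # should show corosync/knet traffic arriving from the peer nodes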

Usually you can find more details in the log specified in corosync.conf or in 
/var/log/messages (and also the journal).

Best Regards,
Strahil Nikolov

On 9 June 2020 21:11:02 GMT+03:00, "ROHWEDER-NEUBECK, MICHAEL (EXTERN)" wrote:
>Hi,
>
>we are using unicast ("knet")
>
>Greetings
>
>Michael
>
>
>
>
>
>
>-----Original Message-----
>From: Strahil Nikolov
>Sent: Tuesday, 9 June 2020 19:30
>To: Cluster Labs - All topics related to open-source clustering welcomed; ROHWEDER-NEUBECK, MICHAEL (EXTERN)
>Subject: Re: [ClusterLabs] Redundant Ring Network failure
>
>Are you using multicast?
>
>Best Regards,
>Strahil Nikolov
>
>On 9 June 2020 10:28:25 GMT+03:00, "ROHWEDER-NEUBECK, MICHAEL (EXTERN)" wrote:
>>Hello,
>>We have massive problems with the redundant ring operation of our 
>>Corosync / pacemaker 3 Node NFS clusters.
>>
>>Most of the nodes either have an entire ring offline or only 1 node in
>
>>a ring.
>>Example: (Node1 Ring0 333 Ring1 n33 | Node2 Ring0 033 Ring1 3n3 |
>Node3
>>Ring0 333 Ring 1 33n)
>>
>>corosync-cfgtool -R doesn't help
>>All nodes are VMs that build the ring together using 2 VLANs.
>>Which logs do you need to hopefully help me?
>>
>>Corosync Cluster Engine, version '3.0.1'
>>Copyright (c) 2006-2018 Red Hat, Inc.
>>Debian Buster
>>
>>

Re: [ClusterLabs] Redundant Ring Network failure

2020-06-10 Thread ROHWEDER-NEUBECK, MICHAEL (EXTERN)
Jan,

actually we are using this:

[root@lvm-nfscpdata-05ct::~ 100 ]# apt show corosync
Package: corosync
Version: 3.0.1-2+deb10u1

[root@lvm-nfscpdata-05ct::~]# apt show libknet1
Package: libknet1
Version: 1.8-2

These are the newest versions provided on the mirror.






-Ursprüngliche Nachricht-
Von: Jan Friesse  
Gesendet: Mittwoch, 10. Juni 2020 09:24
An: Cluster Labs - All topics related to open-source clustering welcomed 
; ROHWEDER-NEUBECK, MICHAEL (EXTERN) 
; us...@lists.clusterlabs.org
Betreff: Re: [ClusterLabs] Redudant Ring Network failure

Michael,
what version of knet you are using? We had quite a few problems with older 
versions of knet, so current stable is recommended (1.16). Same applies for 
corosync because 3.0.4 has vastly improved display of links status.

> Hello,
> We have massive problems with the redundant ring operation of our Corosync / 
> pacemaker 3 Node NFS clusters.
> 
> Most of the nodes either have an entire ring offline or only 1 node in a ring.
> Example: (Node1 Ring0 333 Ring1 n33 | Node2 Ring0 033 Ring1 3n3 | 
> Node3 Ring0 333 Ring 1 33n)

Doesn't seem completely wrong. You can ignore 'n' for ring 1, because that is 
localhost which is connected only on Ring 0 (3.0.4 has this output more 
consistent) so all nodes are connected at least via Ring 1. 
Ring 0 on node 2 seems to have some trouble with connection to node 1 but node 
1 (and 3) seems to be connected to node 2 just fine, so I think it is ether 
some bug in knet (probably already fixed) or some kind of firewall blocking 
just connection from node 2 to node 1 on ring 0.


> 
> corosync-cfgtool -R don't help
> All nodes are VMs that build the ring together using 2 VLANs.
> Which logs do you need to hopefully help me?

syslog/journal should contain everything needed especially when debug is 
enabled (corosync.conf - logging.debug: on)

Regards,
   Honza

> 
> Corosync Cluster Engine, version '3.0.1'
> Copyright (c) 2006-2018 Red Hat, Inc.
> Debian Buster
> 
> 
> --
> Mit freundlichen Grüßen
>Michael Rohweder-Neubeck
> 
> NSB GmbH – Nguyen Softwareentwicklung & Beratung GmbH Röntgenstraße 27
> D-64291 Darmstadt
> E-Mail: 
> m...@nsb-software.de .de%3cmailto:m...@nsb-software.de>>
> Manager: Van-Hien Nguyen, Jörg Jaspert
> USt-ID: DE 195 703 354; HRB 7131 Amtsgericht Darmstadt
> 
> 
> 
> 
> Sitz der Gesellschaft / Corporate Headquarters: Deutsche Lufthansa 
> Aktiengesellschaft, Koeln, Registereintragung / Registration: 
> Amtsgericht Koeln HR B 2168 Vorsitzender des Aufsichtsrats / Chairman 
> of the Supervisory Board: Dr. Karl-Ludwig Kley Vorstand / Executive 
> Board: Carsten Spohr (Vorsitzender / Chairman), Thorsten Dirks, 
> Christina Foerster, Harry Hohmeister, Dr. Detlef Kayser, Dr. Michael 
> Niggemann
> 
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 



Re: [ClusterLabs] Redundant Ring Network failure

2020-06-10 Thread Jan Friesse

Michael,
what version of knet are you using? We had quite a few problems with
older versions of knet, so the current stable release (1.16) is recommended.
The same applies to corosync, because 3.0.4 has a vastly improved display of
link status.



Hello,
We have massive problems with the redundant ring operation of our Corosync / 
pacemaker 3 Node NFS clusters.

Most of the nodes either have an entire ring offline or only 1 node in a ring.
Example: (Node1 Ring0 333 Ring1 n33 | Node2 Ring0 033 Ring1 3n3 | Node3 Ring0 
333 Ring 1 33n)


Doesn't seem completely wrong. You can ignore 'n' for ring 1, because
that is localhost, which is connected only on Ring 0 (3.0.4 has this
output more consistent), so all nodes are connected at least via Ring 1.
Ring 0 on node 2 seems to have some trouble with the connection to node 1,
but node 1 (and 3) seems to be connected to node 2 just fine, so I think
it is either some bug in knet (probably already fixed) or some kind of
firewall blocking just the connection from node 2 to node 1 on ring 0.





corosync-cfgtool -R doesn't help
All nodes are VMs that build the ring together using 2 VLANs.
Which logs do you need to hopefully help me?


syslog/journal should contain everything needed, especially when debug is
enabled (corosync.conf - logging.debug: on).
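
For reference, a minimal sketch of that logging tweak, to be merged into the existing logging section of corosync.conf:

logging {
  to_syslog: yes
  debug: on   # enable corosync debug logging; turn it back off once the problem is captured
}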


Regards,
  Honza



Corosync Cluster Engine, version '3.0.1'
Copyright (c) 2006-2018 Red Hat, Inc.
Debian Buster











