[Group.of.nepali.translators] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

Rafael David Tinoco Wed, 01 Apr 2020 11:46:26 -0700

From the Corosync 2.4.1 release notes:

This release contains a fix for one regression and a few smaller fixes.


"""
During 2.3.6 development, a bug was introduced that causes pacemaker to stop
working after the corosync configuration file is reloaded. The solution is
either to use this fixed version (recommended) or, as a quick workaround (for
users who want to stay on 2.3.6 or 2.4.0), to create a file named pacemaker
(the file name can be arbitrary) in the /etc/corosync/uidgid.d directory with
the following content (you can also put the same stanza into
/etc/corosync/corosync.conf):

uidgid {
    gid: haclient
}
"""
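
A minimal sketch of the file-based workaround quoted above, assuming a POSIX
shell. The helper name write_uidgid_dropin and its arguments are hypothetical;
only the stanza and the /etc/corosync/uidgid.d directory come from the release
notes.

```shell
# Hypothetical helper (not part of corosync or pacemaker): write the uidgid
# stanza from the release notes into a drop-in file.
#   $1 = target directory (on a real node: /etc/corosync/uidgid.d)
#   $2 = file name; arbitrary per the release notes, defaults to "pacemaker"
write_uidgid_dropin() {
    dir="$1"
    name="${2:-pacemaker}"
    # Create the directory if it is missing, then write the stanza verbatim.
    mkdir -p "$dir" || return 1
    cat > "$dir/$name" <<'EOF'
uidgid {
    gid: haclient
}
EOF
}
```

On a cluster node this would be run as root, for example
`write_uidgid_dropin /etc/corosync/uidgid.d`. Corosync presumably has to
reload or restart before it picks up the new file; that step is an assumption
and is not covered by the sketch.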

Anyone relying on the Trusty or Xenial corosync packages:

 corosync | 2.3.3-1ubuntu1   | trusty
 corosync | 2.3.3-1ubuntu4   | trusty-updates
 corosync | 2.3.5-3ubuntu1   | xenial
 corosync | 2.3.5-3ubuntu2.3 | xenial-security
 corosync | 2.3.5-3ubuntu2.3 | xenial-updates

should apply the mitigation above, as commenters on this bug discovered
previously.

Note: Trusty is already EOS so I'm marking it as "won't fix".

Xenial should include the mitigation in an SRU.

** Changed in: pacemaker (Ubuntu Trusty)
       Status: Confirmed => Won't Fix

** Changed in: pacemaker (Ubuntu Trusty)
   Importance: Medium => Undecided

** Changed in: pacemaker (Ubuntu Xenial)
   Importance: Medium => High

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1439649

Title:
  Pacemaker unable to communicate with corosync on restart under lxc

Status in pacemaker package in Ubuntu:
  Fix Released
Status in pacemaker source package in Trusty:
  Won't Fix
Status in pacemaker source package in Xenial:
  Confirmed
Status in pacemaker source package in Bionic:
  Fix Released

Bug description:
  We've seen this a few times with three node clusters, all running in
  LXC containers; pacemaker fails to restart correctly as it can't
  communicate with corosync, resulting in a down cluster.  Rebooting the
  containers resolves the issue, so suspect some sort of bad state
  either in corosync or pacemaker.

  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: mcp_read_config: Configured corosync to accept connections from group 115: Library error (2)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063):  generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios  heartbeat corosync-native snmp libesmtp
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: cluster_connect_quorum: Quorum acquired
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1000
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-4-lxc-4[1001] - state is now member (was (null))
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1003] - state is now member (was (null))
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: main: CRM Git Version: 42f2063
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [MAIN  ] Denied connection attempt from 109:115
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [QB    ] Invalid IPC credentials (1033732-1033746).
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: main: HA Signon failed
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: main: Aborting startup
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:    error: pcmk_child_exit: Child process attrd (1033746) exited: Network is down (100)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:  warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: pcmk_shutdown_worker: Shuting down Pacemaker
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping crmd: Sent -15 to process 1033748
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:  warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
  Apr  2 11:41:32 juju-machine-4-lxc-4 cib[1033743]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: terminate_cs_connection: Disconnecting from Corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [MAIN  ] Denied connection attempt from 109:115
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [QB    ] Invalid IPC credentials (1033732-1033743).
  Apr  2 11:41:32 juju-machine-4-lxc-4 cib[1033743]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
  Apr  2 11:41:32 juju-machine-4-lxc-4 cib[1033743]:     crit: cib_init: Cannot sign in to the cluster... terminating
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping pengine: Sent -15 to process 1033747
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:    error: pcmk_child_exit: Child process cib (1033743) exited: Network is down (100)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:  warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping lrmd: Sent -15 to process 1033745
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping stonith-ng: Sent -15 to process 1033744
  Apr  2 11:41:34 juju-machine-4-lxc-4 corosync[1033732]:  [TOTEM ] A new membership (10.245.160.62:284) was formed. Members joined: 1000
  Apr  2 11:41:41 juju-machine-4-lxc-4 stonith-ng[1033744]:    error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
  Apr  2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: pcmk_shutdown_worker: Shutdown complete
  Apr  2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
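
  The rejection shows up in the corosync [MAIN] entries of the log above. As a
  rough sketch, assuming a POSIX shell with sed and assuming that the two
  numbers in that message are the uid and gid of the connecting process (the
  gid matches the "group 115" that pacemakerd reports configuring), the
  distinct rejected credentials can be listed like this. The helper name
  denied_ids is hypothetical.

```shell
# Hypothetical helper: list the distinct uid/gid pairs that corosync rejected.
# Reads the log files given as arguments, or stdin when none are given.
# It matches on "connection attempt from UID:GID", so it also works on the
# mail-wrapped copy of the log where "Denied" ends the previous line.
denied_ids() {
    sed -n 's/.*connection attempt from \([0-9][0-9]*\):\([0-9][0-9]*\).*/uid=\1 gid=\2/p' "$@" |
        sort -u
}
```

  For example, `denied_ids /var/log/syslog` (the log path is an assumption
  about the node's syslog setup), after which `getent group 115` shows which
  group the rejected gid belongs to.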

  ProblemType: Bug
  DistroRelease: Ubuntu 14.04
  Package: pacemaker 1.1.10+git20130802-1ubuntu2.3
  ProcVersionSignature: Ubuntu 3.16.0-33.44~14.04.1-generic 3.16.7-ckt7
  Uname: Linux 3.16.0-33-generic x86_64
  NonfreeKernelModules: vhost_net vhost macvtap macvlan xt_conntrack ipt_REJECT 
ip6table_filter ip6_tables ebtable_nat ebtables veth 8021q garp xt_CHECKSUM mrp 
iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables nbd 
ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi openvswitch gre vxlan dm_crypt bridge 
dm_multipath intel_rapl stp scsi_dh x86_pkg_temp_thermal llc intel_powerclamp 
coretemp ioatdma kvm_intel ipmi_si joydev sb_edac kvm hpwdt hpilo dca 
ipmi_msghandler acpi_power_meter edac_core lpc_ich shpchp serio_raw mac_hid xfs 
libcrc32c btrfs xor raid6_pq hid_generic usbhid hid crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul 
glue_helper ablk_helper cryptd psmouse tg3 ptp pata_acpi hpsa pps_core
  ApportVersion: 2.14.1-0ubuntu3.7
  Architecture: amd64
  Date: Thu Apr  2 11:42:18 2015
  SourcePackage: pacemaker
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1439649/+subscriptions
