Thank you for taking the time to report this bug. In an effort to keep an
up-to-date and valid list of bugs to work on, I have reviewed this report
to verify that it still requires effort and still occurs on an Ubuntu
release in standard support; it does not.
Judging by the existing comments and the linked upstream discussion thread, it
appears that the wrong lrmd was installed after the upgrade, which confused the
existing cluster when it connected to the local resource manager daemon.
It is unfortunate that we were unable to resolve this defect; however,
there appears to be no further action possible at this time. I am
therefore moving the bug to 'Incomplete'. If you disagree or have
new information, we would be grateful if you could please add a comment
stating why and then change the status of the bug to 'New'.
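For anyone still hitting this, a quick way to check which lrmd binary
heartbeat would find is to resolve both paths. This is a minimal sketch
using the paths from the report below; they may differ on newer releases,
and the loop simply reports "missing" where a path does not exist:

```shell
# Check which lrmd exists and what it resolves to (paths from this report)
for p in /usr/lib/heartbeat/lrmd /usr/lib/pacemaker/lrmd; do
    if [ -e "$p" ]; then
        printf '%s -> %s\n' "$p" "$(readlink -f "$p")"
    else
        printf '%s missing\n' "$p"
    fi
done
```

On an affected node, /usr/lib/heartbeat/lrmd resolving to the cluster-glue
binary rather than pacemaker's lrmd matches the failure described below.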
** Changed in: cluster-glue (Ubuntu)
Status: Confirmed => Incomplete
** Also affects: cluster-glue (Ubuntu Trusty)
Importance: Undecided
Status: New
** Changed in: cluster-glue (Ubuntu Trusty)
Status: New => Incomplete
** Changed in: cluster-glue (Ubuntu)
Status: Incomplete => Fix Released
** Changed in: cluster-glue (Ubuntu)
Assignee: Rafael David Tinoco (rafaeldtinoco) => (unassigned)
--
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to cluster-glue in Ubuntu.
https://bugs.launchpad.net/bugs/1251298
Title:
Failed to sign on to LRMd with Heartbeat/Pacemaker
Status in cluster-glue package in Ubuntu:
Fix Released
Status in cluster-glue source package in Trusty:
Incomplete
Bug description:
I'm running a 2-node heartbeat/pacemaker cluster, which was working fine
with Ubuntu 13.04.
After upgrading from Ubuntu 13.04 to Ubuntu 13.10, Heartbeat/Pacemaker keeps
restarting the system due to lrmd sign-on errors while heartbeat tries to
recover.
As one system is already on Ubuntu 13.10 and the other is still running
13.04, I've tried it without the second node, which leads to the same
behavior; it occurs before any cluster communication happens.
Syslog:
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 1 (30 max) times
Nov 14 15:53:06 wolverine crmd[2464]: notice: crmd_client_status_callback:
Status update: Client wolverine.domain.tld/crmd now has status [join] (DC=false)
Nov 14 15:53:06 wolverine crmd[2464]: notice: crmd_client_status_callback:
Status update: Client wolverine.domain.tld/crmd now has status [online]
(DC=false)
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 2 (30 max) times
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 3 (30 max) times
Nov 14 15:53:07 wolverine stonith-ng[2462]: notice: setup_cib: Watching for
stonith topology changes
Nov 14 15:53:07 wolverine stonith-ng[2462]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Nov 14 15:53:08 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 4 (30 max) times
Nov 14 15:53:10 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 5 (30 max) times
Nov 14 15:53:12 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 6 (30 max) times
Nov 14 15:53:14 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 7 (30 max) times
Nov 14 15:53:16 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 8 (30 max) times
Nov 14 15:53:18 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 9 (30 max) times
Nov 14 15:53:20 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 10 (30 max) times
Nov 14 15:53:22 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 11 (30 max) times
Nov 14 15:53:24 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 12 (30 max) times
Nov 14 15:53:26 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 13 (30 max) times
Nov 14 15:53:28 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 14 (30 max) times
Nov 14 15:53:30 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 15 (30 max) times
Nov 14 15:53:32 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 16 (30 max) times
Nov 14 15:53:34 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 17 (30 max) times
Nov 14 15:53:36 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 18 (30 max) times
Nov 14 15:53:38 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 19 (30 max) times
Nov 14 15:53:40 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 20 (30 max) times
Nov 14 15:53:42 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 21 (30 max) times
Nov 14 15:53:44 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 22 (30 max) times
Nov 14 15:53:46 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 23 (30 max) times
Nov 14 15:53:48 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 24 (30 max) times
Nov 14 15:53:50 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 25 (30 max) times
Nov 14 15:53:52 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 26 (30 max) times
Nov 14 15:53:54 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 27 (30 max) times
Nov 14 15:53:56 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 28 (30 max) times
Nov 14 15:53:58 wolverine crmd[2464]: warning: do_lrm_control: Failed to
sign on to the LRM 29 (30 max) times
Nov 14 15:54:00 wolverine crmd[2464]: error: do_lrm_control: Failed to
sign on to the LRM 30 (max) times
Nov 14 15:54:00 wolverine crmd[2464]: error: do_log: FSA: Input I_ERROR
from do_lrm_control() received in state S_STARTING
Nov 14 15:54:00 wolverine crmd[2464]: notice: do_state_transition: State
transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
origin=do_lrm_control ]
Nov 14 15:54:00 wolverine crmd[2464]: warning: do_recover: Fast-tracking
shutdown in response to errors
Symlinking lrmd from the pacemaker package partly solved this problem:
root@wolverine ~ # mv /usr/lib/heartbeat/lrmd{,.cluster-glue}
root@wolverine ~ # cd /usr/lib/heartbeat/
root@wolverine /usr/lib/heartbeat # ln -s ../pacemaker/lrmd
root@wolverine /usr/lib/heartbeat # ls -la lrmd
lrwxrwxrwx 1 root root 17 Nov 14 16:35 lrmd -> ../pacemaker/lrmd
root@wolverine /usr/lib/heartbeat # ls -la lrmd*
lrwxrwxrwx 1 root root 17 Nov 14 16:35 lrmd -> ../pacemaker/lrmd
-rwxr-xr-x 1 root root 92816 Jul 18 17:55 lrmd.cluster-glue
root@wolverine /usr/lib/heartbeat #
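To make the effect of the workaround concrete, here is a sandboxed replay
in a throwaway directory. The files are stand-ins, not real daemons, so it
is safe to run anywhere; it only demonstrates that after the mv and ln -s,
the heartbeat path resolves to pacemaker's lrmd:

```shell
# Sandboxed replay of the mv + ln -s workaround above (stand-in files)
tmp=$(mktemp -d)
mkdir -p "$tmp/heartbeat" "$tmp/pacemaker"
echo cluster-glue-lrmd > "$tmp/heartbeat/lrmd"   # stand-in: old cluster-glue binary
echo pacemaker-lrmd > "$tmp/pacemaker/lrmd"      # stand-in: pacemaker's lrmd
mv "$tmp/heartbeat/lrmd" "$tmp/heartbeat/lrmd.cluster-glue"
ln -s ../pacemaker/lrmd "$tmp/heartbeat/lrmd"
cat "$tmp/heartbeat/lrmd"    # prints: pacemaker-lrmd
```

As a suggestion not tested in this report, a dpkg-divert on
/usr/lib/heartbeat/lrmd would survive package upgrades better than a bare
mv, since dpkg would then leave the diverted path alone.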
Stopping heartbeat will still result in an unexpected reboot:
Nov 14 16:37:27 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-backup:1_notify_0 (call=45, rc=0, cib-update=0, confirmed=true)
ok
Nov 14 16:37:28 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-rsyslog:1_notify_0 (call=48, rc=0, cib-update=0, confirmed=true)
ok
Nov 14 16:37:28 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2673
failed authorization [no default client auth]
Nov 14 16:37:28 wolverine heartbeat: [2238]: ERROR:
api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:37:28 wolverine attrd[2258]: notice: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-backup:1 (10000)
Nov 14 16:37:28 wolverine attrd[2258]: notice: attrd_perform_update: Sent
update 10: master-drbd-backup:1=10000
Nov 14 16:37:28 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-backup:1_monitor_31000 (call=50, rc=0, cib-update=17,
confirmed=false) ok
Nov 14 16:37:29 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2700
failed authorization [no default client auth]
Nov 14 16:37:29 wolverine heartbeat: [2238]: ERROR:
api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:37:29 wolverine attrd[2258]: notice: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-rsyslog:1 (10000)
Nov 14 16:37:29 wolverine attrd[2258]: notice: attrd_perform_update: Sent
update 13: master-drbd-rsyslog:1=10000
Nov 14 16:37:29 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-rsyslog:1_monitor_31000 (call=54, rc=0, cib-update=18,
confirmed=false) ok
Nov 14 16:37:59 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2812
failed authorization [no default client auth]
Nov 14 16:37:59 wolverine heartbeat: [2238]: ERROR:
api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:00 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2839
failed authorization [no default client auth]
Nov 14 16:38:00 wolverine heartbeat: [2238]: ERROR:
api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:05 wolverine heartbeat: [2238]: info: killing
/usr/lib/heartbeat/crmd process group 2259 with signal 15
Nov 14 16:38:05 wolverine crmd[2259]: notice: crm_shutdown: Requesting
shutdown, upper limit is 1200000ms
Nov 14 16:38:05 wolverine attrd[2258]: notice: attrd_trigger_update:
Sending flush op to all hosts for: shutdown (1384443485)
Nov 14 16:38:05 wolverine attrd[2258]: notice: attrd_perform_update: Sent
update 16: shutdown=1384443485
Nov 14 16:38:06 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-backup:1_notify_0 (call=57, rc=0, cib-update=0, confirmed=true)
ok
Nov 14 16:38:06 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-rsyslog:1_notify_0 (call=59, rc=0, cib-update=0, confirmed=true)
ok
Nov 14 16:38:07 wolverine kernel: [ 255.385984] d-con backup: Requested
state change failed by peer: Refusing to be Primary while peer is not outdated
(-7)
Nov 14 16:38:07 wolverine kernel: [ 255.386415] d-con backup: peer( Primary
-> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated )
pdsk( UpToDate -> DUnknown )
Nov 14 16:38:07 wolverine kernel: [ 255.386428] d-con backup: asender
terminated
Nov 14 16:38:07 wolverine kernel: [ 255.386438] d-con backup: Terminating
drbd_a_backup
Nov 14 16:38:07 wolverine kernel: [ 255.386693] d-con backup: Connection
closed
Nov 14 16:38:07 wolverine kernel: [ 255.386716] d-con backup: conn(
Disconnecting -> StandAlone )
Nov 14 16:38:07 wolverine kernel: [ 255.386718] d-con backup: receiver
terminated
Nov 14 16:38:07 wolverine kernel: [ 255.386722] d-con backup: Terminating
drbd_r_backup
Nov 14 16:38:07 wolverine kernel: [ 255.386750] block drbd0: disk( Outdated
-> Failed )
Nov 14 16:38:07 wolverine kernel: [ 255.409861] block drbd0: bitmap WRITE of
0 pages took 0 jiffies
Nov 14 16:38:07 wolverine kernel: [ 255.409930] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
Nov 14 16:38:07 wolverine kernel: [ 255.409943] block drbd0: disk( Failed ->
Diskless )
Nov 14 16:38:07 wolverine kernel: [ 255.410041] block drbd0: drbd_bm_resize
called with capacity == 0
Nov 14 16:38:07 wolverine kernel: [ 255.411773] d-con backup: Terminating
drbd_w_backup
Nov 14 16:38:07 wolverine kernel: [ 255.466428] d-con rsyslog: Requested
state change failed by peer: Refusing to be Primary while peer is not outdated
(-7)
Nov 14 16:38:07 wolverine kernel: [ 255.466796] d-con rsyslog: peer( Primary
-> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated )
pdsk( UpToDate -> DUnknown )
Nov 14 16:38:07 wolverine kernel: [ 255.466814] d-con rsyslog: asender
terminated
Nov 14 16:38:07 wolverine kernel: [ 255.466832] d-con rsyslog: Terminating
drbd_a_rsyslog
Nov 14 16:38:07 wolverine kernel: [ 255.467098] d-con rsyslog: Connection
closed
Nov 14 16:38:07 wolverine kernel: [ 255.467121] d-con rsyslog: conn(
Disconnecting -> StandAlone )
Nov 14 16:38:07 wolverine kernel: [ 255.467123] d-con rsyslog: receiver
terminated
Nov 14 16:38:07 wolverine kernel: [ 255.467128] d-con rsyslog: Terminating
drbd_r_rsyslog
Nov 14 16:38:07 wolverine kernel: [ 255.467169] block drbd1: disk( Outdated
-> Failed )
Nov 14 16:38:07 wolverine kernel: [ 255.481716] block drbd1: bitmap WRITE of
0 pages took 0 jiffies
Nov 14 16:38:07 wolverine kernel: [ 255.481778] block drbd1: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
Nov 14 16:38:07 wolverine kernel: [ 255.481791] block drbd1: disk( Failed ->
Diskless )
Nov 14 16:38:07 wolverine kernel: [ 255.481881] block drbd1: drbd_bm_resize
called with capacity == 0
Nov 14 16:38:07 wolverine kernel: [ 255.482011] d-con rsyslog: Terminating
drbd_w_rsyslog
Nov 14 16:38:07 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2986
failed authorization [no default client auth]
Nov 14 16:38:07 wolverine heartbeat: [2238]: ERROR:
api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:07 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2989
failed authorization [no default client auth]
Nov 14 16:38:07 wolverine heartbeat: [2238]: ERROR:
api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-backup:1 (<null>)
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent
delete 18: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56,
attr=master-drbd-backup:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:07 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-backup:1_stop_0 (call=64, rc=0, cib-update=19, confirmed=true) ok
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-rsyslog:1 (<null>)
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent
delete 20: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56,
attr=master-drbd-rsyslog:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:07 wolverine crmd[2259]: notice: process_lrm_event: LRM
operation drbd-rsyslog:1_stop_0 (call=67, rc=0, cib-update=20, confirmed=true)
ok
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent
delete 22: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56,
attr=master-drbd-backup:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent
delete 24: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56,
attr=master-drbd-rsyslog:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:08 wolverine crmd[2259]: notice: do_state_transition: State
transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_HA_MESSAGE
origin=route_message ]
Nov 14 16:38:08 wolverine crmd[2259]: notice: lrm_state_verify_stopped:
Stopped 0 recurring operations at (null) (1274719202 ops remaining)
Nov 14 16:38:08 wolverine crmd[2259]: notice: do_lrm_control: Disconnected
from the LRM
Nov 14 16:38:08 wolverine ccm: [2254]: info: client (pid=2259) removed from
ccm
Nov 14 16:38:08 wolverine heartbeat: [2238]: EMERG: Rebooting system.
Reason: /usr/lib/heartbeat/crmd
root@wolverine ~ # lsb_release -rd
Description: Ubuntu 13.10
Release: 13.10
root@wolverine ~ # apt-cache policy cluster-glue
cluster-glue:
Installed: 1.0.11+hg2754-1.1
Candidate: 1.0.11+hg2754-1.1
Version table:
*** 1.0.11+hg2754-1.1 0
500 http://de.archive.ubuntu.com/ubuntu/ saucy/main amd64 Packages
100 /var/lib/dpkg/status
root@wolverine ~ #
root@wolverine ~ # apt-cache policy heartbeat
heartbeat:
Installed: 1:3.0.5-3.1ubuntu1
Candidate: 1:3.0.5-3.1ubuntu1
Version table:
*** 1:3.0.5-3.1ubuntu1 0
500 http://de.archive.ubuntu.com/ubuntu/ saucy/main amd64 Packages
100 /var/lib/dpkg/status
root@wolverine ~ #
root@wolverine ~ # apt-cache policy pacemaker
pacemaker:
Installed: 1.1.10+git20130802-1ubuntu1
Candidate: 1.1.10+git20130802-1ubuntu1
Version table:
*** 1.1.10+git20130802-1ubuntu1 0
500 http://de.archive.ubuntu.com/ubuntu/ saucy/main amd64 Packages
100 /var/lib/dpkg/status
root@wolverine ~ #
Expected:
- Working heartbeat/pacemaker setup after the Ubuntu upgrade
What happened:
- The system reboots after about one minute due to heartbeat recovery attempts
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/1251298/+subscriptions