[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
*** This bug is a duplicate of bug 821732 ***
    https://bugs.launchpad.net/bugs/821732

** This bug has been marked a duplicate of bug 821732
   socket leak in lrmd
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
** Changed in: cluster-glue (Ubuntu)
       Status: New => Triaged

** Changed in: cluster-glue (Ubuntu)
   Importance: Undecided => Low
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
cluster-glue with upstart support (1.0.6-1ubuntu1.1) may currently be buggy, in that it doesn't initialise the threading system that the upstart support requires when accessing D-Bus. See http://www.gossamer-threads.com/lists/linuxha/dev/68379?search_string=possible%20deadlock%20in%20lrmd;#68379 for further discussion. It does, however, include Senko Rasic's patch mentioned in that thread.
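For context, the initialisation being talked about is the GLib/dbus-glib threading setup that any multi-threaded D-Bus client of that era needed before its first D-Bus call. A minimal sketch, assuming lrmd's upstart support uses GLib and dbus-glib (which the linked thread suggests but this bug does not confirm; init_threading is an illustrative helper name, not lrmd code):

    /* Sketch of the missing initialisation: with GLib < 2.32 and
     * dbus-glib, a daemon touching D-Bus from more than one thread
     * must run this once, early, before any other GLib/D-Bus call;
     * otherwise library-internal state is unlocked and can deadlock. */
    #include <glib.h>
    #include <dbus/dbus-glib.h>

    static void init_threading(void)
    {
        if (!g_thread_supported()) {
            g_thread_init(NULL);   /* initialise GLib's threading system */
        }
        dbus_g_thread_init();      /* make dbus-glib use GLib's locks */
    }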
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
hb_report attached, taken just after corosync started, for what it is worth.

** Attachment added: "report_node1.tar.bz2"
   https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391/+attachment/1738641/+files/report_node1.tar.bz2
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
For clarification: I am running a two-node failover (master/slave) clustered FTP server, using DRBD to replicate the filesystem, with corosync/pacemaker as my HA stack. I can post the configuration, but I don't think it is important, as the system fails before it really gets used.
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
Versions of the other packages:

    pacemaker       1.0.9.1-2ubuntu4
    corosync        1.2.1-1ubuntu1
    cluster-agents  1:1.0.3-3

The only other thing I noticed during the upgrade was that heartbeat-common had no installation candidate on one of the two servers I upgraded. I put this down to the fact that I had previously had the old heartbeat HA stack working on that box.
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
Further investigation (not yet correlated) suggests that it might be a permissions issue or a timing issue. I can see from straces crmd creating the lrm sockets:

[pid 4433] unlink("/var/run/heartbeat/lrm_cmd_sock") = -1 ENOENT (No such file or directory)
[pid 4433] bind(4, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_cmd_sock"}, 110) = 0
[pid 4433] chmod("/var/run/heartbeat/lrm_cmd_sock", 0777) = 0
[pid 4433] listen(4, 10) = 0
[pid 4433] fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
[pid 4433] fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 4433] socket(PF_FILE, SOCK_STREAM, 0) = 7
[pid 4433] unlink("/var/run/heartbeat/lrm_callback_sock") = -1 ENOENT (No such file or directory)
[pid 4433] bind(7, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_callback_sock"}, 110) = 0
[pid 4433] chmod("/var/run/heartbeat/lrm_callback_sock", 0777) = 0

and then shortly afterwards deleting them again:

[pid 4433] unlink("/var/run/heartbeat/lrm_cmd_sock" <unfinished ...>
[pid 4436] <... mprotect resumed> ) = 0
[pid 4433] <... unlink resumed> ) = 0
[pid 4425] futex(0x7f33c929b0c4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f33c929b0c0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid 4433] close(7 <unfinished ...>
[pid 4426] <... futex resumed> ) = 0
[pid 4433] <... close resumed> ) = 0
[pid 4426] futex(0x7f33c929b100, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 4433] unlink("/var/run/heartbeat/lrm_callback_sock" <unfinished ...>

On one of the two nodes I did manage to make some progress: I stopped corosync, waited a few minutes, then started it again, and now I get the sockets. (Doing /etc/init.d/corosync restarts or machine reboots was never successful.)

root@node1:/var/run/heartbeat# ls -l
total 0
srwxrwxrwx 1 root root  0 2010-11-17 01:22 lrm_callback_sock
srwxrwxrwx 1 root root  0 2010-11-17 01:22 lrm_cmd_sock
drwxr-xr-x 2 root root 40 2010-11-17 01:22 rsctmp
srwxrwxrwx 1 root root  0 2010-11-17 01:22 stonithd
srwxrwxrwx 1 root root  0 2010-11-17 01:22 stonithd_callback

But I can't repeat this on the second node. I haven't tried it on the first node again (at least I can compare things with the second node). After
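As an aside for anyone reading the trace: the first burst is the standard Unix-domain listening-socket setup. A minimal sketch of the equivalent code, with the paths from the trace and error handling omitted (make_lrm_socket is an illustrative name; this is not the actual lrmd source):

    /* Minimal sketch of the socket setup visible in the strace above:
     * create a Unix-domain listening socket at the given path, make it
     * world-accessible, and mark it non-blocking.  Illustration only. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int make_lrm_socket(const char *path)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        unlink(path);                  /* first-run ENOENT is harmless */
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        bind(fd, (struct sockaddr *) &addr, sizeof(addr));
        chmod(path, 0777);             /* as in the trace: world-writable */
        listen(fd, 10);
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
        return fd;
    }

The later unlink()/close() burst is the same paths being torn down again, which would explain why crmd cannot sign on afterwards.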
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
Hi. I presume your name is Dave...

Well, first off, I can't seem to reproduce your bug. I just set up a cluster: both nodes come online and a DC is elected. The logs don't show any of this behaviour.

Now, I was thinking: given that you have upgraded, maybe a file has been overwritten by the new packages, such as:

/etc/default/corosync
/etc/corosync/corosync.conf

Or there's something wrong with the keys.

First, I'd recommend you check that the above files have the correct values (/etc/default/corosync set to yes, and the network in /etc/corosync/corosync.conf the same on both nodes). If everything is configured as expected, try recreating the keys with 'corosync-keygen', make sure the permissions are correct (root:root and mode 400), and copy the key to the other node.

Let me know afterwards. Thank you for reporting bugs.
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
Yes Andres, I'm Dave! To answer your questions:

/etc/default/corosync - YES, start at boot
/etc/corosync/corosync.conf - identical between nodes, looks OK/intact, and is dated 1 month ago
authkeys - md5sums identical, dated 1 month ago, perms 400, root:root

Both nodes report the same in crm_mon, so there is no reason to think comms are a problem (e.g. bad auth or multicast).

Last few lines of the log from node2 as crmd died:

...
Nov 17 10:55:26 node2 crmd: [22808]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
Nov 17 10:55:26 node2 crmd: [22808]: WARN: lrm_signon: can not initiate connection
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_lrm_control: Failed to sign on to the LRM 30 (max) times
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_log: FSA: Input I_ERROR from do_lrm_control() received in state S_STARTING
Nov 17 10:55:26 node2 crmd: [22808]: info: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_lrm_control ]
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_recover: Action A_RECOVER (0100) not supported
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_started: Start cancelled... S_RECOVERY
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Nov 17 10:55:26 node2 crmd: [22808]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Nov 17 10:55:26 node2 crmd: [22808]: info: do_ha_control: Disconnected from OpenAIS
Nov 17 10:55:26 node2 crmd: [22808]: info: do_cib_control: Disconnecting CIB
Nov 17 10:55:26 node2 crmd: [22808]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Nov 17 10:55:26 node2 crmd: [22808]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_exit: Could not recover from internal error
Nov 17 10:55:26 node2 crmd: [22808]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Nov 17 10:55:26 node2 cib: [20601]: WARN: send_ipc_message: IPC Channel to 22808 is not connected
Nov 17 10:55:26 node2 cib: [20601]: WARN: send_via_callback_channel: Delivery of reply to client 22808/3210be2d-0165-4c05-8c43-8945feea0692 failed
Nov 17 10:55:26 node2 crmd: [22808]: info: do_exit: [crmd] stopped (2)
Nov 17 10:55:26 node2 cib: [20601]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Nov 17 10:55:26 node2 corosync[20586]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x1d98d00, async-conn=0x1d98d00) left
Nov 17 10:55:27 node2 corosync[20586]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=22808, rc=2)
Nov 17 10:55:27 node2 corosync[20586]: [pcmk ] pcmk_wait_dispatch: Call to wait4(crmd) failed: (10) No child processes
Nov 17 10:55:27 node2 corosync[20586]: [pcmk ] ERROR: pcmk_wait_dispatch: Child respawn count exceeded by crmd
Nov 17 10:55:27 node2 corosync[20586]: [pcmk ] info: update_member: Node node2 now has process list: 0002 (69906)
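For what it's worth, the shape of that log (a wait timer popping, a failed sign-on, then the 30-attempt error) implies retry logic roughly like the following. This is reconstructed from the log messages alone; every name below is a stand-in, not the actual Pacemaker source:

    /* Hedged sketch of the retry behaviour implied by the log above:
     * crmd keeps retrying the LRM sign-on off a wait timer and gives
     * up with an internal error after 30 attempts. */
    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_LRM_REGISTER_FAILS 30

    static bool lrm_signon(void)       { return false; } /* stub: sign-on keeps failing   */
    static void start_wait_timer(void) { }               /* stub: re-arms the retry timer */

    int main(void)
    {
        int fails = 0;

        while (!lrm_signon()) {        /* "lrm_signon: can not initiate connection" */
            if (++fails >= MAX_LRM_REGISTER_FAILS) {
                fprintf(stderr,
                        "ERROR: do_lrm_control: Failed to sign on to the LRM %d (max) times\n",
                        fails);
                return 2;              /* crmd exits rc=2, as corosync then reports */
            }
            start_wait_timer();        /* "Wait Timer (I_NULL) just popped!" */
        }
        return 0;
    }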
[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick
Also on node1 (where I think lrmd is running) I have:

crm_mon[26251]: 2010/11/17_21:00:32 info: determine_online_status: Node node1 is standby
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: failover-ip_monitor_0 on node1 returned 0 (ok) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation failover-ip_monitor_0 found resource failover-ip active on node1
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: drbd_disk:0_monitor_0 on node1 returned 8 (master) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation drbd_disk:0_monitor_0 found resource drbd_disk:0 active in master mode on node1
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: fs_drbd_monitor_0 on node1 returned 0 (ok) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation fs_drbd_monitor_0 found resource fs_drbd active on node1
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: vsftpd_monitor_0 on node1 returned 0 (ok) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation vsftpd_monitor_0 found resource vsftpd active on node1

which suggests that the RAs aren't reporting correctly.