[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2011-11-01 Thread Ante Karamatić
*** This bug is a duplicate of bug 821732 ***
https://bugs.launchpad.net/bugs/821732

** This bug has been marked a duplicate of bug 821732
   socket leak in lrmd

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/676391

Title:
   do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-12-03 Thread Andres Rodriguez
** Changed in: cluster-glue (Ubuntu)
   Status: New => Triaged

** Changed in: cluster-glue (Ubuntu)
   Importance: Undecided => Low



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-25 Thread flashydave
The cluster-glue build containing upstart support (1.0.6-1ubuntu1.1) may
currently be buggy, in that it doesn't initialise the threading system
that upstart support requires when accessing D-Bus. See
http://www.gossamer-threads.com/lists/linuxha/dev/68379?search_string=possible%20deadlock%20in%20lrmd;#68379
for further discussion. It does, however, include Senko Rasic's patch
mentioned in that thread.



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-19 Thread flashydave
hb_report attached, taken just after corosync started, for what it is worth.


** Attachment added: report_node1.tar.bz2
   https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391/+attachment/1738641/+files/report_node1.tar.bz2



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-17 Thread flashydave




[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-17 Thread flashydave
For clarification, I am running a 2-node failover (master/slave)
clustered FTP server, using DRBD to replicate the filesystem. I'm using
corosync/pacemaker for my HA stack. I can post the configuration, but I
don't think it is important, as the system fails before really using it.



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-17 Thread flashydave
Versions of the other packages:
pacemaker       1.0.9.1-2ubuntu4
corosync        1.2.1-1ubuntu1
cluster-agents  1:1.0.3-3


The only other thing I noticed during the upgrade was that heartbeat-common
had no candidate on 1 of the 2 servers I upgraded. I put this down to the
fact that I had previously had the old heartbeat HA working on that box.



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-17 Thread flashydave
Further investigation (not yet correlated) suggests that it might be a
permissions or a timing issue.
From the strace output I can see crmd creating the LRM sockets:

[pid  4433] unlink(/var/run/heartbeat/lrm_cmd_sock) = -1 ENOENT (No such file or directory)
[pid  4433] bind(4, {sa_family=AF_FILE, path=/var/run/heartbeat/lrm_cmd_sock}, 110) = 0
[pid  4433] chmod(/var/run/heartbeat/lrm_cmd_sock, 0777) = 0
[pid  4433] listen(4, 10)   = 0
[pid  4433] fcntl(4, F_GETFL)   = 0x2 (flags O_RDWR)
[pid  4433] fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  4433] socket(PF_FILE, SOCK_STREAM, 0) = 7
[pid  4433] unlink(/var/run/heartbeat/lrm_callback_sock) = -1 ENOENT (No such file or directory)
[pid  4433] bind(7, {sa_family=AF_FILE, path=/var/run/heartbeat/lrm_callback_sock}, 110) = 0
[pid  4433] chmod(/var/run/heartbeat/lrm_callback_sock, 0777) = 0

and then shortly afterwards deleting them again:
[pid  4433] unlink(/var/run/heartbeat/lrm_cmd_sock unfinished ...
[pid  4436] ... mprotect resumed )= 0
[pid  4433] ... unlink resumed )  = 0
[pid  4425] futex(0x7f33c929b0c4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f33c929b0c0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} unfinished ...
[pid  4433] close(7 unfinished ...
[pid  4426] ... futex resumed )   = 0
[pid  4433] ... close resumed )   = 0
[pid  4426] futex(0x7f33c929b100, FUTEX_WAIT_PRIVATE, 2, NULL unfinished ...
[pid  4433] unlink(/var/run/heartbeat/lrm_callback_sock unfinished ...


On one of the two nodes I did manage to make some progress: I stopped
corosync, waited a few minutes, then started it again, and now I get the
sockets. (Doing /etc/init.d/corosync restarts or machine reboots was never
successful.)
r...@node1:/var/run/heartbeat# ls -l
total 0
srwxrwxrwx 1 root root  0 2010-11-17 01:22 lrm_callback_sock
srwxrwxrwx 1 root root  0 2010-11-17 01:22 lrm_cmd_sock
drwxr-xr-x 2 root root 40 2010-11-17 01:22 rsctmp
srwxrwxrwx 1 root root  0 2010-11-17 01:22 stonithd
srwxrwxrwx 1 root root  0 2010-11-17 01:22 stonithd_callback
but I can't repeat this on the second node.
I haven't tried it on the first node again (at least this way I can compare
things with the 2nd node).
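
For comparing the two nodes, the check described above (are the LRM sockets
present or not) can be scripted. This is only a rough sketch: the directory
and socket names are taken from the listing above, while the function name
missing_sockets is made up for illustration.

```python
import os
import stat

# Socket names taken from the directory listing above.
EXPECTED_SOCKETS = ("lrm_cmd_sock", "lrm_callback_sock")


def missing_sockets(run_dir="/var/run/heartbeat", names=EXPECTED_SOCKETS):
    """Return the names under run_dir that are absent, unreadable,
    or exist but are not Unix sockets."""
    missing = []
    for name in names:
        path = os.path.join(run_dir, name)
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # Covers a missing file as well as a permission problem.
            missing.append(name)
            continue
        if not stat.S_ISSOCK(mode):
            missing.append(name)
    return missing
```

Calling missing_sockets() on each node right after corosync starts, and again
a few seconds later, should show whether the sockets appear and then vanish
as in the strace output.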



After



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-17 Thread Andres Rodriguez
Hi. I presume your name is Dave...

Well, first off, I can't seem to reproduce your bug. I just set up a
cluster; both nodes come online and a DC is elected. The logs don't
show any of this behavior. Now, I was thinking, given that you have
upgraded... maybe there's a file that has been overwritten by the new
packages, such as:

/etc/default/corosync
/etc/corosync/corosync.conf

Or, there's something wrong with the keys.

First, I'd recommend you check that the above files have the correct
values (/etc/default/corosync set to yes, and the network in
/etc/corosync/corosync.conf the same on both nodes). If everything is
configured as expected, try recreating the key with 'corosync-keygen',
copy it to the other node, and make sure the ownership and permissions
are correct on both (root:root, mode 400).
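
The ownership and permission check can also be done with a few lines of
Python, if that is easier than eyeballing ls output. This is just a sketch:
the function names are invented here, and the path to the key file is left
to the caller rather than assumed.

```python
import hashlib
import os
import stat


def check_key(path, expected_uid=0, expected_gid=0, expected_mode=0o400):
    """Return a list of human-readable problems with the key file at path.

    The defaults correspond to root:root and mode 400.
    """
    st = os.stat(path)
    problems = []
    if st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    if st.st_gid != expected_gid:
        problems.append(f"group gid is {st.st_gid}, expected {expected_gid}")
    mode = stat.S_IMODE(st.st_mode)
    if mode != expected_mode:
        problems.append(f"mode is {mode:o}, expected {expected_mode:o}")
    return problems


def md5_of(path):
    """Hex MD5 digest of a file, for comparing the key across nodes."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()
```

Run check_key on the key file on each node (as root) and compare the md5_of
output between the nodes; an empty list and matching digests would suggest
the key is not the problem.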

Let me know afterwards

Thank you for reporting bugs.



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-17 Thread flashydave
Yes Andres, I'm Dave!
To answer your questions:
/etc/default/corosync - YES, start at boot.
/etc/corosync/corosync.conf - identical between nodes, looks OK/intact and is
dated 1 month ago.
authkeys - md5sum identical on both nodes, dated 1 month ago, 400 root.root
perms.

Both nodes report the same status in crm_mon, so there is no reason to
think comms is the problem (e.g. bad auth or multicast).

Last few lines of log from node2 as crmd died:
...
Nov 17 10:55:26 node2 crmd: [22808]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
Nov 17 10:55:26 node2 crmd: [22808]: WARN: lrm_signon: can not initiate connection
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_lrm_control: Failed to sign on to the LRM 30 (max) times
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_log: FSA: Input I_ERROR from do_lrm_control() received in state S_STARTING
Nov 17 10:55:26 node2 crmd: [22808]: info: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_lrm_control ]
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_recover: Action A_RECOVER (0100) not supported
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_started: Start cancelled... S_RECOVERY
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Nov 17 10:55:26 node2 crmd: [22808]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Nov 17 10:55:26 node2 crmd: [22808]: info: do_ha_control: Disconnected from OpenAIS
Nov 17 10:55:26 node2 crmd: [22808]: info: do_cib_control: Disconnecting CIB
Nov 17 10:55:26 node2 crmd: [22808]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Nov 17 10:55:26 node2 crmd: [22808]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_exit: Could not recover from internal error
Nov 17 10:55:26 node2 crmd: [22808]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Nov 17 10:55:26 node2 cib: [20601]: WARN: send_ipc_message: IPC Channel to 22808 is not connected
Nov 17 10:55:26 node2 cib: [20601]: WARN: send_via_callback_channel: Delivery of reply to client 22808/3210be2d-0165-4c05-8c43-8945feea0692 failed
Nov 17 10:55:26 node2 crmd: [22808]: info: do_exit: [crmd] stopped (2)
Nov 17 10:55:26 node2 cib: [20601]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Nov 17 10:55:26 node2 corosync[20586]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x1d98d00, async-conn=0x1d98d00) left
Nov 17 10:55:27 node2 corosync[20586]:   [pcmk  ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=22808, rc=2)
Nov 17 10:55:27 node2 corosync[20586]:   [pcmk  ] pcmk_wait_dispatch: Call to wait4(crmd) failed: (10) No child processes
Nov 17 10:55:27 node2 corosync[20586]:   [pcmk  ] ERROR: pcmk_wait_dispatch: Child respawn count exceeded by crmd
Nov 17 10:55:27 node2 corosync[20586]:   [pcmk  ] info: update_member: Node node2 now has process list: 0002 (69906)
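
To pick the crmd failure chain out of a longer log, something like the
following might help. It is only a sketch: the regular expression is written
against the crmd/cib line shape shown above, it assumes each entry sits on a
single line (i.e. not wrapped by a mail client), and the function name is
made up for illustration.

```python
import re

# Matches lines shaped like the crmd/cib entries above, e.g.
# "Nov 17 10:55:26 node2 crmd: [22808]: ERROR: do_exit: Could not recover ..."
# The corosync "[pcmk  ]" lines have a different shape and are skipped.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<proc>\w+): \[(?P<pid>\d+)\]: (?P<level>\w+): "
    r"(?P<func>\w+): (?P<msg>.*)$"
)


def entries(lines, proc="crmd", level="ERROR"):
    """Return (function, message) pairs for matching entries, in log order."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip())
        if m and m["proc"] == proc and m["level"] == level:
            hits.append((m["func"], m["msg"]))
    return hits
```

For example, entries(open(logfile)) with a hypothetical logfile path lists
the ERROR entries from crmd in order, which for the excerpt above would
start at do_lrm_control.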



[Bug 676391] Re: do_lrm_control: Failed to sign on to the LRM after upgrade to Maverick

2010-11-17 Thread flashydave
Also, on node 1 (where I think lrmd is running) I have:

crm_mon[26251]: 2010/11/17_21:00:32 info: determine_online_status: Node node1 is standby
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: failover-ip_monitor_0 on node1 returned 0 (ok) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation failover-ip_monitor_0 found resource failover-ip active on node1
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: drbd_disk:0_monitor_0 on node1 returned 8 (master) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation drbd_disk:0_monitor_0 found resource drbd_disk:0 active in master mode on node1
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: fs_drbd_monitor_0 on node1 returned 0 (ok) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation fs_drbd_monitor_0 found resource fs_drbd active on node1
crm_mon[26251]: 2010/11/17_21:00:32 debug: unpack_rsc_op: vsftpd_monitor_0 on node1 returned 0 (ok) instead of the expected value: 7 (not running)
crm_mon[26251]: 2010/11/17_21:00:32 notice: unpack_rsc_op: Operation vsftpd_monitor_0 found resource vsftpd active on node1

which suggests that the RAs aren't reporting correctly.
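
To list every operation whose return code differed from the expected one,
the debug lines above can be parsed with a small script. This is a sketch
only: the pattern is written against the unpack_rsc_op wording shown above,
assumes unwrapped lines, and the function name is invented for illustration.

```python
import re

# Matches the crm_mon debug lines above, e.g.
# "... unpack_rsc_op: vsftpd_monitor_0 on node1 returned 0 (ok) instead of
#  the expected value: 7 (not running)"
PROBE_RE = re.compile(
    r"unpack_rsc_op: (?P<op>\S+) on (?P<node>\S+) returned "
    r"(?P<rc>\d+) \((?P<rc_text>[^)]*)\) instead of the expected value: "
    r"(?P<want>\d+) \((?P<want_text>[^)]*)\)"
)


def unexpected_results(lines):
    """Yield (operation, node, actual rc, expected rc) for each mismatch line."""
    for line in lines:
        m = PROBE_RE.search(line)
        if m:
            yield m["op"], m["node"], int(m["rc"]), int(m["want"])
```

Running list(unexpected_results(lines)) over the crm_mon output gives one
tuple per resource, which makes it easier to see whether every monitor_0
operation on a node is affected or only some.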
