Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-22 Thread Andrew Beekhof

On 17/05/2013, at 1:15 PM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

 I'm attaching 3 patches I made fairly quickly to fix the installation issues 
 and also an issue I noticed with the ping ocf from the latest pacemaker. 
 
 One is for cluster-glue to prevent lrmd from building and later installing. 
 May also want to modify this patch to take lrmd out of both spec files 
 included when you download the source if you plan to build an rpm. I'm not 
 sure if what I did here is the best way to approach this problem so if anyone 
 has anything better please let me know.
 
 One is for pacemaker to create the lrmd symlink when building with heartbeat 
 support.

I can't apply this one until the cluster-glue one is in common use.
Otherwise rpm will instead refuse to install pacemaker because both it and 
cluster-glue contain the same file.

 Note the spec does not need anything changed here.
 
 Finally, saw the following errors in messages with the latest ping ocf and 
 the attached patch seems to fix the issue.

This is a slightly better fix:

diff --git a/extra/resources/ping b/extra/resources/ping
index abb631e..b9a69b8 100755
--- a/extra/resources/ping
+++ b/extra/resources/ping
@@ -305,6 +305,7 @@ ping_update() {
 : ${OCF_RESKEY_attempts:=3}
 : ${OCF_RESKEY_multiplier:=1}
 : ${OCF_RESKEY_debug:=false}
+: ${OCF_RESKEY_failure_score:=0}
 
 : ${OCF_RESKEY_CRM_meta_timeout:=2}
 : ${OCF_RESKEY_CRM_meta_globally_unique:=true}
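
A minimal sketch of why the one-line default in the diff above silences the error, assuming the failing line is a numeric `[` test on an unset OCF_RESKEY_failure_score. The agent's actual line 296 is not quoted in this thread, so the comparison and the `score` variable below are hypothetical stand-ins:

```shell
#!/bin/sh
# Hypothetical reduction of the failing test; not the real ping agent code.
unset OCF_RESKEY_failure_score
score=1

# Without a default the variable expands to an empty string, and the numeric
# test fails with an error such as "integer expression expected".
if [ "$score" -lt "$OCF_RESKEY_failure_score" ] 2>/dev/null; then
    before="below"
else
    before="not-below-or-error"
fi

# The idiom from the diff: assign 0 only if the variable is unset or empty.
: ${OCF_RESKEY_failure_score:=0}

if [ "$score" -lt "$OCF_RESKEY_failure_score" ]; then
    after="below"
else
    after="not-below"
fi

echo "before=$before after=$after failure_score=$OCF_RESKEY_failure_score"
```

Because `:=` assigns only when the variable is unset or empty, a failure_score configured on the resource would still take precedence over the default.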


 
 May 16 01:10:13 node2 lrmd[16133]:   notice: operation_finished: 
 p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line 296: 
 [: : integer expression expected ]   
 cluster-glue-no-lrmd.patch
 pacemaker-lrmd-hb.patch
 pacemaker-ping-failure.patch
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-17 Thread Nikita Michalko
I'm just wondering: why is lrm gone?

TIA!

Nikita Michalko


On Friday, 17 May 2013 at 05:15:10, Andrew Widdersheim wrote:
 I'm attaching 3 patches I made fairly quickly to fix the installation
  issues and also an issue I noticed with the ping ocf from the latest
  pacemaker. 
 
 One is for cluster-glue to prevent lrmd from building and later installing.
  May also want to modify this patch to take lrmd out of both spec files
  included when you download the source if you plan to build an rpm. I'm not
  sure if what I did here is the best way to approach this problem so if
  anyone has anything better please let me know.
 
 One is for pacemaker to create the lrmd symlink when building with
  heartbeat support. Note the spec does not need anything changed here.
 
 Finally, saw the following errors in messages with the latest ping ocf and
  the attached patch seems to fix the issue.
 
 May 16 01:10:13 node2 lrmd[16133]:   notice: operation_finished:
  p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line
  296: [: : integer expression expected ]
 



Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-17 Thread Lars Marowsky-Bree
On 2013-05-17T14:15:00, Nikita Michalko michalko.sys...@a-i-p.com wrote:

 I'm just wondering: why is lrm gone?

Rewritten by the pacemaker project upstream, which prefers to no longer
build with cluster-glue at all.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde




Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Widdersheim
I just tried the patch you gave me and it worked fine. Are there any plans to
include this patch officially, or was it a one-off? Aside from this patch, I
guess the only things needed to get everything working are to install things
slightly differently and to add a symlink from cluster-glue's lrmd to
pacemaker's.



Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Beekhof

On 17/05/2013, at 11:38 AM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

 Just tried the patch you gave and it worked fine. Any plans on putting this 
 patch in officially or was this a one off?

It will be in 1.1.10-rc3 soon

 Aside from this patch, I guess the only things needed to get everything
 working are to install things slightly differently and to add a symlink from
 cluster-glue's lrmd to pacemaker's.

Excellent

 


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-16 Thread Andrew Widdersheim
I'm attaching three patches I made fairly quickly to fix the installation
issues, plus an issue I noticed with the ping OCF agent from the latest
pacemaker.

The first is for cluster-glue, to prevent lrmd from being built and later
installed. You may also want to modify this patch to take lrmd out of both spec
files included with the source download if you plan to build an rpm. I'm not
sure that what I did here is the best way to approach this problem, so if
anyone has anything better please let me know.

The second is for pacemaker, to create the lrmd symlink when building with
heartbeat support. Note that the spec does not need anything changed here.

Finally, I saw the following errors in the messages log with the latest ping
OCF agent, and the attached patch seems to fix the issue.

May 16 01:10:13 node2 lrmd[16133]:   notice: operation_finished: 
p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line 296: 
[: : integer expression expected ] 

cluster-glue-no-lrmd.patch
Description: Binary data


pacemaker-lrmd-hb.patch
Description: Binary data


pacemaker-ping-failure.patch
Description: Binary data


[Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
I am running the following versions:

pacemaker-1.1.10-rc2
cluster-glue-1.0.11
heartbeat-3.0.5

I was running pacemaker-1.1.6 and things were working fine but after updating 
to the latest I could not get pacemaker to start with the following message 
repeated in the logs:

crmd[8456]:  warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) 
times

Here is strace output from the crmd process:

0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.21 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.000574 socket(PF_FILE, SOCK_STREAM, 0) = 6
0.42 fcntl(6, F_GETFD)         = 0
0.25 fcntl(6, F_SETFD, FD_CLOEXEC) = 0
0.21 fcntl(6, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
0.55 connect(6, {sa_family=AF_FILE, path=@lrmd}, 110) = -1 ECONNREFUSED 
(Connection refused)
0.50 close(6)                  = 0
0.31 shutdown(4294967295, 2 /* send and receive */) = -1 EBADF (Bad file 
descriptor)
0.24 close(4294967295)         = -1 EBADF (Bad file descriptor)
0.39 write(2, Could not establish lrmd connect..., 62) = 62
0.58 sendto(3, 28May 14 18:54:51 crmd[8456]: ..., 104, MSG_NOSIGNAL, 
NULL, 0) = 104
0.000327 times({tms_utime=0, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 
430616237
0.28 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.25 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.26 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)

I'm not quite sure what the issue is. At first I thought it might have been a
permissions issue, but I'm no longer sure that is the case. Any help would be
appreciated. I can forward along any more details to help in troubleshooting.
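
The strace above shows connect() to the abstract socket @lrmd being refused. A minimal sketch of one way to check whether anything is bound to that name on Linux, assuming /proc/net/unix is available and lists abstract sockets with a leading @. The helper name `abstract_listeners` is hypothetical, and the sample rows are made up for illustration:

```shell
#!/bin/sh
# abstract_listeners NAME: read /proc/net/unix-style text on stdin and print
# the lines whose last column is the abstract socket name @NAME.
abstract_listeners() {
    awk -v want="@$1" '$NF == want'
}

# Illustrative, made-up sample in the /proc/net/unix column layout.
sample='Num       RefCount Protocol Flags    Type St Inode Path
0000000000000000: 00000002 00000000 00010000 0001 01 12345 @lrmd
0000000000000000: 00000002 00000000 00010000 0001 01 12346 /var/run/example.sock'

matches=$(printf '%s\n' "$sample" | abstract_listeners lrmd)
echo "$matches"
```

On a live node the same filter would be run as `abstract_listeners lrmd < /proc/net/unix`; an empty result would be consistent with the ECONNREFUSED seen in the trace.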


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread David Vossel
- Original Message -
 From: Andrew Widdersheim awiddersh...@hotmail.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Wednesday, May 15, 2013 7:53:56 AM
 Subject: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the 
 LRM 7
 
 I am running the following versions:
 
 pacemaker-1.1.10-rc2
 cluster-glue-1.0.11
 heartbeat-3.0.5

what libqb version do you have?

 
 I was running pacemaker-1.1.6 and things were working fine but after updating
 to the latest I could not get pacemaker to start with the following message
 repeated in the logs:
 
 crmd[8456]:  warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max)
 times
 
 Here is strace output from the crmd process:

 0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource
 temporarily unavailable)
 0.21 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
 0.000574 socket(PF_FILE, SOCK_STREAM, 0) = 6
 0.42 fcntl(6, F_GETFD)         = 0
 0.25 fcntl(6, F_SETFD, FD_CLOEXEC) = 0
 0.21 fcntl(6, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
 0.55 connect(6, {sa_family=AF_FILE, path=@lrmd}, 110) = -1 ECONNREFUSED
 (Connection refused)
 0.50 close(6)                  = 0
 0.31 shutdown(4294967295, 2 /* send and receive */) = -1 EBADF (Bad file
 descriptor)
 0.24 close(4294967295)         = -1 EBADF (Bad file descriptor)
 0.39 write(2, Could not establish lrmd connect..., 62) = 62
 0.58 sendto(3, 28May 14 18:54:51 crmd[8456]: ..., 104, MSG_NOSIGNAL,
 NULL, 0) = 104
 0.000327 times({tms_utime=0, tms_stime=1, tms_cutime=0, tms_cstime=0}) =
 430616237
 0.28 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource
 temporarily unavailable)
 0.25 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
 0.26 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource
 temporarily unavailable)
 0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
 0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource
 temporarily unavailable)
 0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
 
 I'm not quite sure what the issue is. At first I thought it might have been
 some type of permissions issues but I'm not quite sure that is the case
 anymore. Any help would be appreciated. I can forward along any more
 details to help in troubleshooting.

Is there anything in the logs that indicates a problem with the lrmd component?
Do you see lrmd listed in the 'ps -axf' output?

-- Vossel
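
A minimal sketch of the process check suggested above, written so that it also shows which lrmd binary is running. The helper name `lrmd_lines` is hypothetical, and the sample lines are trimmed from the ps output posted later in this thread:

```shell
#!/bin/sh
# lrmd_lines: keep only the lines whose command is an lrmd binary,
# whatever directory it lives in.
lrmd_lines() {
    grep -E '(^|[ /])lrmd( |$)'
}

# Sample input in "pid args" form.
sample='9565 /usr/lib64/heartbeat/lrmd -r
26534 /usr/lib64/heartbeat/crmd'

found=$(printf '%s\n' "$sample" | lrmd_lines)
echo "$found"
```

Against a live system the filter would be fed from `ps -eo pid,args | lrmd_lines`; the path in the output tells you whether heartbeat's or pacemaker's lrmd was started.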



Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
These are the libqb versions:

libqb-devel-0.14.2-3.el6.x86_64
libqb-0.14.2-3.el6.x86_64

Here is a process listing where lrmd is running:

[root@node1 ~]# ps auxwww | egrep "heartbeat|pacemaker"
root      9553  0.1  0.7  52420  7424 ?        SLs  May14   1:39 heartbeat: master control process
root      9556  0.0  0.7  52260  7264 ?        SL   May14   0:10 heartbeat: FIFO reader
root      9557  0.0  0.7  52256  7260 ?        SL   May14   1:01 heartbeat: write: mcast eth0
root      9558  0.0  0.7  52256  7260 ?        SL   May14   0:14 heartbeat: read: mcast eth0
root      9559  0.0  0.7  52256  7260 ?        SL   May14   0:23 heartbeat: write: bcast eth1
root      9560  0.0  0.7  52256  7260 ?        SL   May14   0:13 heartbeat: read: bcast eth1
498       9563  0.0  0.2  36908  2392 ?        S    May14   0:10 /usr/lib64/heartbeat/ccm
498       9564  0.0  1.0  85084 10704 ?        S    May14   0:25 /usr/lib64/heartbeat/cib
root      9565  0.0  0.1  44588  1896 ?        S    May14   0:04 /usr/lib64/heartbeat/lrmd -r
root      9566  0.0  0.3  83544  3988 ?        S    May14   0:10 /usr/lib64/heartbeat/stonithd
498       9567  0.0  0.3  78668  3248 ?        S    May14   0:10 /usr/lib64/heartbeat/attrd
498      26534  0.0  0.3  92364  3748 ?        S    16:05   0:00 /usr/lib64/heartbeat/crmd
498      26535  0.0  0.2  72840  2708 ?        S    16:05   0:00 /usr/libexec/pacemaker/pengine

Here are the logs at startup until the Failed to sign on message just starts
to repeat over and over:

May 15 16:07:06 node1 crmd[26621]:   notice: main: CRM Git Version: b060cae
May 15 16:07:06 node1 attrd[26620]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 attrd[26620]:   notice: main: Starting mainloop...
May 15 16:07:06 node1 stonith-ng[26619]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 cib[26617]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 lrmd: [26618]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
May 15 16:07:06 node1 lrmd: [26618]: info: max-children set to 4 (1 processors online)
May 15 16:07:06 node1 lrmd: [26618]: info: enabling coredumps
May 15 16:07:06 node1 lrmd: [26618]: info: Started.
May 15 16:07:06 node1 cib[26617]:  warning: ccm_connect: CCM Activation failed
May 15 16:07:06 node1 cib[26617]:  warning: ccm_connect: CCM Connection failed 1 times (30 max)
May 15 16:07:06 node1 ccm: [26616]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
May 15 16:07:06 node1 ccm: [26616]: info: Hostname: node1
May 15 16:07:07 node1 crmd[26621]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
May 15 16:07:09 node1 cib[26617]:  warning: ccm_connect: CCM Activation failed
May 15 16:07:09 node1 cib[26617]:  warning: ccm_connect: CCM Connection failed 2 times (30 max)
May 15 16:07:10 node1 crmd[26621]:  warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
May 15 16:07:13 node1 crmd[26621]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:14 node1 cib[26617]:   notice: crm_update_peer_state: crm_update_ccm_node: Node node2[1] - state is now member (was (null))
May 15 16:07:14 node1 cib[26617]:   notice: crm_update_peer_state: crm_update_ccm_node: Node node1[0] - state is now member (was (null))
May 15 16:07:15 node1 crmd[26621]:  warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times

Here are the repeating message pieces:

May 15 16:06:09 node1 crmd[26534]:    error: do_lrm_control: Failed to sign on to the LRM 30 (max) times
May 15 16:06:09 node1 crmd[26534]:    error: do_log: FSA: Input I_ERROR from do_lrm_control() received in state S_STARTING
May 15 16:06:09 node1 crmd[26534]:  warning: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_lrm_control ]
May 15 16:06:09 node1 crmd[26534]:  warning: do_recover: Fast-tracking shutdown in response to errors
May 15 16:06:09 node1 crmd[26534]:    error: do_started: Start cancelled... S_RECOVERY
May 15 16:06:09 node1 crmd[26534]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
May 15 16:06:09 node1 crmd[26534]:   notice: do_lrm_control: Disconnected from the LRM
May 15 16:06:09 node1 ccm: [9563]: info: client (pid=26534) removed from ccm
May 15 16:06:09 node1 crmd[26534]:    error: do_exit: Could not recover from internal error
May 15 16:06:09 node1 crmd[26534]:    error: crm_abort: crm_glib_handler: Forked child 26540 to record non-fatal assert at logging.c:63 : g_hash_table_size: assertion `hash_table != NULL' failed
May 15 16:06:09 node1 crmd[26534]:    error: crm_abort: crm_glib_handler: Forked child 26541 to record non-fatal assert at logging.c:63 : g_hash_table_destroy: assertion `hash_table

Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Beekhof

On 16/05/2013, at 10:21 AM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

 These are the libqb versions:
 
 libqb-devel-0.14.2-3.el6.x86_64
 libqb-0.14.2-3.el6.x86_64
 
 Here is a process listing where lrmd is running:
 [root@node1 ~]# ps auxwww | egrep heartbeat|pacemaker
 root  9553  0.1  0.7  52420  7424 ?SLs  May14   1:39 heartbeat: 
 master control process
 root  9556  0.0  0.7  52260  7264 ?SL   May14   0:10 heartbeat: 
 FIFO reader
 root  9557  0.0  0.7  52256  7260 ?SL   May14   1:01 heartbeat: 
 write: mcast eth0
 root  9558  0.0  0.7  52256  7260 ?SL   May14   0:14 heartbeat: 
 read: mcast eth0
 root  9559  0.0  0.7  52256  7260 ?SL   May14   0:23 heartbeat: 
 write: bcast eth1
 root  9560  0.0  0.7  52256  7260 ?SL   May14   0:13 heartbeat: 
 read: bcast eth1
 498   9563  0.0  0.2  36908  2392 ?SMay14   0:10 
 /usr/lib64/heartbeat/ccm
 498   9564  0.0  1.0  85084 10704 ?SMay14   0:25 
 /usr/lib64/heartbeat/cib
 root  9565  0.0  0.1  44588  1896 ?SMay14   0:04 
 /usr/lib64/heartbeat/lrmd -r

Heartbeat is starting the wrong lrmd by the looks of it.
Is /usr/lib64/heartbeat/lrmd the same as /usr/libexec/pacemaker/lrmd?
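
A minimal sketch of one way to answer that question, assuming a byte-for-byte comparison with `cmp` is good enough. The helper name `same_binary` is hypothetical:

```shell
#!/bin/sh
# same_binary A B: report whether two paths hold identical content.
# cmp follows symlinks, so a symlink to the other file counts as identical.
same_binary() {
    if [ ! -e "$1" ] || [ ! -e "$2" ]; then
        echo "missing"
    elif cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

With the paths from this thread the call would be `same_binary /usr/lib64/heartbeat/lrmd /usr/libexec/pacemaker/lrmd`; "different" would suggest that heartbeat is launching cluster-glue's lrmd rather than pacemaker's.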

 root  9566  0.0  0.3  83544  3988 ?SMay14   0:10 
 /usr/lib64/heartbeat/stonithd
 498   9567  0.0  0.3  78668  3248 ?SMay14   0:10 
 /usr/lib64/heartbeat/attrd
 498  26534  0.0  0.3  92364  3748 ?S16:05   0:00 
 /usr/lib64/heartbeat/crmd
 498  26535  0.0  0.2  72840  2708 ?S16:05   0:00 
 /usr/libexec/pacemaker/pengine
 
 
 Here are the logs at startup until the Failed to sign on message just 
 starts to repeat over and over:
 May 15 16:07:06 node1 crmd[26621]:   notice: main: CRM Git Version: b060cae
 May 15 16:07:06 node1 attrd[26620]:   notice: crm_cluster_connect: Connecting 
 to cluster infrastructure: heartbeat
 May 15 16:07:06 node1 attrd[26620]:   notice: main: Starting mainloop...
 May 15 16:07:06 node1 stonith-ng[26619]:   notice: crm_cluster_connect: 
 Connecting to cluster infrastructure: heartbeat
 May 15 16:07:06 node1 cib[26617]:   notice: crm_cluster_connect: Connecting 
 to cluster infrastructure: heartbeat
 May 15 16:07:06 node1 lrmd: [26618]: WARN: Initializing connection to logging 
 daemon failed. Logging daemon may not be running
 May 15 16:07:06 node1 lrmd: [26618]: info: max-children set to 4 (1 
 processors online)
 May 15 16:07:06 node1 lrmd: [26618]: info: enabling coredumps
 May 15 16:07:06 node1 lrmd: [26618]: info: Started.
 May 15 16:07:06 node1 cib[26617]:  warning: ccm_connect: CCM Activation failed
 May 15 16:07:06 node1 cib[26617]:  warning: ccm_connect: CCM Connection 
 failed 1 times (30 max)
 May 15 16:07:06 node1 ccm: [26616]: WARN: Initializing connection to logging 
 daemon failed. Logging daemon may not be running
 May 15 16:07:06 node1 ccm: [26616]: info: Hostname: node1
 May 15 16:07:07 node1 crmd[26621]:  warning: do_cib_control: Couldn't 
 complete CIB registration 1 times... pause and retry
 May 15 16:07:09 node1 cib[26617]:  warning: ccm_connect: CCM Activation failed
 May 15 16:07:09 node1 cib[26617]:  warning: ccm_connect: CCM Connection 
 failed 2 times (30 max)
 May 15 16:07:10 node1 crmd[26621]:  warning: do_cib_control: Couldn't 
 complete CIB registration 2 times... pause and retry
 May 15 16:07:13 node1 crmd[26621]:   notice: crm_cluster_connect: Connecting 
 to cluster infrastructure: heartbeat
 May 15 16:07:14 node1 cib[26617]:   notice: crm_update_peer_state: 
 crm_update_ccm_node: Node node2[1] - state is now member (was (null))
 May 15 16:07:14 node1 cib[26617]:   notice: crm_update_peer_state: 
 crm_update_ccm_node: Node node1[0] - state is now member (was (null))
 May 15 16:07:15 node1 crmd[26621]:  warning: do_lrm_control: Failed to sign 
 on to the LRM 1 (30 max) times
 
 Here are the repeating message pieces:
 May 15 16:06:09 node1 crmd[26534]:error: do_lrm_control: Failed to sign 
 on to the LRM 30 (max) times
 May 15 16:06:09 node1 crmd[26534]:error: do_log: FSA: Input I_ERROR from 
 do_lrm_control() received in state S_STARTING
 May 15 16:06:09 node1 crmd[26534]:  warning: do_state_transition: State 
 transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL 
 origin=do_lrm_control ]
 May 15 16:06:09 node1 crmd[26534]:  warning: do_recover: Fast-tracking 
 shutdown in response to errors
 May 15 16:06:09 node1 crmd[26534]:error: do_started: Start cancelled... 
 S_RECOVERY
 May 15 16:06:09 node1 crmd[26534]:error: do_log: FSA: Input I_TERMINATE 
 from do_recover() received in state S_RECOVERY
 May 15 16:06:09 node1 crmd[26534]:   notice: do_lrm_control: Disconnected 
 from the LRM
 May 15 16:06:09 node1 ccm: [9563]: info: client (pid=26534) removed from ccm
 May 15 16:06:09 node1 crmd[26534]:error: do_exit: Could not recover from 
 internal error
 May 15 16:06:09 node1 crmd[26534]:error: 

Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
There are quite a few symlinks from heartbeat pieces back to pacemaker pieces,
crmd for example, but lrmd was not one of them:

[root@node1 ~]# ls -lha /usr/lib64/heartbeat/crmd
lrwxrwxrwx 1 root root 27 May 14 17:31 /usr/lib64/heartbeat/crmd -> /usr/libexec/pacemaker/crmd

[root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
-rwxr-xr-x 1 root root 85K May 14 17:19 /usr/lib64/heartbeat/lrmd

I just tried to symlink it back by hand, but when I started heartbeat the logs
had nothing about lrmd starting or trying to start, nor did lrmd show up in the
process list anymore. Just more failure messages.

[root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
lrwxrwxrwx 1 root root 27 May 15 19:38 /usr/lib64/heartbeat/lrmd -> /usr/libexec/pacemaker/lrmd

I then started lrmd manually as root with the verbose option turned on, and it
looks like things started to connect; the cluster on node1, where I started
lrmd manually, began coming online and working a bit. I noticed that
pacemaker's lrmd no longer has a -r option, which, looking at my old ps output,
was how it was being started:

[root@node1 ~]# /usr/libexec/pacemaker/lrmd --help
lrmd - Pacemaker Remote daemon for extending pacemaker functionality to remote nodes.
Usage: lrmd [options]
Options:
 -?, --help             This text
 -$, --version          Version information
 -V, --verbose          Increase debug output
 -l, --logfile=value    Send logs to the additional named logfile

This is what heartbeat's lrmd looks like:

[root@node1 ~]# /usr/lib64/heartbeat/lrmd.bak --help
/usr/lib64/heartbeat/lrmd.bak: invalid option -- '-'
usage: lrmd [-srkhv]
    s: status
    r: restart
    k: kill
    m: register to apphbd
    i: the interval of apphb
    h: help
    v: debug

Previous ps output:

root      9565  0.0  0.1  44588  1896 ?        S    May14   0:04 /usr/lib64/heartbeat/lrmd -r

I'm not sure what initially tries to spawn lrmd, but it is likely that will
need to change as well. Is all of this the result of a bad installation, did I
need to compile things differently, or is pacemaker too new and heartbeat too
old? Basically, what do I need to do to fix this?
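
The symlink inspection in this message can be sketched as a small helper that reports whether a path is a symlink and where it points. The function name `describe_link` is hypothetical, and `readlink` is assumed to be available, as it is on typical Linux systems:

```shell
#!/bin/sh
# describe_link PATH: say whether PATH is a symlink (and its target),
# an existing non-symlink, or missing.
describe_link() {
    if [ -L "$1" ]; then
        echo "symlink -> $(readlink "$1")"
    elif [ -e "$1" ]; then
        echo "not a symlink"
    else
        echo "missing"
    fi
}
```

Running `describe_link /usr/lib64/heartbeat/lrmd` before and after the manual change would distinguish cluster-glue's installed binary from a link to /usr/libexec/pacemaker/lrmd.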


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Beekhof

On 16/05/2013, at 2:03 PM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

 There are quite a few symlinks of heartbeat pieces back to pacemaker pieces 
 like crmd as an example but lrmd was not one of them:
 
 [root@node1 ~]# ls -lha /usr/lib64/heartbeat/crmd
 lrwxrwxrwx 1 root root 27 May 14 17:31 /usr/lib64/heartbeat/crmd -> 
 /usr/libexec/pacemaker/crmd
 
 [root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
 -rwxr-xr-x 1 root root 85K May 14 17:19 /usr/lib64/heartbeat/lrmd
 
 I just tried to symlink it back by hand but when I started heartbeat the logs 
 had nothing about lrmd starting/trying to start nor did lrmd show in the 
 process list anymore. Just more failure messages.
 
 [root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
 lrwxrwxrwx 1 root root 27 May 15 19:38 /usr/lib64/heartbeat/lrmd -> 
 /usr/libexec/pacemaker/lrmd
 
 I then started lrmd manually as root with the verbose option turned on and 
 looks like things started to connect and the cluster on node1 where I started 
 lrmd manually began coming online and work a bit. I noticed when running 
 pacemakers lrmd there is no longer a -r option which looking at my old ps 
 command was how it was getting started:
 
 [root@node1 ~]# /usr/libexec/pacemaker/lrmd --help
 lrmd - Pacemaker Remote daemon for extending pacemaker functionality to 
 remote nodes.
 Usage: lrmd [options]
 Options:
  -?, --help This text
  -$, --version  Version information
  -V, --verbose  Increase debug output
  -l, --logfile=valueSend logs to the additional named logfile
 
 This is what heartbeat's lrmd looks like.
 
 [root@node1 ~]# /usr/lib64/heartbeat/lrmd.bak --help
 /usr/lib64/heartbeat/lrmd.bak: invalid option -- '-'
 usage: lrmd [-srkhv]
 s: status
 r: restart
 k: kill
 m: register to apphbd
 i: the interval of apphb
 h: help
 v: debug
 
 Previous ps output:
 root  9565  0.0  0.1  44588  1896 ?SMay14   0:04 
 /usr/lib64/heartbeat/lrmd -r
 
 I'm not sure what initially tries to spawn lrmd

In your case, Heartbeat.

 but it is likely that will need to change as well. Is all of this the result 
 of a bad installation or did I need to compile things differently or is 
 pacemaker too new and heartbeat too old? Basically, what do I need to do to 
 fix.

Honestly, I'd recommend that you just stop fighting the distro you're on :-)
Just follow http://clusterlabs.org/quickstart-redhat.html to get what ships
with, and was tested for, RHEL 6.4.


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Beekhof

On 16/05/2013, at 2:52 PM, Andrew Beekhof and...@beekhof.net wrote:

 
 Honestly, I'd probably recommend to just stop fighting the distro you're on 
 :-)
 Just follow http://clusterlabs.org/quickstart-redhat.html to get what comes 
 with and was tested for RHEL 6.4

Although building with this patch would probably help:

   https://github.com/beekhof/pacemaker/commit/064b19e




Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Widdersheim
I'll look into moving over to the cman option, since that is now preferred for
RHEL 6.4 if I'm not mistaken. I'll also try out the patch provided and see how
that goes. So was lrmd not a part of pacemaker previously, and added later?
Was it originally a part of heartbeat/cluster-glue? I'm just trying to figure
out all of the pieces so that I know how to fix things if I choose to go down
that road.


Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7

2013-05-15 Thread Andrew Beekhof

On 16/05/2013, at 3:16 PM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

 I'll look into moving over to the cman option since that is preferred for 
 RHEL6.4 now if I'm not mistaken.

Correct

 I'll also try out the patch provided and see how that goes. So was lrmd not 
 a part of pacemaker previously, and added later? Was it originally a part of 
 heartbeat/cluster-glue? I'm just trying to figure out all of the pieces so 
 that I know how to fix things if I choose to go down that road.


Originally everything was part of heartbeat.
Then what was then called the crm became pacemaker and the lrmd v1 became 
part of cluster-glue (because the theory was that someone might use it for a 
pacemaker alternative).
That never happened and we stopped using almost everything else from 
cluster-glue, so when lrmd v2 was written, it was done so as part of pacemaker.

or, tl;dr - yes and yes :)