[Pacemaker] Broken links and broken bugzilla
I noticed a few broken links, and the bugzilla seems broken as well:

http://bugs.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
http://oss.clusterlabs.org/mailman/options/pacemaker

Pretty much a lot of stuff on the following page seems to need some love:

http://clusterlabs.org/wiki/Mailing_lists
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
After setting crmd-transition-delay to 2x my ping monitor interval, the issues I was seeing before in testing have not recurred. Thanks again for the help.
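For reference, crmd-transition-delay is a cluster property, so with the 5s ping monitor interval used in the configuration elsewhere in this thread the change would look something like the sketch below; the 10s value just illustrates the 2x rule:

  # set via the crm shell; the value assumes a 5s ping monitor interval
  crm configure property crmd-transition-delay=10s

  # verify it took effect
  crm configure show | grep crmd-transition-delay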
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
Have I just run into a shortcoming with pacemaker? Should I file a bug or RFE somewhere? It seems like there should be another parameter when setting up a pingd resource that tells the DC/policy engine to wait x seconds, so that all nodes have shared their connection state before it makes a decision about moving resources.
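For context, the closest existing knob is the dampen parameter on ocf:pacemaker:ping, which delays pushing connectivity attribute changes to the CIB so that reports from different nodes can settle before the policy engine reacts. A minimal sketch with an illustrative 15s value (the thread's actual configuration appears in a later message):

  primitive p_ping ocf:pacemaker:ping \
      params host_list=192.168.5.1 dampen=15s multiplier=1000 \
      op monitor interval=5s timeout=10s
  clone cl_ping p_ping

As discussed in this thread, dampen alone did not fully solve the problem; crmd-transition-delay (see the earlier reply) is what eventually worked.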
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
The cluster has 3 connections in total. The first is the outside interface where services communicate, and it is also used for cluster communication over multicast. The second is a cross-over used solely for cluster communication. The third is another cross-over used solely for DRBD replication. This issue happens when the first connection, the one used for both services and cluster communication, is pulled on both nodes at the same time.
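As a rough sketch of what that topology looks like in heartbeat's ha.cf (the directives are real heartbeat ones; the interface names and multicast group here are assumptions based on the description above):

  # /etc/ha.d/ha.cf (fragment)
  mcast eth0 239.0.0.1 694 1 0   # outside interface: services plus cluster traffic
  bcast eth1                     # dedicated cross-over for cluster communication
  # the third link (DRBD replication) is configured in drbd.conf, not here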
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
Thanks for the help. Adding another node to the ping host_list may help in some situations, but the root issue doesn't really get solved. Also, the location constraint you posted is very different from mine: your constraint requires connectivity, whereas the one I am trying to use looks for the best connectivity. I have used the location constraint you posted with success in the past, but I don't want my resources to be shut off in the event of a network outage that hits all nodes at the same time. Don't get me wrong, in some cluster configurations I do use the configuration you posted, but this setup is not one of them, for specific reasons.
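For readers comparing the two approaches, the difference looks roughly like this (resource and attribute names are taken from the configuration posted later in this thread):

  # "requires connectivity": ban the group from any node that cannot ping
  location l_connected g_mysql \
      rule -inf: not_defined pingd or pingd lte 0

  # "best connectivity": the score follows the pingd attribute value, so
  # resources stay put when every node loses connectivity at the same time
  location l_connected g_mysql \
      rule pingd: defined pingd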
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
Just tried the patch you gave and it worked fine. Any plans on putting this patch in officially, or was this a one-off? Aside from this patch, I guess the only thing needed to get things to work is to install things slightly differently and add a symlink from cluster-glue's lrmd to pacemaker's.

Subject: Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
From: and...@beekhof.net
Date: Thu, 16 May 2013 15:20:59 +1000
CC: pacemaker@oss.clusterlabs.org
To: awiddersh...@hotmail.com

On 16/05/2013, at 3:16 PM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

> I'll look into moving over to the cman option since that is preferred for RHEL6.4 now if I'm not mistaken.

Correct.

> I'll also try out the patch provided and see how that goes. So was LRMD not a part of pacemaker previously and later added? Was it originally a part of heartbeat/cluster-glue? I'm just trying to figure out all of the pieces so that I know how to fix things if I choose to go down that road.

Originally everything was part of heartbeat. Then what was then called the crm became pacemaker, and lrmd v1 became part of cluster-glue (because the theory was that someone might use it for a pacemaker alternative). That never happened, and we stopped using almost everything else from cluster-glue, so when lrmd v2 was written, it was done as part of pacemaker.

Or, tl;dr - yes and yes :)
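For anyone following along, the symlink workaround mentioned above amounts to something like this (paths taken from the later messages in this thread):

  # replace heartbeat/cluster-glue's lrmd v1 with a link to pacemaker's lrmd v2
  mv /usr/lib64/heartbeat/lrmd /usr/lib64/heartbeat/lrmd.bak
  ln -s /usr/libexec/pacemaker/lrmd /usr/lib64/heartbeat/lrmd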
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
I'm attaching 3 patches I made fairly quickly to fix the installation issues, and also an issue I noticed with the ping OCF agent in the latest pacemaker.

The first is for cluster-glue, to prevent lrmd from building and later installing. You may also want to modify this patch to take lrmd out of both spec files included when you download the source, if you plan to build an RPM. I'm not sure if what I did here is the best way to approach this problem, so if anyone has anything better please let me know.

The second is for pacemaker, to create the lrmd symlink when building with heartbeat support. Note the spec does not need anything changed here.

Finally, I saw the following errors in messages with the latest ping OCF agent, and the attached patch seems to fix the issue:

May 16 01:10:13 node2 lrmd[16133]: notice: operation_finished: p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line 296: [: : integer expression expected ]

Attachments: cluster-glue-no-lrmd.patch, pacemaker-lrmd-hb.patch, pacemaker-ping-failure.patch
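For context on that last error: "[: : integer expression expected" means an empty string reached a shell integer test. The actual fix is in pacemaker-ping-failure.patch above; purely to illustrate the general shape of such a fix (the variable name here is hypothetical, not taken from the real agent):

  # hypothetical guard: default an empty count to 0 before the arithmetic test
  active=${active:-0}
  if [ "$active" -gt 0 ]; then
      echo "at least one ping target is reachable"
  fi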
[Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
I am running the following versions:

pacemaker-1.1.10-rc2
cluster-glue-1.0.11
heartbeat-3.0.5

I was running pacemaker-1.1.6 and things were working fine, but after updating to the latest I could not get pacemaker to start, with the following message repeated in the logs:

crmd[8456]: warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times

Here is strace output from the crmd process:

0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
0.21 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.000574 socket(PF_FILE, SOCK_STREAM, 0) = 6
0.42 fcntl(6, F_GETFD) = 0
0.25 fcntl(6, F_SETFD, FD_CLOEXEC) = 0
0.21 fcntl(6, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
0.55 connect(6, {sa_family=AF_FILE, path=@lrmd}, 110) = -1 ECONNREFUSED (Connection refused)
0.50 close(6) = 0
0.31 shutdown(4294967295, 2 /* send and receive */) = -1 EBADF (Bad file descriptor)
0.24 close(4294967295) = -1 EBADF (Bad file descriptor)
0.39 write(2, "Could not establish lrmd connect"..., 62) = 62
0.58 sendto(3, "<28>May 14 18:54:51 crmd[8456]: "..., 104, MSG_NOSIGNAL, NULL, 0) = 104
0.000327 times({tms_utime=0, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 430616237
0.28 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
0.25 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.26 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)
0.23 recvfrom(5, 0xc513f9, 2487, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
0.23 poll([{fd=5, events=0}], 1, 0) = 0 (Timeout)

I'm not quite sure what the issue is. At first I thought it might have been some type of permissions issue, but I'm not sure that is the case anymore. Any help would be appreciated; I can forward along any more details to help in troubleshooting.
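Two quick checks that can help narrow this down (standard tools, not specific to this report): confirm whether anything is listening on the abstract @lrmd socket that the strace shows crmd trying to reach, and whether any lrmd process is running at all:

  ss -x | grep lrmd    # abstract unix sockets show up with a leading @
  pgrep -fl lrmd       # is any lrmd (heartbeat's or pacemaker's) running?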
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
Sorry to bring up old issues, but I am having the exact same problem as the original poster. A simultaneous disconnect on my two-node cluster causes the resources to start to transition to the other node, but mid-flight the transition is aborted and resources are started again on the original node when the cluster realizes connectivity is the same between the two nodes. I have tried various dampen settings without any luck. It seems the nodes report the outages at slightly different times, which results in a partial transition of resources instead of waiting to know the connectivity of all of the nodes in the cluster before taking action, which is what I would have thought dampen would help solve. Ideally the cluster wouldn't start the transition if another cluster node is having a connectivity issue as well, since connectivity status is shared between all cluster nodes. Find my configuration below. Let me know if there is something I can change to fix this, or if this behavior is expected.

primitive p_drbd ocf:linbit:drbd \
    params drbd_resource=r1 \
    op monitor interval=30s role=Slave \
    op monitor interval=10s role=Master
primitive p_fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd/by-res/r1 directory=/drbd/r1 fstype=ext4 options=noatime \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=180s \
    op monitor interval=30s timeout=40s
primitive p_mysql ocf:heartbeat:mysql \
    params binary=/usr/libexec/mysqld config=/drbd/r1/mysql/my.cnf datadir=/drbd/r1/mysql \
    op start interval=0 timeout=120s \
    op stop interval=0 timeout=120s \
    op monitor interval=30s \
    meta target-role=Started
primitive p_ping ocf:pacemaker:ping \
    params host_list=192.168.5.1 dampen=30s multiplier=1000 debug=true \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s \
    op monitor interval=5s timeout=10s
group g_mysql_group p_fs p_mysql \
    meta target-role=Started
ms ms_drbd p_drbd \
    meta notify=true master-max=1 clone-max=2 target-role=Started
clone cl_ping p_ping
location l_connected g_mysql \
    rule $id="l_connected-rule" pingd: defined pingd
colocation c_mysql_on_drbd inf: g_mysql ms_drbd:Master
order o_drbd_before_mysql inf: ms_drbd:promote g_mysql:start
property $id="cib-bootstrap-options" \
    dc-version=1.1.6-1.el6-8b6c6b9b6dc2627713f870850d20163fad4cc2a2 \
    cluster-infrastructure=Heartbeat \
    no-quorum-policy=ignore \
    stonith-enabled=false \
    cluster-recheck-interval=5m \
    last-lrm-refresh=1368632470
rsc_defaults $id="rsc-options" \
    migration-threshold=5 \
    resource-stickiness=200
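When experimenting with dampen values like the one above, it can help to watch the pingd attribute on each node while pulling the link; if your crm_mon supports it, -A displays transient node attributes:

  crm_mon -A1   # -A shows node attributes (including pingd), -1 prints once and exits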
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
These are the libqb versions:

libqb-0.14.2-3.el6.x86_64
libqb-devel-0.14.2-3.el6.x86_64

Here is a process listing where lrmd is running:

[root@node1 ~]# ps auxwww | egrep "heartbeat|pacemaker"
root   9553  0.1  0.7  52420  7424 ?  SLs  May14  1:39 heartbeat: master control process
root   9556  0.0  0.7  52260  7264 ?  SL   May14  0:10 heartbeat: FIFO reader
root   9557  0.0  0.7  52256  7260 ?  SL   May14  1:01 heartbeat: write: mcast eth0
root   9558  0.0  0.7  52256  7260 ?  SL   May14  0:14 heartbeat: read: mcast eth0
root   9559  0.0  0.7  52256  7260 ?  SL   May14  0:23 heartbeat: write: bcast eth1
root   9560  0.0  0.7  52256  7260 ?  SL   May14  0:13 heartbeat: read: bcast eth1
498    9563  0.0  0.2  36908  2392 ?  S    May14  0:10 /usr/lib64/heartbeat/ccm
498    9564  0.0  1.0  85084 10704 ?  S    May14  0:25 /usr/lib64/heartbeat/cib
root   9565  0.0  0.1  44588  1896 ?  S    May14  0:04 /usr/lib64/heartbeat/lrmd -r
root   9566  0.0  0.3  83544  3988 ?  S    May14  0:10 /usr/lib64/heartbeat/stonithd
498    9567  0.0  0.3  78668  3248 ?  S    May14  0:10 /usr/lib64/heartbeat/attrd
498   26534  0.0  0.3  92364  3748 ?  S    16:05  0:00 /usr/lib64/heartbeat/crmd
498   26535  0.0  0.2  72840  2708 ?  S    16:05  0:00 /usr/libexec/pacemaker/pengine

Here are the logs at startup until the "Failed to sign on" message just starts to repeat over and over:

May 15 16:07:06 node1 crmd[26621]: notice: main: CRM Git Version: b060cae
May 15 16:07:06 node1 attrd[26620]: notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 attrd[26620]: notice: main: Starting mainloop...
May 15 16:07:06 node1 stonith-ng[26619]: notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 cib[26617]: notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:06 node1 lrmd: [26618]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
May 15 16:07:06 node1 lrmd: [26618]: info: max-children set to 4 (1 processors online)
May 15 16:07:06 node1 lrmd: [26618]: info: enabling coredumps
May 15 16:07:06 node1 lrmd: [26618]: info: Started.
May 15 16:07:06 node1 cib[26617]: warning: ccm_connect: CCM Activation failed
May 15 16:07:06 node1 cib[26617]: warning: ccm_connect: CCM Connection failed 1 times (30 max)
May 15 16:07:06 node1 ccm: [26616]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
May 15 16:07:06 node1 ccm: [26616]: info: Hostname: node1
May 15 16:07:07 node1 crmd[26621]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
May 15 16:07:09 node1 cib[26617]: warning: ccm_connect: CCM Activation failed
May 15 16:07:09 node1 cib[26617]: warning: ccm_connect: CCM Connection failed 2 times (30 max)
May 15 16:07:10 node1 crmd[26621]: warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
May 15 16:07:13 node1 crmd[26621]: notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
May 15 16:07:14 node1 cib[26617]: notice: crm_update_peer_state: crm_update_ccm_node: Node node2[1] - state is now member (was (null))
May 15 16:07:14 node1 cib[26617]: notice: crm_update_peer_state: crm_update_ccm_node: Node node1[0] - state is now member (was (null))
May 15 16:07:15 node1 crmd[26621]: warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times

Here are the repeating message pieces:

May 15 16:06:09 node1 crmd[26534]: error: do_lrm_control: Failed to sign on to the LRM 30 (max) times
May 15 16:06:09 node1 crmd[26534]: error: do_log: FSA: Input I_ERROR from do_lrm_control() received in state S_STARTING
May 15 16:06:09 node1 crmd[26534]: warning: do_state_transition: State transition S_STARTING - S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_lrm_control ]
May 15 16:06:09 node1 crmd[26534]: warning: do_recover: Fast-tracking shutdown in response to errors
May 15 16:06:09 node1 crmd[26534]: error: do_started: Start cancelled... S_RECOVERY
May 15 16:06:09 node1 crmd[26534]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
May 15 16:06:09 node1 crmd[26534]: notice: do_lrm_control: Disconnected from the LRM
May 15 16:06:09 node1 ccm: [9563]: info: client (pid=26534) removed from ccm
May 15 16:06:09 node1 crmd[26534]: error: do_exit: Could not recover from internal error
May 15 16:06:09 node1 crmd[26534]: error: crm_abort: crm_glib_handler: Forked child 26540 to record non-fatal assert at logging.c:63 : g_hash_table_size: assertion `hash_table != NULL' failed
May 15 16:06:09 node1 crmd[26534]: error: crm_abort: crm_glib_handler: Forked child 26541 to record non-fatal assert at logging.c:63 : g_hash_table_destroy: assertion `hash_table
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
I attached logs from both nodes. Yes, we compiled 1.1.6 with heartbeat support for RHEL6.4. I tried 1.1.10 but had issues; I have another thread open on the mailing list for that issue as well. I'm not opposed to moving to CMAN or corosync if those fix the problem.

We have been using this setup, or something very similar, for about 2-3 years. Florian Haas actually came to our company to do a training for us when he was still at Linbit, and this is how we set it up then and have continued to do so since. We have never had an issue up until this point, because all of our clusters in the past were set up so that connectivity was required and it was expected that resources would shut down during an event like this.

May 15 13:42:00 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) logd is not running
May 15 13:42:00 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 2013/05/15_13:42:00 WARNING: 192.168.5.1 is inactive
May 15 13:42:09 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) logd is not running
May 15 13:42:09 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 2013/05/15_13:42:09 WARNING: 192.168.5.1 is inactive
May 15 13:42:18 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) logd is not running
May 15 13:42:18 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 2013/05/15_13:42:18 WARNING: 192.168.5.1 is inactive
May 15 13:42:27 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) logd is not running
May 15 13:42:27 node1 lrmd: [27346]: info: RA output: (p_ping:1:monitor:stderr) 2013/05/15_13:42:27 WARNING: 192.168.5.1 is inactive
May 15 13:42:30 node1 attrd: [27348]: notice: attrd_trigger_update: Sending flush op to all hosts for: pingd (0)
May 15 13:42:30 node1 attrd: [27348]: notice: attrd_perform_update: Sent update 238: pingd=0
May 15 13:42:30 node1 crmd: [27349]: info: abort_transition_graph: te_update_diff:164 - Triggered transition abort (complete=1, tag=nvpair, id=status-f5a576b5-003b-447d-8029-19202823bbfa-pingd, name=pingd, value=0, magic=NA, cib=0.75.78) : Transient attribute: update
May 15 13:42:30 node1 crmd: [27349]: info: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May 15 13:42:30 node1 crmd: [27349]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 15 13:42:30 node1 crmd: [27349]: info: do_pe_invoke: Query 362: Requesting the current CIB: S_POLICY_ENGINE
May 15 13:42:30 node1 crmd: [27349]: info: do_pe_invoke_callback: Invoking the PE: query=362, ref=pe_calc-dc-1368639750-564, seq=8, quorate=1
May 15 13:42:30 node1 pengine: [6643]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 15 13:42:30 node1 pengine: [6643]: notice: unpack_rsc_op: Operation p_drbd:1_last_failure_0 found resource p_drbd:1 active on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp: Start recurring monitor (30s) for p_fs on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp: Start recurring monitor (30s) for p_mysql on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd:0 on node1
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp: Start recurring monitor (10s) for p_drbd:1 on node2
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp: Start recurring monitor (30s) for p_drbd:0 on node1
May 15 13:42:30 node1 pengine: [6643]: notice: RecurringOp: Start recurring monitor (10s) for p_drbd:1 on node2
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Move p_fs#011(Started node1 - node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Move p_mysql#011(Started node1 - node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Demote p_drbd:0#011(Master - Slave node1)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Promote p_drbd:1#011(Slave - Master node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Leave p_ping:0#011(Started node2)
May 15 13:42:30 node1 pengine: [6643]: notice: LogActions: Leave p_ping:1#011(Started node1)
May 15 13:42:30 node1 crmd: [27349]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 15 13:42:30 node1 crmd: [27349]: info: unpack_graph: Unpacked transition 56: 40 actions in 40 synapses
May 15 13:42:30 node1 crmd: [27349]: info: do_te_invoke: Processing graph 56 (ref=pe_calc-dc-1368639750-564) derived from /var/lib/pengine/pe-input-64.bz2
May 15 13:42:30 node1 crmd: [27349]: info: te_pseudo_action: Pseudo action 23 fired and confirmed
May 15 13:42:30 node1 crmd: [27349]: info: te_rsc_command: Initiating action 7: cancel p_drbd:0_monitor_1 on node1 (local)
May 15 13:42:30 node1 lrmd: [18185]: WARN: For LSB init script, no additional
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
There are quite a few symlinks of heartbeat pieces back to pacemaker pieces, crmd for example, but lrmd was not one of them:

[root@node1 ~]# ls -lha /usr/lib64/heartbeat/crmd
lrwxrwxrwx 1 root root 27 May 14 17:31 /usr/lib64/heartbeat/crmd -> /usr/libexec/pacemaker/crmd
[root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
-rwxr-xr-x 1 root root 85K May 14 17:19 /usr/lib64/heartbeat/lrmd

I just tried to symlink it back by hand, but when I started heartbeat the logs had nothing about lrmd starting or trying to start, nor did lrmd show in the process list anymore. Just more failure messages.

[root@node1 ~]# ls -lha /usr/lib64/heartbeat/lrmd
lrwxrwxrwx 1 root root 27 May 15 19:38 /usr/lib64/heartbeat/lrmd -> /usr/libexec/pacemaker/lrmd

I then started lrmd manually as root with the verbose option turned on, and it looks like things started to connect; the cluster on node1, where I started lrmd manually, began coming online and working a bit. I noticed that when running pacemaker's lrmd there is no longer a -r option, which, looking at my old ps output, was how it was getting started:

[root@node1 ~]# /usr/libexec/pacemaker/lrmd --help
lrmd - Pacemaker Remote daemon for extending pacemaker functionality to remote nodes.

Usage: lrmd [options]
Options:
 -?, --help             This text
 -$, --version          Version information
 -V, --verbose          Increase debug output
 -l, --logfile=value    Send logs to the additional named logfile

This is what heartbeat's lrmd looks like:

[root@node1 ~]# /usr/lib64/heartbeat/lrmd.bak --help
/usr/lib64/heartbeat/lrmd.bak: invalid option -- '-'
usage: lrmd [-srkhv]
 s: status
 r: restart
 k: kill
 m: register to apphbd
 i: the interval of apphb
 h: help
 v: debug

Previous ps output:

root 9565 0.0 0.1 44588 1896 ? S May14 0:04 /usr/lib64/heartbeat/lrmd -r

I'm not sure what initially tries to spawn lrmd, but it is likely that will need to change as well. Is all of this the result of a bad installation, did I need to compile things differently, or is pacemaker too new and heartbeat too old? Basically, what do I need to do to fix it?
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
I'll look into moving over to the cman option since that is preferred for RHEL6.4 now, if I'm not mistaken. I'll also try out the patch provided and see how that goes.

So was LRMD not a part of pacemaker previously and later added? Was it originally a part of heartbeat/cluster-glue? I'm just trying to figure out all of the pieces so that I know how to fix things if I choose to go down that road.
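For the record, moving to the CMAN stack on RHEL 6.4 is mostly a packaging exercise; a rough outline along the lines of the Clusters from Scratch CMAN chapter (the cluster and node names are placeholders, and fencing still needs to be configured separately):

  yum install cman pacemaker
  ccs -f /etc/cluster/cluster.conf --createcluster mycluster
  ccs -f /etc/cluster/cluster.conf --addnode node1
  ccs -f /etc/cluster/cluster.conf --addnode node2
  service cman start && service pacemaker start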
[Pacemaker] Resource fails to stop
One of my resources failed to stop because it hit its timeout setting. The resource went into a failed state and froze the cluster until I manually fixed the problem. My question is: what is pacemaker's default action when it encounters a stop failure and STONITH is not enabled? Is it what I saw, where the resource goes into a failed state and the cluster doesn't try to start it anywhere until manual intervention, or does it continually try to stop it? The reason I ask is that I found the following link, which suggests to me that after the failure timeout is reached when stopping a resource and STONITH is not enabled, pacemaker will continually try to stop the resource until it succeeds:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html

"If STONITH is not enabled, then the cluster has no way to continue and will not try to start the resource elsewhere, but will try to stop it again after the failure timeout."

I am using pacemaker 1.1.5.
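Pacemaker's reaction here is governed by the on-fail setting of the stop operation: it defaults to fence when STONITH is enabled, and to block (freeze the resource and wait for manual intervention, which matches the behavior described above) when it is not. A minimal sketch of making that explicit, using a placeholder resource:

  primitive p_example ocf:pacemaker:Dummy \
      op stop interval=0 timeout=60s on-fail=block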
Re: [Pacemaker] Resource fails to stop
Ah, that makes sense. Thanks for helping me wrap my head around it. Working on setting up STONITH now to avoid this in the future.
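A minimal sketch of what a STONITH setup can look like in the crm shell; the agent choice and every parameter value below are placeholders that depend entirely on the hardware:

  crm configure primitive p_stonith stonith:external/ipmi \
      params hostname=node2 ipaddr=192.168.1.12 userid=admin passwd=secret
  crm configure property stonith-enabled=true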
Re: [Pacemaker] Resource active on both nodes
Update: I did a crm resource cleanup of everything and these messages started to look more sane. The logs now report these resources as active only on what is currently the active node.
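For reference, the cleanup that cleared the stale entries can also be scoped to a single resource rather than done for everything:

  crm resource cleanup p_syslog-ng   # drop stale operation history for one resource
  crm_resource -C -r p_syslog-ng     # the equivalent low-level command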
[Pacemaker] Resource active on both nodes
I'm seeing messages similar to the following:

Jul 5 14:34:06 server1 pengine: [423]: notice: unpack_rsc_op: Operation p_syslog-ng_monitor_0 found resource p_syslog-ng active on server2
Jul 5 14:34:06 server1 pengine: [423]: notice: unpack_rsc_op: Operation p_bacula_monitor_0 found resource p_bacula active on server2

The cluster is currently running fine, and none of these resources are started on server2 (currently the passive node), so I'm not clear on why I'm seeing these messages. All of the start and stop scripts are LSB compliant per:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html

There are no resource failures when doing crm_mon -rf. Is it possible that at some point these resources were active on both nodes and it was corrected, but these messages persist? Also, none of these resources are set to start at boot or anything of that nature.

OS: RHEL6
Pacemaker: 1.1.5
Heartbeat: 3.0.5
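A quick way to verify the status behavior that the linked appendix requires is to check the exit codes by hand on the passive node; a non-compliant status action (for example, returning 0 while the service is stopped) is exactly what makes a probe report a resource as active:

  /etc/init.d/syslog-ng status; echo $?   # LSB expects 3 when stopped, 0 when running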