On Sun, Oct 10, 2010 at 11:20 PM, Shravan Mishra <[email protected]> wrote:
> Andrew,
>
> We were able to solve our problem. Obviously, if no one else is having
> it, then it has to be our environment. It's just that the time pressure
> and management pressure were causing us to go really bonkers.
>
> We had been struggling with this for the past 4 days.
> So here is the story:
>
> We had the following versions of the HA libs on our appliance:
>
> heartbeat=3.0.0
> openais=1.0.0
> pacemaker=1.0.9
>
> When I started installing glue=1.0.3 on top of them, I started getting a
> bunch of conflicts, so I uninstalled heartbeat and openais and proceeded
> to install the following, in the given order:
>
> 1. glue=1.0.3
> 2. corosync=1.1.1
> 3. pacemaker=1.0.9
> 4. agents=1.0.3
>
> And that's when we started seeing this problem.
> So after 2 days of going nowhere with this, we said let's leave the
> packages as they are and try to install using the --replace-files option.
>
> We are using a build tool called conary, which has this option, rather
> than the standard make/make install.
>
> So we let the above heartbeat and openais remain as they were, and
> installed glue, corosync and pacemaker on top of them with the
> --replace-files option, this time with no conflicts, and bingo, it all
> works fine.
>
> So that sort of confused me: why do we still need heartbeat, given the
> above 4 packages?
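[For reference, the sequence that worked above amounts to something like
the following. This is a sketch only: the exact conary trove names and
labels depend on the appliance's repository, and only the package
versions and the --replace-files option come from the message above.]

    # Leave the existing heartbeat/openais packages in place and install
    # the new stack over them, letting conary replace conflicting files
    # (hypothetical invocations; trove specs may differ per repository):
    conary update glue=1.0.3 --replace-files
    conary update corosync=1.1.1 --replace-files
    conary update pacemaker=1.0.9 --replace-files
    conary update agents=1.0.3 --replace-files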
Strictly speaking, you don't. But at least on Fedora, the policy is that
$x-libs always requires $x, so just building against heartbeat-libs means
that yum will suck in the main heartbeat package :-(

Glad you found a path forward though.

> I understand that /usr/lib/ocf/resource.d/heartbeat has OCF scripts
> provided by heartbeat, but that can be part of the "Reusable cluster
> agents" subsystem.
>
> Frankly, I thought that the way I had installed the system, erasing the
> old packages and installing fresh ones, should have worked.
>
> But all said and done, I learned a lot of the cluster code by gdbing it.
> I'll be having a peaceful Thanksgiving.
>
> Thanks, and happy Thanksgiving.
> Shravan
>
> On Sun, Oct 10, 2010 at 2:46 PM, Andrew Beekhof <[email protected]> wrote:
>> Not enough information.
>> We'd need more than just the lrmd's logs; they only show what happened,
>> not why.
>>
>> On Thu, Oct 7, 2010 at 11:02 PM, Shravan Mishra
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> Description of my environment:
>>> corosync=1.2.8
>>> pacemaker=1.1.3
>>> Linux=2.6.29.6-0.6.smp.gcc4.1.x86_64 #1 SMP
>>>
>>> We are having a problem with our Pacemaker, which is continuously
>>> canceling the monitoring operation of our stonith devices.
>>>
>>> We ran:
>>>
>>> stonith -d -t external/safe/ipmi hostname=ha2.itactics.com
>>> ipaddr=192.168.2.7 userid=hellouser passwd=hello interface=lanplus -S
>>>
>>> Its output is attached as stonith.output.
>>>
>>> We have been trying to debug this issue for a few days now with no
>>> success. We are hoping that someone can help us, as we are under
>>> immense pressure to move to RCS unless we can solve this issue in a
>>> day or two, which I personally don't want to do, because we like the
>>> product.
>>>
>>> Any help will be greatly appreciated.
>>>
>>> Here is an excerpt from /var/log/messages:
>>> =========================
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11155: start
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11156: monitor
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation monitor[11156] on stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584, its parameters: CRM_meta_interval=[20000] target_role=[started] ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000] crm_feature_set=[3.0.2] CRM_meta_name=[monitor] hostname=[ha2.itactics.com] passwd=[ft01st0...@] userid=[safe_ipmi_admin] cancelled
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11157: stop
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11158: start
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11159: monitor
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation monitor[11159] on stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584, its parameters: CRM_meta_interval=[20000] target_role=[started] ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000] crm_feature_set=[3.0.2] CRM_meta_name=[monitor] hostname=[ha2.itactics.com] passwd=[ft01st0...@] userid=[safe_ipmi_admin] cancelled
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11160: stop
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11161: start
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11162: monitor
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation monitor[11162] on stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584, its parameters: CRM_meta_interval=[20000] target_role=[started] ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000] crm_feature_set=[3.0.2] CRM_meta_name=[monitor] hostname=[ha2.itactics.com] passwd=[ft01st0...@] userid=[safe_ipmi_admin] cancelled
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11163: stop
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11164: start
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11165: monitor
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation monitor[11165] on stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584, its parameters: CRM_meta_interval=[20000] target_role=[started] ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000] crm_feature_set=[3.0.2] CRM_meta_name=[monitor] hostname=[ha2.itactics.com] passwd=[ft01st0...@] userid=[safe_ipmi_admin] cancelled
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11166: stop
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11167: start
>>> Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11168: monitor
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: cancel_op: operation monitor[11168] on stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584, its parameters: CRM_meta_interval=[20000] target_role=[started] ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000] crm_feature_set=[3.0.2] CRM_meta_name=[monitor] hostname=[ha2.itactics.com] passwd=[ft01st0...@] userid=[safe_ipmi_admin] cancelled
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11169: stop
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11170: start
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: stonithRA plugin: got metadata:
>>> <?xml version="1.0"?>
>>> <!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
>>> <resource-agent name="external/safe/ipmi">
>>> <version>1.0</version>
>>> <longdesc lang="en">
>>> ipmitool based power management. Apparently, the power off method of
>>> ipmitool is intercepted by ACPI which then makes a regular shutdown.
>>> If case of a split brain on a two-node it may happen that no node
>>> survives. For two-node clusters use only the reset method.
>>> </longdesc>
>>> <shortdesc lang="en">IPMI STONITH external device</shortdesc>
>>> <parameters>
>>> <parameter name="hostname" unique="1">
>>> <content type="string" />
>>> <shortdesc lang="en">Hostname</shortdesc>
>>> <longdesc lang="en">The name of the host to be managed by this STONITH device.</longdesc>
>>> </parameter>
>>> <parameter name="ipaddr" unique="1">
>>> <content type="string" />
>>> <shortdesc lang="en">IP Address</shortdesc>
>>> <longdesc lang="en">The IP address of the STONITH device.</longdesc>
>>> </parameter>
>>> <parameter name="userid" unique="1">
>>> <content type="string" />
>>> <shortdesc lang="en">Login</shortdesc>
>>> <longdesc lang="en">The username used for logging in to the STONITH device.</longdesc>
>>> </parameter>
>>> <parameter name="passwd" unique="1">
>>> <content type="string" />
>>> <shortdesc lang="en">Password</shortdesc>
>>> <longdesc lang="en">The password used for logging in to the STONITH device.</longdesc>
>>> </parameter>
>>> <parameter name="interface" unique="1">
>>> <content type="string" default="lan"/>
>>> <shortdesc lang="en">IPMI interface</shortdesc>
>>> <longdesc lang="en">IPMI interface to use, such as "lan" or "lanplus".
>>> </longdesc>
>>> </parameter>
>>> </parameters>
>>> <actions>
>>> <action name="start" timeout="15" />
>>> <action name="stop" timeout="15" />
>>> <action name="status" timeout="15" />
>>> <action name="monitor" timeout="15" interval="15" start-delay="15" />
>>> <action name="meta-data" timeout="15" />
>>> </actions>
>>> <special tag="heartbeat">
>>> <version>2.0</version>
>>> </special>
>>> </resource-agent>
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11171: monitor
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: cancel_op: operation monitor[11171] on stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584, its parameters: CRM_meta_interval=[20000] target_role=[started] ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000] crm_feature_set=[3.0.2] CRM_meta_name=[monitor] hostname=[ha2.itactics.com] passwd=[ft01st0...@] userid=[safe_ipmi_admin] cancelled
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11172: stop
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11173: start
>>> Oct 7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11174: monitor
>>>
>>> ==========================
>>>
>>> Thanks
>>>
>>> Shravan

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
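[For reference, a stonith resource producing the monitor-operation
parameters seen in the log above (a 20000 ms monitor interval and a
180000 ms timeout) would look roughly like this in crm shell syntax.
This is a hypothetical reconstruction from the logged parameters, not
the poster's actual configuration; the password is elided in the log
and stays elided here.]

    # Hypothetical reconstruction from the logged op parameters
    # (CRM_meta_interval=20000 ms, CRM_meta_timeout=180000 ms):
    crm configure primitive ha2.itactics.com-stonith stonith:external/safe/ipmi \
        params hostname="ha2.itactics.com" ipaddr="192.168.2.7" \
               userid="safe_ipmi_admin" passwd="..." interface="lanplus" \
        op monitor interval="20s" timeout="180s" \
        meta target-role="Started"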
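[On Andrew's packaging note at the top of the thread: on Fedora-style
systems the dependency that drags in the main heartbeat package can be
inspected directly. A minimal sketch, assuming the heartbeat-libs
package he mentions is installed or available in a configured
repository:]

    # List what heartbeat-libs declares as hard requirements; per the
    # Fedora policy described above, this includes the main heartbeat package:
    rpm -q --requires heartbeat-libs
    # Or resolve the full dependency chain through the repositories:
    yum deplist heartbeat-libs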
