Re: [Pacemaker] Cluster Test with resource down
Yes, I did that. Apache stops, and the cluster status shows the failed action ("apache not running"), but the resource does not migrate to node2. So when I kill the process, the resource is not started automatically on node2.

Regards,
Nuwan

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Cluster Test with resource down
Did you try: killall -9 httpd ?

On Tue, Apr 17, 2012 at 2:40 PM, Nuwan Silva wrote:
> Thanks Andrew for the reply. Yes, I want to know how to test it.
> Regards,
> Nuwan
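To make the suggested test concrete, here is a hedged sketch of simulating a hard process failure and watching the cluster react. It assumes the Apache processes are named "httpd" and the resource is called "apache"; adjust both to your configuration, and run it on the node currently hosting the resource:

```shell
# Kill Apache outright; the next monitor operation should detect the failure.
killall -9 httpd

# One-shot cluster status: look for the failed action and where the
# resource ends up.
crm_mon -1

# Inspect the accumulated failcount for the resource on this node.
crm_failcount -r apache -G
```

Note that the resource only moves to the other node once its failcount reaches the resource's migration-threshold; without that parameter set, Pacemaker simply restarts it in place, which matches the behaviour Nuwan describes.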
Re: [Pacemaker] Periodically appear non-existent nodes
2012/4/17 Proskurin Kirill:
> On 04/17/2012 03:46 PM, ruslan usifov wrote:
>> 2012/4/17 Andreas Kurz:
>>> On 04/14/2012 11:14 PM, ruslan usifov wrote:
>>>> Hello
>>>>
>>>> I removed 2 nodes from the cluster with the following sequence:
>>>>
>>>> crm_node --force -R node1
>>>> crm_node --force -R node2
>>>> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
>>>> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
>>>> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
>>>> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
>>>>
>>>> The nodes are deleted after this, but if, for example, I restart (reboot)
>>>> one of the remaining nodes in the working cluster, the deleted nodes
>>>> appear again in OFFLINE state.
>
> I had this problem some time ago. I "solved" it with something like this:
>
> crm node delete NODENAME
> crm_node --force --remove NODENAME
> cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
> cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'

I do the same, but sometimes after a cluster reconfiguration (a node failed due to a power-supply failure) the removed nodes appear again; this has happened 3-4 times.
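Putting the advice from this thread together, a consolidated removal procedure looks roughly like the following. This is a sketch assuming a corosync/Pacemaker 1.1 stack without cman; NODENAME is a placeholder, and the exact XML the emails contained was mangled by the archive, so the `<node>`/`<node_state>` elements below are a reconstruction:

```shell
# 1. Make sure the cluster stack is stopped on the node being removed
#    (corosync/pacemaker must not be running there, and must stay stopped).

# 2. Tell the membership layer to forget the node:
crm_node --force --remove NODENAME

# 3. Remove the node entry and its status section from the CIB:
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'

# 4. Verify the node is gone:
crm_node -l
crm_mon -1
```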
Re: [Pacemaker] Using Cluster Test Suite (CTS) with "debug: on" in corosync.conf failed
Am 17.04.2012 03:12, schrieb Andrew Beekhof:
> On Fri, Apr 13, 2012 at 11:26 PM, Timo Schäpe wrote:
>> Am 10.04.2012 23:43, schrieb Andrew Beekhof:
>>> On Tue, Apr 10, 2012 at 11:05 PM, Timo Schäpe wrote:
>>>> Hi,
>>>>
>>>> for whom it may interest, here's something that cost me a whole day of
>>>> work: I used CTS to test my cluster configuration and it worked fine.
>>>> For debugging a resource agent I switched on the debug output in
>>>> corosync.conf:
>>>>
>>>> logging {
>>>>     [...]
>>>>     debug: on
>>>>     [...]
>>>> }
>>>>
>>>> After I fixed the bug, I forgot to switch the debug output off. This
>>>> caused most of the CTS tests to fail with this warning:
>>>>
>>>> CTS: Warn: Startup pattern not found: myhost crmd:.*The local CRM is operational
>>>>
>>>> After I switched the debug output off, CTS worked fine as before.
>>>
>>> We've since added a BadNews pattern that looks for syslog messages
>>> being dropped/throttled.
>>>
>>> How was your experience with CTS otherwise? Periodically I try to
>>> improve the usability so that eventually non-developers can use it;
>>> it would be interesting to hear how we're doing.
>>
>> It was easy for me to work with CTS. I read about the basics of CTS in
>> Michael Schwartzkopff's book (Clusterbau). Some deeper information about
>> the configuration came from the README, and that was enough to start
>> some tests.
>>
>> What I missed is a short explanation of the tests and the meaning of
>> their failures. For example, my cluster fails completely at the
>> ResourceRecover and Reattach tests, but I only know the meaning of the
>> ResourceRecover test, because there is an explanation in the
>> Schwartzkopff book. Maybe there is an online resource that I have
>> overlooked until now.
>
> There is
> http://www.clusterlabs.org/wiki/ReleaseTesting#List_of_Automated_Test_Cases
> But it's all but impossible to describe the failures, as that would be a
> list of every possible bug - basically there shouldn't be any failures.

Yes, thank you. That's what I want :).
>> And I am not sure how I can test a cluster with STONITH resources with
>> CTS. Should I use stonith-ssh?
>
> As Dejan mentioned, you can (and should) use whichever stonith device
> you would normally define. Are you having CTS create a configuration,
> or using the one you plan to use in production?

Yep, thanks to Dejan. I used the configuration that I want to use in production.

--
Dipl.-Inf. Timo Schaepe (Projekt- u. Entwicklungsteam)
Phone: +49 40 808077-650  Fax: +49 40 808077556  Mail: scha...@dfn-cert.de
DFN-CERT Services GmbH, https://www.dfn-cert.de/, Phone +49 40 808077-555
Sitz / Register: Hamburg, AG Hamburg, HRB 88805, Ust-IdNr.: DE 232129737
Sachsenstrasse 5, 20097 Hamburg, Germany. CEO: Dr. Klaus-Peter Kossakowski
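The lesson of this thread amounts to keeping debug logging off while CTS runs. As a minimal, hedged example of the corosync 1.x stanza involved (the surrounding fields are illustrative, only `debug` is the one the thread is about):

```
logging {
    to_syslog: yes
    debug: off    # leave this off for CTS runs: "on" floods syslog,
                  # messages get throttled, and CTS then misses its
                  # "The local CRM is operational" startup pattern
}
```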
Re: [Pacemaker] HA setup need to configure node transition if a resouce fails.
Please post the output of "crm configure show" - that will give us more information.

Il giorno 17 aprile 2012 15:22, Manish Punjabi ha scritto:
> Dear All,
>
> I have configured DRBD and Pacemaker for a highly available setup with
> two nodes. Each node has oracle and jboss resources along with a
> ClusterIP resource. I have set resource stickiness to INFINITY to keep
> services running on a stable node, but I also need to make sure that if
> any one service fails, the entire set of services moves to the other
> node. I have added colocation and ordering constraints to ensure they
> all run on a single node at a time. When I fail one service, such as
> jboss, it just restarts on the same machine, even though I have set the
> resource threshold to 1. What other configuration should I make so that
> everything moves to the other node on a single service failure?
>
> Thanks and Regards,
> Manish

--
this is my life and I live it as long as God wills
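Without the poster's actual configuration, here is a hedged sketch of the pattern being asked about, in crm shell syntax. All resource and constraint IDs (g_services, p_clusterip, p_oracle, p_jboss, ms_drbd) are invented for illustration; the real ones come from `crm configure show`:

```
group g_services p_clusterip p_oracle p_jboss \
        meta migration-threshold="1" resource-stickiness="INFINITY"
colocation col_services_on_drbd inf: g_services ms_drbd:Master
order ord_drbd_before_services inf: ms_drbd:promote g_services:start
```

With migration-threshold="1", a single failure of any group member forbids the current node for that resource, and since the members are grouped (colocated and ordered as a unit), the whole group moves. Remember that failcounts must be cleared (e.g. with `crm resource cleanup`) before the group is allowed back on the failed node.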
[Pacemaker] HA setup need to configure node transition if a resouce fails.
Dear All,

I have configured DRBD and Pacemaker for a highly available setup with two nodes. Each node has oracle and jboss resources along with a ClusterIP resource. I have set resource stickiness to INFINITY to keep services running on a stable node, but I also need to make sure that if any one service fails, the entire set of services moves to the other node. I have added colocation and ordering constraints to ensure they all run on a single node at a time. When I fail one service, such as jboss, it just restarts on the same machine, even though I have set the resource threshold to 1. What other configuration should I make so that everything moves to the other node on a single service failure?

Thanks and Regards,
Manish
Re: [Pacemaker] Periodically appear non-existent nodes
On 04/17/2012 03:46 PM, ruslan usifov wrote:
> 2012/4/17 Andreas Kurz:
>> On 04/14/2012 11:14 PM, ruslan usifov wrote:
>>> Hello
>>>
>>> I removed 2 nodes from the cluster with the following sequence:
>>>
>>> crm_node --force -R node1
>>> crm_node --force -R node2
>>> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
>>> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
>>> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
>>> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
>>>
>>> The nodes are deleted after this, but if, for example, I restart (reboot)
>>> one of the remaining nodes in the working cluster, the deleted nodes
>>> appear again in OFFLINE state.

I had this problem some time ago. I "solved" it with something like this:

crm node delete NODENAME
crm_node --force --remove NODENAME
cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Periodically appear non-existent nodes
2012/4/17 Andreas Kurz:
> On 04/14/2012 11:14 PM, ruslan usifov wrote:
>> Hello
>>
>> I removed 2 nodes from the cluster with the following sequence:
>>
>> crm_node --force -R node1
>> crm_node --force -R node2
>> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
>> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
>> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
>> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
>>
>> The nodes are deleted after this, but if, for example, I restart (reboot)
>> one of the remaining nodes in the working cluster, the deleted nodes
>> appear again in OFFLINE state.
>
> Just to double check ... corosync was already stopped (on these
> to-be-deleted nodes) prior to the deletion, and it's still stopped on the
> removed nodes? ... and no cman involved?

These nodes are not physically present any more :-)) (we removed them from the network), so there is no corosync, no cman, nothing else.
Re: [Pacemaker] primitive resource start timeout ignored by monitor-operation
On 04/17/2012 12:41 PM, Rainer Maier wrote:
> hi,
>
> this is my first post to this list, therefore I ask you to be lenient
> towards me.
>
> My problem is that I configured a primitive resource like this:
>
> primitive p_fuseesb_cellx ocf:thales:fuseesb \
>         params instance="cell1" fuseesb_home="/usr/lib/fuseesb" \
>                javahome="/usr/lib/jdk1.6.0_31" \
>         op monitor interval="60s" timeout="45s" \
>         op start interval="0" timeout="45s" \
>         op stop interval="0" timeout="20s"
>
> Now when I start the resource from crm, it gets started, and immediately
> it gets stopped and restarted. This happens in a cycle every 1-2 seconds.
>
> In the corosync log I get the following output:
>
> Apr 17 10:48:46 c6 lrmd: [28224]: info: operation start[1538] on p_fuseesb_cellx for client 28227: pid 27751 exited with return code 0
> Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation p_fuseesb_cellx_start_0 (call=1538, rc=0, cib-update=1633, confirmed=true) ok
> Apr 17 10:48:46 c6 crmd: [28227]: info: do_lrm_rsc_op: Performing key=1:1017:0:084c0a4a-562e-46b2-bd13-df30802c2bd5 op=p_fuseesb_cellx_monitor_6 )
> Apr 17 10:48:46 c6 lrmd: [28224]: info: rsc:p_fuseesb_cellx monitor[1539] (pid 27830)
> Apr 17 10:48:46 c6 lrmd: [28224]: info: operation monitor[1539] on p_fuseesb_cellx for client 28227: pid 27830 exited with return code 7
> Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation p_fuseesb_cellx_monitor_6 (call=1539, rc=7, cib-update=1634, confirmed=false) not running
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_ais_dispatch: Update relayed from c7
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_local_callback: Expanded fail-count-p_fuseesb_cellx=value++ to 225
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_fuseesb_cellx (225)
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_perform_update: Sent update 2420: fail-count-p_fuseesb_cellx=225
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_ais_dispatch: Update relayed from c7
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-p_fuseesb_cellx (1334652551)
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_perform_update: Sent update 2422: last-failure-p_fuseesb_cellx=1334652551
> Apr 17 10:48:46 c6 lrmd: [28224]: info: cancel_op: operation monitor[1539] on p_fuseesb_cellx for client 28227, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] fuseesb_home=[/usr/lib/fuseesb] CRM_meta_timeout=[45000] CRM_meta_interval=[6] javahome=[/usr/lib/jdk1.6.0_31] instance=[cell1] cancelled
> Apr 17 10:48:46 c6 crmd: [28227]: info: do_lrm_rsc_op: Performing key=2:1019:0:084c0a4a-562e-46b2-bd13-df30802c2bd5 op=p_fuseesb_cellx_stop_0 )
> Apr 17 10:48:46 c6 lrmd: [28224]: info: rsc:p_fuseesb_cellx stop[1540] (pid 27897)
> Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation p_fuseesb_cellx_monitor_6 (call=1539, status=1, cib-update=0, confirmed=true) Cancelled
> Apr 17 10:48:46 c6 lrmd: [28224]: info: RA output: (p_fuseesb_cellx:stop:stdout) Stop FUSE ESB: fuse-esb
>
> From what I can see, the monitor operation is started immediately after
> the start operation. As the start operation has not finished, the monitor
> detects that the resource is not running, and therefore the resource gets
> immediately stopped and restarted - the cycle starts from the beginning.
>
> What I don't understand is: why does Pacemaker ignore the timeouts defined?

You already correctly identified the problem: your resource agent returns too early on start ... as this is your own RA, it should be quite easy for you to fix. The timeouts for start and stop are only the maximum time to wait for a response from the resource agent ... if it returns earlier, fine.

There is a workaround for "buggy" scripts: you could add a "start-delay" to the monitor operation ... but better fix your script.

Regards,
Andreas

--
Need help with Pacemaker? http://www.hastexo.com/now
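The fix Andreas suggests can be sketched as a "start" action that blocks until the service actually answers its own "monitor" probe, so the first cluster monitor operation cannot race the startup. The probe and the simulated service below are placeholders (assumptions), not the real ocf:thales:fuseesb internals:

```shell
#!/bin/sh
# Hedged sketch: start() only reports success once monitor() succeeds.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7
STATE_FILE="/tmp/fuseesb_demo.started"

fuseesb_monitor() {
    # Placeholder probe: a real RA would check a PID file, a TCP port,
    # or an application status command.
    [ -f "$STATE_FILE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

fuseesb_start() {
    # Launch the service asynchronously (simulated by a delayed file touch).
    ( sleep 1; touch "$STATE_FILE" ) &

    # Poll until the service answers, staying under the op start timeout.
    # Only then report success back to the lrmd.
    retries=30
    while [ $retries -gt 0 ]; do
        fuseesb_monitor && return $OCF_SUCCESS
        sleep 1
        retries=$((retries - 1))
    done
    return 1   # generic error: start did not complete within the timeout
}

rm -f "$STATE_FILE"
fuseesb_start
echo "start rc=$?"   # prints: start rc=0
```

Because start does not return until monitor would, the first scheduled monitor finds the resource running and the restart loop from the logs above disappears.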
[Pacemaker] primitive resource start timeout ignored by monitor-operation
hi,

this is my first post to this list, therefore I ask you to be lenient towards me.

My problem is that I configured a primitive resource like this:

primitive p_fuseesb_cellx ocf:thales:fuseesb \
        params instance="cell1" fuseesb_home="/usr/lib/fuseesb" \
               javahome="/usr/lib/jdk1.6.0_31" \
        op monitor interval="60s" timeout="45s" \
        op start interval="0" timeout="45s" \
        op stop interval="0" timeout="20s"

Now when I start the resource from crm, it gets started, and immediately it gets stopped and restarted. This happens in a cycle every 1-2 seconds.

In the corosync log I get the following output:

Apr 17 10:48:46 c6 lrmd: [28224]: info: operation start[1538] on p_fuseesb_cellx for client 28227: pid 27751 exited with return code 0
Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation p_fuseesb_cellx_start_0 (call=1538, rc=0, cib-update=1633, confirmed=true) ok
Apr 17 10:48:46 c6 crmd: [28227]: info: do_lrm_rsc_op: Performing key=1:1017:0:084c0a4a-562e-46b2-bd13-df30802c2bd5 op=p_fuseesb_cellx_monitor_6 )
Apr 17 10:48:46 c6 lrmd: [28224]: info: rsc:p_fuseesb_cellx monitor[1539] (pid 27830)
Apr 17 10:48:46 c6 lrmd: [28224]: info: operation monitor[1539] on p_fuseesb_cellx for client 28227: pid 27830 exited with return code 7
Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation p_fuseesb_cellx_monitor_6 (call=1539, rc=7, cib-update=1634, confirmed=false) not running
Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_ais_dispatch: Update relayed from c7
Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_local_callback: Expanded fail-count-p_fuseesb_cellx=value++ to 225
Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_fuseesb_cellx (225)
Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_perform_update: Sent update 2420: fail-count-p_fuseesb_cellx=225
Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_ais_dispatch: Update relayed from c7
Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-p_fuseesb_cellx (1334652551)
Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_perform_update: Sent update 2422: last-failure-p_fuseesb_cellx=1334652551
Apr 17 10:48:46 c6 lrmd: [28224]: info: cancel_op: operation monitor[1539] on p_fuseesb_cellx for client 28227, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] fuseesb_home=[/usr/lib/fuseesb] CRM_meta_timeout=[45000] CRM_meta_interval=[6] javahome=[/usr/lib/jdk1.6.0_31] instance=[cell1] cancelled
Apr 17 10:48:46 c6 crmd: [28227]: info: do_lrm_rsc_op: Performing key=2:1019:0:084c0a4a-562e-46b2-bd13-df30802c2bd5 op=p_fuseesb_cellx_stop_0 )
Apr 17 10:48:46 c6 lrmd: [28224]: info: rsc:p_fuseesb_cellx stop[1540] (pid 27897)
Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation p_fuseesb_cellx_monitor_6 (call=1539, status=1, cib-update=0, confirmed=true) Cancelled
Apr 17 10:48:46 c6 lrmd: [28224]: info: RA output: (p_fuseesb_cellx:stop:stdout) Stop FUSE ESB: fuse-esb

From what I can see, the monitor operation is started immediately after the start operation. As the start operation has not finished, the monitor detects that the resource is not running, and therefore the resource gets immediately stopped and restarted - the cycle starts from the beginning.

What I don't understand is: why does Pacemaker ignore the timeouts defined?

regards,
Rainer
Re: [Pacemaker] How to mount a iscsi device in a cluster?
On 04/17/2012 04:18 AM, cherish wrote:
> [...]

Use the "iscsi" resource agent to import an iSCSI device, with the help of open-iscsi.

Regards,
Andreas

--
Need help with Pacemaker? http://www.hastexo.com/now
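As a hedged sketch of what that advice looks like in practice (crm shell syntax; the portal, target IQN, device path, and mount point are all placeholder values): the iscsi agent logs the session in via open-iscsi, and a Filesystem resource then mounts the LUN, with a group keeping them together and ordered:

```
primitive p_iscsi ocf:heartbeat:iscsi \
        params portal="192.168.1.10:3260" \
               target="iqn.2012-04.com.example:storage.lun1" \
        op monitor interval="30s"
primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/disk/by-path/..." \
               directory="/mnt/data" fstype="ext4" \
        op monitor interval="20s"
group g_storage p_iscsi p_fs
```

The group implies both colocation and order, so the filesystem is only mounted after the iSCSI session is established, and both fail over together.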
Re: [Pacemaker] Periodically appear non-existent nodes
On 04/14/2012 11:14 PM, ruslan usifov wrote:
> Hello
>
> I removed 2 nodes from the cluster with the following sequence:
>
> crm_node --force -R node1
> crm_node --force -R node2
> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
> cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
> cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
>
> The nodes are deleted after this, but if, for example, I restart (reboot)
> one of the remaining nodes in the working cluster, the deleted nodes
> appear again in OFFLINE state.

Just to double check ... corosync was already stopped (on these to-be-deleted nodes) prior to the deletion, and it's still stopped on the removed nodes? ... and no cman involved?

Regards,
Andreas

--
Need help with Pacemaker? http://www.hastexo.com/now

> PS:
> OS: Ubuntu 10.04 (2.6.32-40)
> pacemaker 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> corosync 1.4.2