Re: [Pacemaker] [crmsh][Question] The order of resources is changed.
Hi Kristoffer, > It is possible that there is a bug in crmsh, I will investigate. > > Could you file an issue for this problem at > http://github.com/crmsh/crmsh/issues ? This would help me track the > problem. Okay! Many Thanks! Hideo Yamauchi. - Original Message - > From: Kristoffer Grönlund > To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > ; PaceMaker-ML > Cc: > Date: 2015/1/21, Wed 18:50 > Subject: Re: [Pacemaker] [crmsh][Question] The order of resources is changed. > > Hello, > > > renayama19661...@ybb.ne.jp writes: > >> Hi All, >> >> We confirmed a function of crmsh by the next combination. >> * corosync-2.3.4 >> * pacemaker-Pacemaker-1.1.12 >> * crmsh-2.1.0 >> >> By new crmsh, does "options sort-elements no" not work? >> Is there the option which does not change order elsewhere? > > > It is possible that there is a bug in crmsh, I will investigate. > > Could you file an issue for this problem at > http://github.com/crmsh/crmsh/issues ? This would help me track the > problem. > > Thank you! > > -- > // Kristoffer Grönlund > // kgronl...@suse.com > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [crmsh][Question] The order of resources is changed.
Hi All,

We verified crmsh's behaviour with the following combination:
* corosync-2.3.4
* pacemaker-Pacemaker-1.1.12
* crmsh-2.1.0

We prepared the following cli file.
---
### Cluster Option ###
property no-quorum-policy="ignore" \
	stonith-enabled="true" \
	startup-fencing="false"

### Resource Defaults ###
rsc_defaults resource-stickiness="INFINITY" \
	migration-threshold="1"

### Group Configuration ###
group grpPostgreSQLDB \
	prmExPostgreSQLDB \
	prmApPostgreSQLDB

### Group Configuration ###
group grpStonith1 \
	prmStonith1-2
group grpStonith2 \
	prmStonith2-2

### Fencing Topology ###
fencing_topology \
	rh66-01: prmStonith1-2 \
	rh66-02: prmStonith2-2

### Primitive Configuration ###
primitive prmExPostgreSQLDB ocf:heartbeat:Dummy \
	op start interval="0s" timeout="90s" on-fail="restart" \
	op monitor interval="10s" timeout="60s" on-fail="restart" \
	op stop interval="0s" timeout="60s" on-fail="fence"

primitive prmApPostgreSQLDB ocf:heartbeat:Dummy \
	op start interval="0s" timeout="300s" on-fail="restart" \
	op monitor interval="10s" timeout="60s" on-fail="restart" \
	op stop interval="0s" timeout="300s" on-fail="fence"

primitive prmStonith1-2 stonith:external/ssh \
	params \
		hostlist="rh66-01" \
	op start interval="0s" timeout="60s" on-fail="restart" \
	op monitor interval="3600s" timeout="60s" on-fail="restart" \
	op stop interval="0s" timeout="60s" on-fail="ignore"

primitive prmStonith2-2 stonith:external/ssh \
	params \
		hostlist="rh66-02" \
	op start interval="0s" timeout="60s" on-fail="restart" \
	op monitor interval="3600s" timeout="60s" on-fail="restart" \
	op stop interval="0s" timeout="60s" on-fail="ignore"

### Resource Location ###
location rsc_location-grpPostgreSQLDB-1 grpPostgreSQLDB \
	rule 200: #uname eq rh66-01 \
	rule 100: #uname eq rh66-02
location rsc_location-grpStonith1-2 grpStonith1 \
	rule -INFINITY: #uname eq rh66-01
location rsc_location-grpStonith2-3 grpStonith2 \
	rule -INFINITY: #uname eq rh66-02
---

We set "sort-elements no" and loaded the cli file in crm.
[root@rh66-01 ~]# crm options sort-elements no
[root@rh66-01 ~]# cat .config/crm/crm.conf
[core]
sort_elements = no

We expected the resources to be sent to pacemaker in the order of the cli file.

[root@rh66-01 ~]# crm configure load update trac2980.crm

However, the order of the resources seems to be changed by crm.
(We expected grpPostgreSQLDB to be displayed at the top.)

[root@rh66-01 ~]# crm_mon -1 -Af
Last updated: Thu Jan 22 02:02:15 2015
Last change: Thu Jan 22 02:01:59 2015
Stack: corosync
Current DC: rh66-01 (3232256178) - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured
4 Resources configured

Online: [ rh66-01 rh66-02 ]

 Resource Group: grpStonith2
     prmStonith2-2      (stonith:external/ssh): Started rh66-01
 Resource Group: grpStonith1
     prmStonith1-2      (stonith:external/ssh): Started rh66-02
 Resource Group: grpPostgreSQLDB
     prmExPostgreSQLDB  (ocf::heartbeat:Dummy): Started rh66-01
     prmApPostgreSQLDB  (ocf::heartbeat:Dummy): Started rh66-01

We loaded a similar cli file in a pacemaker-1.0 environment, and there the order of the resources was not changed by crm.
(grpPostgreSQLDB is displayed at the top.)

[root@rh64-heartbeat1 ~]# crm_mon -1 -Af
Last updated: Wed Jan 21 15:12:08 2015
Stack: Heartbeat
Current DC: rh64-heartbeat2 (eac5dbcb-78ae-4450-8d44-b8175e5638dd) - partition with quorum
Version: 1.0.13-9227e89
2 Nodes configured, unknown expected votes
6 Resources configured.
Online: [ rh64-heartbeat1 rh64-heartbeat2 ]

 Resource Group: grpPostgreSQLDB
     prmExPostgreSQLDB  (ocf::heartbeat:Dummy): Started rh64-heartbeat1
     prmFsPostgreSQLDB1 (ocf::heartbeat:Dummy): Started rh64-heartbeat1
     prmIpPostgreSQLDB  (ocf::heartbeat:Dummy): Started rh64-heartbeat1
     prmApPostgreSQLDB  (ocf::heartbeat:Dummy): Started rh64-heartbeat1
 Resource Group: grpStonith1
     prmStonith1-1      (stonith:external/ssh): Started rh64-heartbeat2
     prmStonith1-2      (stonith:external/ssh): Started rh64-heartbeat2
 Resource Group: grpStonith2
     prmStonith2-1      (stonith:external/ssh): Started rh64-heartbeat1
     prmStonith2-2      (stonith:external/ssh): Started rh64-heartbeat1

In the new crmsh, does "options sort-elements no" not work?
Is there another option that keeps the order unchanged?

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Patch]Memory leak of Pacemakerd.
Hi David,

I made a pull request:
* https://github.com/ClusterLabs/pacemaker/pull/620

My earlier pull request is also still waiting for your review. Please confirm it:
* https://github.com/ClusterLabs/pacemaker/pull/594

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: "renayama19661...@ybb.ne.jp"
> To: David Vossel ; The Pacemaker cluster resource manager
> Cc:
> Date: 2014/12/23, Tue 00:07
> Subject: Re: [Pacemaker] [Patch]Memory leak of Pacemakerd.
>
> Hi David,
>
> Okay.
> I will send pullrequest the day after tomorrow.
>
> Many Thanks!
> Hideo Yamauchi.
>
> - Original Message -
>> From: David Vossel
>> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager
>> Cc:
>> Date: 2014/12/22, Mon 23:43
>> Subject: Re: [Pacemaker] [Patch]Memory leak of Pacemakerd.
>>
>> - Original Message -
>>> Hi All,
>>>
>>> Whenever a node to constitute a cluster repeats start and a stop, Pacemakerd
>>> of the node not to stop leaks out memory.
>>> I attached a patch.
>>
>> this patch looks correct. Can you create a pull request to our master branch?
>> https://github.com/ClusterLabs/pacemaker
>>
>> -- Vossel
>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Patch]Memory leak of Pacemakerd.
Hi David, Okay. I will send pullrequest the day after tomorrow. Many Thanks! Hideo Yamauchi. - Original Message - > From: David Vossel > To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > > Cc: > Date: 2014/12/22, Mon 23:43 > Subject: Re: [Pacemaker] [Patch]Memory leak of Pacemakerd. > > > > - Original Message - >> Hi All, >> >> Whenever a node to constitute a cluster repeats start and a stop, > Pacemakerd >> of the node not to stop leaks out memory. >> I attached a patch. > > this patch looks correct. Can you create a pull request to our master branch? > https://github.com/ClusterLabs/pacemaker > > -- Vossel > > >> Best Regards, >> Hideo Yamauchi. >> ___ >> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >> Bugs: http://bugs.clusterlabs.org >> > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Patch]Memory leak of Pacemakerd.
Hi All,

Whenever a node in the cluster repeatedly starts and stops, pacemakerd on the node that stays up leaks memory.
I attached a patch.

Best Regards,
Hideo Yamauchi.

pacemakerd.patch
Description: Binary data
[Pacemaker] [Problem] The crmd reboots by the parameter mistake of the cibadmin command.
Hi All,

Our user ran the cibadmin command by mistake. This operation error causes crmd to dump core and restart.

Step 1) Start the cluster.

[root@rh70-node1 ~]# crm_mon -1 -Af
Last updated: Wed Nov 5 10:26:51 2014
Last change: Wed Nov 5 10:23:39 2014
Stack: corosync
Current DC: rh70-node1 (3232238160) - partition WITHOUT quorum
Version: 1.1.12-85c093e
1 Nodes configured
0 Resources configured

Online: [ rh70-node1 ]

Node Attributes:
* Node rh70-node1:

Migration summary:
* Node rh70-node1:

Step 2) A user adds a node with an incorrect specification.

cibadmin -C -o nodes -X ''

The crmd dumps core and restarts.

Nov 5 10:28:17 rh70-node1 cib[2167]: info: cib_process_request: Forwarding cib_create operation for section nodes to master (origin=local/cibadmin/2)
Nov 5 10:28:17 rh70-node1 cib[2167]: info: cib_perform_op: Diff: --- 0.2.7 2
Nov 5 10:28:17 rh70-node1 cib[2167]: info: cib_perform_op: Diff: +++ 0.3.0 92153f86c58ed569196d946612f0dab8
Nov 5 10:28:17 rh70-node1 cib[2167]: info: cib_perform_op: + /cib: @epoch=3, @num_updates=0
Nov 5 10:28:17 rh70-node1 cib[2167]: info: cib_perform_op: ++ /cib/configuration/nodes:
Nov 5 10:28:17 rh70-node1 cib[2167]: info: cib_process_request: Completed cib_create operation for section nodes: OK (rc=0, origin=rh70-node1/cibadmin/2, version=0.3.0)
Nov 5 10:28:17 rh70-node1 crmd[2172]: error: crm_int_helper: Characters left over after parsing 'hpg604': 'hpg604'
Nov 5 10:28:17 rh70-node1 crmd[2172]: error: crm_abort: crm_find_peer: Triggered fatal assert at membership.c:338 : id > 0 || uname != NULL
Nov 5 10:28:17 rh70-node1 cib[2223]: info: write_cib_contents: Archived previous version as /var/lib/pacemaker/cib/cib-2.raw
Nov 5 10:28:17 rh70-node1 cib[2223]: info: write_cib_contents: Wrote version 0.3.0 of the CIB to disk (digest: fd92fe00a0f0478246b1c9f1d2be83a8)
Nov 5 10:28:17 rh70-node1 cib[2223]: info: retrieveCib: Reading cluster configuration from: /var/lib/pacemaker/cib/cib.CARj72 (digest: /var/lib/pacemaker/cib/cib.XK4ybJ)
Nov 5 10:28:17 rh70-node1 abrt-hook-ccpp: Saved core dump of pid 2172 (/usr/libexec/pacemaker/crmd) to /var/tmp/abrt/ccpp-2014-11-05-10:28:17-2172 (18141184 bytes)
Nov 5 10:28:18 rh70-node1 abrt-server: Executable '/usr/libexec/pacemaker/crmd' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Nov 5 10:28:18 rh70-node1 abrt-server: 'post-create' on '/var/tmp/abrt/ccpp-2014-11-05-10:28:17-2172' exited with 1
Nov 5 10:28:18 rh70-node1 abrt-server: Deleting problem directory '/var/tmp/abrt/ccpp-2014-11-05-10:28:17-2172'
Nov 5 10:28:18 rh70-node1 pacemakerd[2166]: error: child_waitpid: Managed process 2172 (crmd) dumped core
Nov 5 10:28:18 rh70-node1 pacemakerd[2166]: error: pcmk_child_exit: The crmd process (2172) terminated with signal 6 (core=1)
Nov 5 10:28:18 rh70-node1 pacemakerd[2166]: notice: pcmk_process_exit: Respawning failed child process: crmd
Nov 5 10:28:18 rh70-node1 pacemakerd[2166]: info: start_child: Using uid=992 and group=990 for process crmd
Nov 5 10:28:18 rh70-node1 pacemakerd[2166]: info: start_child: Forked child 2228 for process crmd
Nov 5 10:28:18 rh70-node1 crmd[2228]: info: crm_log_init: Changed active directory to /usr/var/lib/heartbeat/cores/hacluster
Nov 5 10:28:18 rh70-node1 crmd[2228]: notice: main: CRM Git Version: 85c093e
Nov 5 10:28:18 rh70-node1 crmd[2228]: info: do_log: FSA: Input I_STARTUP from crmd_init() received in state S_STARTING
Nov 5 10:28:18 rh70-node1 crmd[2228]: info: get_cluster_type: Verifying cluster type: 'corosync'

This is a user operation error, but it is not desirable for crmd to restart because of it.
We request an improvement so that crmd does not restart.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] Error message of crm_failcount is not right.
Hi Andrew,

> I would suggest neither is correct.
> I've changed it to:
>
> [root@rh70-node1 ~]# crm_failcount
> You must supply a resource name to check. See 'crm_failcount --help' for details

That message is clear and helpful.

Many Thanks!
Hideo Yamauchi.

- Original Message -
> From: Andrew Beekhof
> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager
> Cc:
> Date: 2014/11/5, Wed 11:42
> Subject: Re: [Pacemaker] [Problem] Error message of crm_failcount is not right.
>
>> On 5 Nov 2014, at 1:05 pm, renayama19661...@ybb.ne.jp wrote:
>>
>> Hi All,
>>
>> The next error is displayed when I carry out crm_failcount of Pacemaker.
>>
>> [root@rh70-node1 ~]# crm_failcount
>> error: crm_abort: read_attr_delegate: Triggered assert at cib_attrs.c:342 : attr_name != NULL || attr_id != NULL
>>
>> However, I think that the next error should be displayed.
>>
>> [root@rh70-node1 ~]# crm_failcount
>> scope=status value=(null)
>>
>> Error performing operation: Invalid argument
>
> I would suggest neither is correct.
> I've changed it to:
>
> [root@rh70-node1 ~]# crm_failcount
> You must supply a resource name to check. See 'crm_failcount --help' for details
>
>> I request a correction to inform an operation error of the user definitely.
>>
>> Best Regards,
>> Hideo Yamauchi.
[Pacemaker] [Problem] Error message of crm_failcount is not right.
Hi All,

The following error is displayed when I run crm_failcount of Pacemaker with no arguments:

[root@rh70-node1 ~]# crm_failcount
error: crm_abort: read_attr_delegate: Triggered assert at cib_attrs.c:342 : attr_name != NULL || attr_id != NULL

However, I think the following error should be displayed instead:

[root@rh70-node1 ~]# crm_failcount
scope=status value=(null)
Error performing operation: Invalid argument

I request a correction so that the tool clearly reports the user's operation error.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew, The problem was settled with your patch. Please merge a patch into master. Please confirm whether there is not a problem in other points either concerning g_timeout_add() and g_source_remove() if possible. Many Thanks! Hideo Yamauchi. - Original Message - > From: "renayama19661...@ybb.ne.jp" > To: The Pacemaker cluster resource manager > Cc: > Date: 2014/10/10, Fri 15:34 > Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, > g_source_remove fails. > > Hi Andrew, > > Thank you for comments. > >> diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c >> index 961ff18..2279e4e 100644 >> --- a/lib/services/services_linux.c >> +++ b/lib/services/services_linux.c >> @@ -227,6 +227,7 @@ recurring_action_timer(gpointer data) >> op->stdout_data = NULL; >> free(op->stderr_data); >> op->stderr_data = NULL; >> + op->opaque->repeat_timer = 0; >> >> services_action_async(op, NULL); >> return FALSE; > > > I confirm a correction again. > > > > Many Thanks! > Hideo Yamauchi. > > > > - Original Message - >> From: Andrew Beekhof >> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > >> Cc: >> Date: 2014/10/10, Fri 15:19 >> Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of > glib, g_source_remove fails. 
>> >> /me slaps forhead >> >> this one should work >> >> diff --git a/lib/services/services.c b/lib/services/services.c >> index 8590b56..753e257 100644 >> --- a/lib/services/services.c >> +++ b/lib/services/services.c >> @@ -313,6 +313,7 @@ services_action_free(svc_action_t * op) >> >> if (op->opaque->repeat_timer) { >> g_source_remove(op->opaque->repeat_timer); >> + op->opaque->repeat_timer = 0; >> } >> if (op->opaque->stderr_gsource) { >> mainloop_del_fd(op->opaque->stderr_gsource); >> @@ -425,6 +426,7 @@ services_action_kick(const char *name, const char > *action, >> int interval /* ms */ >> } else { >> if (op->opaque->repeat_timer) { >> g_source_remove(op->opaque->repeat_timer); >> + op->opaque->repeat_timer = 0; >> } >> recurring_action_timer(op); >> return TRUE; >> @@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void >> (*action_callback) (svc_actio >> if (dup->pid != 0) { >> if (op->opaque->repeat_timer) { >> g_source_remove(op->opaque->repeat_timer); >> + op->opaque->repeat_timer = 0; >> } >> recurring_action_timer(dup); >> } >> diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c >> index 961ff18..2279e4e 100644 >> --- a/lib/services/services_linux.c >> +++ b/lib/services/services_linux.c >> @@ -227,6 +227,7 @@ recurring_action_timer(gpointer data) >> op->stdout_data = NULL; >> free(op->stderr_data); >> op->stderr_data = NULL; >> + op->opaque->repeat_timer = 0; >> >> services_action_async(op, NULL); >> return FALSE; >> >> >> On 10 Oct 2014, at 4:45 pm, renayama19661...@ybb.ne.jp wrote: >> >>> Hi Andrew, >>> >>> I applied three corrections that you made and checked movement. >>> I picked all "abort" processing with g_source_remove() of >> services.c just to make sure. >>> * I set following "abort" in four places that carried out >> g_source_remove >>> >> if > (g_source_remove(op->opaque->repeat_timer) == >> FALSE) { >> abort(); >> } >>> >>> >>> As a result, "abort" still occurred. 
>>> >>> >>> The problem does not seem to be yet settled by your correction. >>> >>> >>> (gdb) where >>> #0 0x7fdd923e1f79 in __GI_raise (sig=sig@entry=6) at >> ../nptl/sysdeps/unix/sysv/linux/raise.c:56 >>> #1 0x7fdd923e5388 in __GI_abort () at abort.c:89 >>> #2 0x7fdd92b9fe77 in crm_abort (file=file@entry=0x7fdd92bd352b >> "logging.c", >>> function=function@entry=0x7fdd92bd48c0 <__FUNCTION__.23262> >> "crm_glib_handler", line=line@entry=73, >>> assert_condition=assert_condition@entry=0xe20b80 "Source ID > 40 was >> not found when attempting to remove it", do_core=do_core@entry=1, >>> do_fork=, do_fork@entry=1) at utils.c:1195 >>> #3 0x7fdd92bc7ca7 in crm_glib_handler (log_domain=0x7fdd92130b6e >> "GLib", flags=, >>> message=0xe20b80 "Source ID 40 was not found when attempting > to >> remove it", user_data=) at logging.c:73 >>> #4 0x7fdd920f2ae1 in g_logv () from >> /lib/x86_64-linux-gnu/libglib-2.0.so.0 >>> #5 0x7fdd920f2d72 in g_log () from >> /lib/x86_64-linux-gnu/libglib-2.0.so.0 >>> #6 0x7fdd920eac5c in g_source_remove () from >> /lib/x86_64-linux-gnu/libglib-2.0.so.0 >>> #7 0x7fdd92984b55 in cancel_recurring_action > (op=op@entry=0xe19b90) at >> serv
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew,

Thank you for comments.

> diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c
> index 961ff18..2279e4e 100644
> --- a/lib/services/services_linux.c
> +++ b/lib/services/services_linux.c
> @@ -227,6 +227,7 @@ recurring_action_timer(gpointer data)
>      op->stdout_data = NULL;
>      free(op->stderr_data);
>      op->stderr_data = NULL;
> +    op->opaque->repeat_timer = 0;
>
>      services_action_async(op, NULL);
>      return FALSE;

I will confirm the correction again.

Many Thanks!
Hideo Yamauchi.

- Original Message -
> From: Andrew Beekhof
> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager
> Cc:
> Date: 2014/10/10, Fri 15:19
> Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
>
> /me slaps forhead
>
> this one should work
>
> diff --git a/lib/services/services.c b/lib/services/services.c
> index 8590b56..753e257 100644
> --- a/lib/services/services.c
> +++ b/lib/services/services.c
> @@ -313,6 +313,7 @@ services_action_free(svc_action_t * op)
>
>      if (op->opaque->repeat_timer) {
>          g_source_remove(op->opaque->repeat_timer);
> +        op->opaque->repeat_timer = 0;
>      }
>      if (op->opaque->stderr_gsource) {
>          mainloop_del_fd(op->opaque->stderr_gsource);
> @@ -425,6 +426,7 @@ services_action_kick(const char *name, const char *action, int interval /* ms */
>      } else {
>          if (op->opaque->repeat_timer) {
>              g_source_remove(op->opaque->repeat_timer);
> +            op->opaque->repeat_timer = 0;
>          }
>          recurring_action_timer(op);
>          return TRUE;
> @@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void (*action_callback) (svc_actio
>      if (dup->pid != 0) {
>          if (op->opaque->repeat_timer) {
>              g_source_remove(op->opaque->repeat_timer);
> +            op->opaque->repeat_timer = 0;
>          }
>          recurring_action_timer(dup);
>      }
> diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c
> index 961ff18..2279e4e 100644
> --- a/lib/services/services_linux.c
> +++ b/lib/services/services_linux.c
> @@ -227,6 +227,7 @@ recurring_action_timer(gpointer data)
>      op->stdout_data = NULL;
>      free(op->stderr_data);
>      op->stderr_data = NULL;
> +    op->opaque->repeat_timer = 0;
>
>      services_action_async(op, NULL);
>      return FALSE;
>
>
> On 10 Oct 2014, at 4:45 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi Andrew,
>>
>> I applied three corrections that you made and checked movement.
>> I picked all "abort" processing with g_source_remove() of services.c just to make sure.
>> * I set following "abort" in four places that carried out g_source_remove
>>
>>     if (g_source_remove(op->opaque->repeat_timer) == FALSE) {
>>         abort();
>>     }
>>
>> As a result, "abort" still occurred.
>>
>> The problem does not seem to be yet settled by your correction.
>>
>> (gdb) where
>> #0  0x7fdd923e1f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1  0x7fdd923e5388 in __GI_abort () at abort.c:89
>> #2  0x7fdd92b9fe77 in crm_abort (file=file@entry=0x7fdd92bd352b "logging.c",
>>     function=function@entry=0x7fdd92bd48c0 <__FUNCTION__.23262> "crm_glib_handler", line=line@entry=73,
>>     assert_condition=assert_condition@entry=0xe20b80 "Source ID 40 was not found when attempting to remove it", do_core=do_core@entry=1,
>>     do_fork=<optimized out>, do_fork@entry=1) at utils.c:1195
>> #3  0x7fdd92bc7ca7 in crm_glib_handler (log_domain=0x7fdd92130b6e "GLib", flags=<optimized out>,
>>     message=0xe20b80 "Source ID 40 was not found when attempting to remove it", user_data=<optimized out>) at logging.c:73
>> #4  0x7fdd920f2ae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
>> #5  0x7fdd920f2d72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
>> #6  0x7fdd920eac5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
>> #7  0x7fdd92984b55 in cancel_recurring_action (op=op@entry=0xe19b90) at services.c:365
>> #8  0x7fdd92984bee in services_action_cancel (name=name@entry=0xe1d2d0 "dummy2", action=<optimized out>, interval=interval@entry=1)
>>     at services.c:387
>> #9  0x0040405a in cancel_op (rsc_id=rsc_id@entry=0xe1d2d0 "dummy2", action=action@entry=0xe10d90 "monitor", interval=1)
>>     at lrmd.c:1404
>> #10 0x0040614f in process_lrmd_rsc_cancel (client=0xe17290, id=74, request=0xe1be10) at lrmd.c:1468
>> #11 process_lrmd_message (client=client@entry=0xe17290, id=74, request=request@entry=0xe1be10) at lrmd.c:1507
>> #12 0x00402bac in lrmd_ipc_dispatch (c=0xe169c0, data=<optimized out>, size=361) at main.c:148
>> #13 0x7fdd91e4d4d9 in qb_ipcs_dispatch_connection_request () from /usr/lib/libqb.so.0
>>
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew, I applied three corrections that you made and checked movement. I picked all "abort" processing with g_source_remove() of services.c just to make sure. * I set following "abort" in four places that carried out g_source_remove >>> if (g_source_remove(op->opaque->repeat_timer) == FALSE) > { >>> abort(); >>> } As a result, "abort" still occurred. The problem does not seem to be yet settled by your correction. (gdb) where #0 0x7fdd923e1f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7fdd923e5388 in __GI_abort () at abort.c:89 #2 0x7fdd92b9fe77 in crm_abort (file=file@entry=0x7fdd92bd352b "logging.c", function=function@entry=0x7fdd92bd48c0 <__FUNCTION__.23262> "crm_glib_handler", line=line@entry=73, assert_condition=assert_condition@entry=0xe20b80 "Source ID 40 was not found when attempting to remove it", do_core=do_core@entry=1, do_fork=, do_fork@entry=1) at utils.c:1195 #3 0x7fdd92bc7ca7 in crm_glib_handler (log_domain=0x7fdd92130b6e "GLib", flags=, message=0xe20b80 "Source ID 40 was not found when attempting to remove it", user_data=) at logging.c:73 #4 0x7fdd920f2ae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 #5 0x7fdd920f2d72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 #6 0x7fdd920eac5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 #7 0x7fdd92984b55 in cancel_recurring_action (op=op@entry=0xe19b90) at services.c:365 #8 0x7fdd92984bee in services_action_cancel (name=name@entry=0xe1d2d0 "dummy2", action=, interval=interval@entry=1) at services.c:387 #9 0x0040405a in cancel_op (rsc_id=rsc_id@entry=0xe1d2d0 "dummy2", action=action@entry=0xe10d90 "monitor", interval=1) at lrmd.c:1404 #10 0x0040614f in process_lrmd_rsc_cancel (client=0xe17290, id=74, request=0xe1be10) at lrmd.c:1468 #11 process_lrmd_message (client=client@entry=0xe17290, id=74, request=request@entry=0xe1be10) at lrmd.c:1507 #12 0x00402bac in lrmd_ipc_dispatch (c=0xe169c0, data=, size=361) at main.c:148 #13 
0x7fdd91e4d4d9 in qb_ipcs_dispatch_connection_request () from /usr/lib/libqb.so.0 #14 0x7fdd92bc409d in gio_read_socket (gio=, condition=G_IO_IN, data=0xe158a8) at mainloop.c:437 #15 0x7fdd920ebce5 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 ---Type to continue, or q to quit--- #16 0x7fdd920ec048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 #17 0x7fdd920ec30a in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 #18 0x00402774 in main (argc=, argv=0x7fff22cac268) at main.c:344 Best Regards, Hideo Yamauchi. - Original Message - > From: "renayama19661...@ybb.ne.jp" > To: Andrew Beekhof ; The Pacemaker cluster resource > manager > Cc: > Date: 2014/10/10, Fri 10:55 > Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, > g_source_remove fails. > > Hi Andrew, > > Okay! > > I test your patch. > And I inform you of a result. > > Many thanks! > Hideo Yamauchi. > > > > - Original Message - >> From: Andrew Beekhof >> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > >> Cc: >> Date: 2014/10/10, Fri 10:47 >> Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of > glib, g_source_remove fails. >> >> Perfect! 
>> >> Can you try this: >> >> diff --git a/lib/services/services.c b/lib/services/services.c >> index 8590b56..cb0f0ae 100644 >> --- a/lib/services/services.c >> +++ b/lib/services/services.c >> @@ -417,6 +417,7 @@ services_action_kick(const char *name, const char > *action, >> int interval /* ms */ >> free(id); >> >> if (op == NULL) { >> + op->opaque->repeat_timer = 0; >> return FALSE; >> } >> >> @@ -425,6 +426,7 @@ services_action_kick(const char *name, const char > *action, >> int interval /* ms */ >> } else { >> if (op->opaque->repeat_timer) { >> g_source_remove(op->opaque->repeat_timer); >> + op->opaque->repeat_timer = 0; >> } >> recurring_action_timer(op); >> return TRUE; >> @@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void >> (*action_callback) (svc_actio >> if (dup->pid != 0) { >> if (op->opaque->repeat_timer) { >> g_source_remove(op->opaque->repeat_timer); >> + op->opaque->repeat_timer = 0; >> } >> recurring_action_timer(dup); >> } >> >> >> On 10 Oct 2014, at 12:16 pm, renayama19661...@ybb.ne.jp wrote: >> >>> Hi Andrew, >>> >>> Setting of gdb of the Ubuntu environment does not yet go well and I > touch >> lrmd and cannot acquire trace. >>> Please wait for this a little more. >>> >>> >>> But.. I let lrmd terminate abnormally when g_source_rem
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew, Okay! I test your patch. And I inform you of a result. Many thanks! Hideo Yamauchi. - Original Message - > From: Andrew Beekhof > To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > > Cc: > Date: 2014/10/10, Fri 10:47 > Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, > g_source_remove fails. > > Perfect! > > Can you try this: > > diff --git a/lib/services/services.c b/lib/services/services.c > index 8590b56..cb0f0ae 100644 > --- a/lib/services/services.c > +++ b/lib/services/services.c > @@ -417,6 +417,7 @@ services_action_kick(const char *name, const char > *action, > int interval /* ms */ > free(id); > > if (op == NULL) { > + op->opaque->repeat_timer = 0; > return FALSE; > } > > @@ -425,6 +426,7 @@ services_action_kick(const char *name, const char > *action, > int interval /* ms */ > } else { > if (op->opaque->repeat_timer) { > g_source_remove(op->opaque->repeat_timer); > + op->opaque->repeat_timer = 0; > } > recurring_action_timer(op); > return TRUE; > @@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void > (*action_callback) (svc_actio > if (dup->pid != 0) { > if (op->opaque->repeat_timer) { > g_source_remove(op->opaque->repeat_timer); > + op->opaque->repeat_timer = 0; > } > recurring_action_timer(dup); > } > > > On 10 Oct 2014, at 12:16 pm, renayama19661...@ybb.ne.jp wrote: > >> Hi Andrew, >> >> Setting of gdb of the Ubuntu environment does not yet go well and I touch > lrmd and cannot acquire trace. >> Please wait for this a little more. >> >> >> But.. I let lrmd terminate abnormally when g_source_remove() of > cancel_recurring_action() returned FALSE. 
>> - >> gboolean >> cancel_recurring_action(svc_action_t * op) >> { >> crm_info("Cancelling operation %s", op->id); >> >> if (recurring_actions) { >> g_hash_table_remove(recurring_actions, op->id); >> } >> >> if (op->opaque->repeat_timer) { >> if (g_source_remove(op->opaque->repeat_timer) == FALSE) { >> abort(); >> } >> (snip) >> ---core >> #0 0x7f30aa60ff79 in __GI_raise (sig=sig@entry=6) at > ../nptl/sysdeps/unix/sysv/linux/raise.c:56 >> >> 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. >> (gdb) where >> #0 0x7f30aa60ff79 in __GI_raise (sig=sig@entry=6) at > ../nptl/sysdeps/unix/sysv/linux/raise.c:56 >> #1 0x7f30aa613388 in __GI_abort () at abort.c:89 >> #2 0x7f30aadcde77 in crm_abort (file=file@entry=0x7f30aae0152b > "logging.c", >> function=function@entry=0x7f30aae028c0 <__FUNCTION__.23262> > "crm_glib_handler", line=line@entry=73, >> assert_condition=assert_condition@entry=0x19d2ad0 "Source ID 63 > was not found when attempting to remove it", do_core=do_core@entry=1, >> do_fork=, do_fork@entry=1) at utils.c:1195 >> #3 0x7f30aadf5ca7 in crm_glib_handler (log_domain=0x7f30aa35eb6e > "GLib", flags=, >> message=0x19d2ad0 "Source ID 63 was not found when attempting to > remove it", user_data=) at logging.c:73 >> #4 0x7f30aa320ae1 in g_logv () from > /lib/x86_64-linux-gnu/libglib-2.0.so.0 >> #5 0x7f30aa320d72 in g_log () from > /lib/x86_64-linux-gnu/libglib-2.0.so.0 >> #6 0x7f30aa318c5c in g_source_remove () from > /lib/x86_64-linux-gnu/libglib-2.0.so.0 >> #7 0x7f30aabb2b55 in cancel_recurring_action (op=op@entry=0x19caa90) > at services.c:363 >> #8 0x7f30aabb2bee in services_action_cancel (name=name@entry=0x19d0530 > "dummy3", action=, interval=interval@entry=1) >> at services.c:385 >> #9 0x0040405a in cancel_op (rsc_id=rsc_id@entry=0x19d0530 > "dummy3", action=action@entry=0x19cec10 "monitor", > interval=1) >> at lrmd.c:1404 >> #10 0x0040614f in process_lrmd_rsc_cancel (client=0x19c8290, id=74, > request=0x19ca8a0) at lrmd.c:1468 >> #11 
process_lrmd_message (client=client@entry=0x19c8290, id=74, > request=request@entry=0x19ca8a0) at lrmd.c:1507 >> #12 0x00402bac in lrmd_ipc_dispatch (c=0x19c79c0, > data=, size=361) at main.c:148 >> #13 0x7f30aa07b4d9 in qb_ipcs_dispatch_connection_request () from > /usr/lib/libqb.so.0 >> #14 0x7f30aadf209d in gio_read_socket (gio=, > condition=G_IO_IN, data=0x19c68a8) at mainloop.c:437 >> #15 0x7f30aa319ce5 in g_main_context_dispatch () from > /lib/x86_64-linux-gnu/libglib-2.0.so.0 >> ---Type to continue, or q to quit--- >> #16 0x7f30aa31a048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 >> #17 0x7f30aa31a30a in g_main_loop_run () from > /lib/x86_64-linux-gnu/libglib-2.0.so.0 >> #18 0x00402774 in main (argc=, > argv=0x7fffcdd90b88) at main.c:344 >> - >> >> Best Regards, >> Hideo Yamauchi. >> >> >>
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew,

Setting up gdb in the Ubuntu environment does not go well yet, so I cannot acquire a trace from lrmd. Please wait a little longer.

But.. I made lrmd terminate abnormally when g_source_remove() in cancel_recurring_action() returned FALSE.

-
gboolean
cancel_recurring_action(svc_action_t * op)
{
    crm_info("Cancelling operation %s", op->id);

    if (recurring_actions) {
        g_hash_table_remove(recurring_actions, op->id);
    }

    if (op->opaque->repeat_timer) {
        if (g_source_remove(op->opaque->repeat_timer) == FALSE) {
            abort();
        }
(snip)
---core
#0  0x7f30aa60ff79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0  0x7f30aa60ff79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f30aa613388 in __GI_abort () at abort.c:89
#2  0x7f30aadcde77 in crm_abort (file=file@entry=0x7f30aae0152b "logging.c", function=function@entry=0x7f30aae028c0 <__FUNCTION__.23262> "crm_glib_handler", line=line@entry=73, assert_condition=assert_condition@entry=0x19d2ad0 "Source ID 63 was not found when attempting to remove it", do_core=do_core@entry=1, do_fork=<optimized out>, do_fork@entry=1) at utils.c:1195
#3  0x7f30aadf5ca7 in crm_glib_handler (log_domain=0x7f30aa35eb6e "GLib", flags=<optimized out>, message=0x19d2ad0 "Source ID 63 was not found when attempting to remove it", user_data=<optimized out>) at logging.c:73
#4  0x7f30aa320ae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x7f30aa320d72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x7f30aa318c5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x7f30aabb2b55 in cancel_recurring_action (op=op@entry=0x19caa90) at services.c:363
#8  0x7f30aabb2bee in services_action_cancel (name=name@entry=0x19d0530 "dummy3", action=<optimized out>, interval=interval@entry=1) at services.c:385
#9  0x0040405a in cancel_op (rsc_id=rsc_id@entry=0x19d0530 "dummy3", action=action@entry=0x19cec10 "monitor", interval=1) at lrmd.c:1404
#10 0x0040614f in process_lrmd_rsc_cancel (client=0x19c8290, id=74, request=0x19ca8a0) at lrmd.c:1468
#11 process_lrmd_message (client=client@entry=0x19c8290, id=74, request=request@entry=0x19ca8a0) at lrmd.c:1507
#12 0x00402bac in lrmd_ipc_dispatch (c=0x19c79c0, data=<optimized out>, size=361) at main.c:148
#13 0x7f30aa07b4d9 in qb_ipcs_dispatch_connection_request () from /usr/lib/libqb.so.0
#14 0x7f30aadf209d in gio_read_socket (gio=<optimized out>, condition=G_IO_IN, data=0x19c68a8) at mainloop.c:437
#15 0x7f30aa319ce5 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
---Type <return> to continue, or q <return> to quit---
#16 0x7f30aa31a048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#17 0x7f30aa31a30a in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x00402774 in main (argc=<optimized out>, argv=0x7fffcdd90b88) at main.c:344
-

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: "renayama19661...@ybb.ne.jp"
> To: Andrew Beekhof
> Cc: The Pacemaker cluster resource manager
> Date: 2014/10/7, Tue 11:15
> Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
>
> Hi Andrew,
>
>> Not quite. Returning FALSE from the callback also removes the source from glib.
>> So your test case effectively removes t1 twice: once implicitly by returning
>> FALSE in timer_func1() and then again explicitly in timer_func3()
>
> Your opinion is right.
>
> If Pacemaker does not remove again the timers whose callbacks already returned FALSE, glib does not return the error.
>
> Many Thanks,
> Hideo Yamauchi.
>
> - Original Message -
>> From: Andrew Beekhof
>> To: renayama19661...@ybb.ne.jp
>> Cc: The Pacemaker cluster resource manager
>> Date: 2014/10/7, Tue 11:06
>> Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
>> On 7 Oct 2014, at 1:03 pm, renayama19661...@ybb.ne.jp wrote:
>>
>>> Hi Andrew,
>>>
>>>>> These problems seem to be due to a correction of next glib somehow or other.
>>>>> * https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
>>>>
>>>> The glib behaviour on Ubuntu seems reasonable, removing a source multiple times IS a valid error.
>>>> I need the stack trace to know where/how this situation can occur in pacemaker.
>>>
>>> Pacemaker does not remove resources several times as far as I confirmed it.
>>> In Ubuntu(glib2.40), an error occurs just to remove resources first.
>>
>> Not quite. Returning FALSE from the callback also removes the source from glib.
>> So your test case effectively removes t1 twice: once implicitly by returning FALSE in timer_func1() and then again explicitly in timer_func3()
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew, > Not quite. Returning FALSE from the callback also removes the source from > glib. > So your test case effectively removes t1 twice: once implicitly by returning > FALSE in timer_func1() and then again explicitly in timer_func3() Your opinion is right. If Pacemaker repeats and does not remove the resources which timer concluded in FALSE, glib does not return the error. Many Thanks, Hideo Yamauchi. - Original Message - > From: Andrew Beekhof > To: renayama19661...@ybb.ne.jp > Cc: The Pacemaker cluster resource manager > Date: 2014/10/7, Tue 11:06 > Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, > g_source_remove fails. > > > On 7 Oct 2014, at 1:03 pm, renayama19661...@ybb.ne.jp wrote: > >> Hi Andrew, >> These problems seem to be due to a correction of next glib somehow > or >>> other. * >>> > https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c >>> >>> The glib behaviour on unbuntu seems reasonable, removing a source > multiple times >>> IS a valid error. >>> I need the stack trace to know where/how this situation can occur in > pacemaker. >> >> >> Pacemaker does not remove resources several times as far as I confirmed it. >> In Ubuntu(glib2.40), an error occurs just to remove resources first. > > Not quite. Returning FALSE from the callback also removes the source from > glib. > So your test case effectively removes t1 twice: once implicitly by returning > FALSE in timer_func1() and then again explicitly in timer_func3() > >> >> Confirmation and the deletion of resources seem to be necessary not to > produce an error in Ubuntu. >> And this works well in glib of RHEL6.x.(and RHEL7.0) >> >> if (g_main_context_find_source_by_id (NULL, t1) != NULL) { >> g_source_remove(t1); >> } >> >> I send it to you after acquiring stack trace. >> >> Many Thanks! >> Hideo Yamauchi. 
>> >> - Original Message - >>> From: Andrew Beekhof >>> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > >>> Cc: >>> Date: 2014/10/7, Tue 09:44 >>> Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of > glib, g_source_remove fails. >>> >>> >>> On 6 Oct 2014, at 4:09 pm, renayama19661...@ybb.ne.jp wrote: >>> Hi All, When I move the next sample in RHEL6.5(glib2-2.22.5-7.el6) and >>> Ubuntu14.04(libglib2.0-0:amd64 2.40.0-2), movement is different. * Sample : test2.c {{{ #include #include #include #include guint t1, t2, t3; gboolean timer_func2(gpointer data){ printf("TIMER EXPIRE!2\n"); fflush(stdout); return FALSE; } gboolean timer_func1(gpointer data){ clock_t ret; struct tms buff; ret = times(&buff); printf("TIMER EXPIRE!1 %d\n", (int)ret); fflush(stdout); return FALSE; } gboolean timer_func3(gpointer data){ printf("TIMER EXPIRE 3!\n"); fflush(stdout); printf("remove timer1!\n"); fflush(stdout); g_source_remove(t1); printf("remove timer2!\n"); fflush(stdout); g_source_remove(t2); printf("remove timer3!\n"); fflush(stdout); g_source_remove(t3); return FALSE; } int main(int argc, char** argv){ GMainLoop *m; clock_t ret; struct tms buff; gint64 t; m = g_main_new(FALSE); t1 = g_timeout_add(1000, timer_func1, NULL); t2 = g_timeout_add(6, timer_func2, NULL); t3 = g_timeout_add(5000, timer_func3, NULL); ret = times(&buff); printf("START! %d\n", (int)ret); g_main_run(m); } }}} * Result RHEL6.5(glib2-2.22.5-7.el6) [root@snmp1 ~]# ./test2 START! 429576012 TIMER EXPIRE!1 429576112 TIMER EXPIRE 3! remove timer1! remove timer2! remove timer3! Ubuntu14.04(libglib2.0-0:amd64 2.40.0-2) root@a1be102:~# ./test2 START! 1718163089 TIMER EXPIRE!1 1718163189 TIMER EXPIRE 3! remove timer1! (process:1410): GLib-CRITICAL **: Source ID 1 was not found when > attempting >>> to remove it remove timer2! remove timer3! These problems seem to be due to a correction of next glib somehow > or >>> other. 
>>>> * https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
>>>
>>> The glib behaviour on Ubuntu seems reasonable, removing a source multiple times IS a valid error.
>>> I need the stack trace to know where/how this situation can occur in pacemaker.
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew, >> These problems seem to be due to a correction of next glib somehow or > other. >> * > https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c > > The glib behaviour on unbuntu seems reasonable, removing a source multiple > times > IS a valid error. > I need the stack trace to know where/how this situation can occur in > pacemaker. Pacemaker does not remove resources several times as far as I confirmed it. In Ubuntu(glib2.40), an error occurs just to remove resources first. Confirmation and the deletion of resources seem to be necessary not to produce an error in Ubuntu. And this works well in glib of RHEL6.x.(and RHEL7.0) if (g_main_context_find_source_by_id (NULL, t1) != NULL) { g_source_remove(t1); } I send it to you after acquiring stack trace. Many Thanks! Hideo Yamauchi. - Original Message - > From: Andrew Beekhof > To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > > Cc: > Date: 2014/10/7, Tue 09:44 > Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, > g_source_remove fails. > > > On 6 Oct 2014, at 4:09 pm, renayama19661...@ybb.ne.jp wrote: > >> Hi All, >> >> When I move the next sample in RHEL6.5(glib2-2.22.5-7.el6) and > Ubuntu14.04(libglib2.0-0:amd64 2.40.0-2), movement is different. 
>> >> * Sample : test2.c >> {{{ >> #include >> #include >> #include >> #include >> guint t1, t2, t3; >> gboolean timer_func2(gpointer data){ >> printf("TIMER EXPIRE!2\n"); >> fflush(stdout); >> return FALSE; >> } >> gboolean timer_func1(gpointer data){ >> clock_t ret; >> struct tms buff; >> >> ret = times(&buff); >> printf("TIMER EXPIRE!1 %d\n", (int)ret); >> fflush(stdout); >> return FALSE; >> } >> gboolean timer_func3(gpointer data){ >> printf("TIMER EXPIRE 3!\n"); >> fflush(stdout); >> printf("remove timer1!\n"); >> >> fflush(stdout); >> g_source_remove(t1); >> printf("remove timer2!\n"); >> fflush(stdout); >> g_source_remove(t2); >> printf("remove timer3!\n"); >> fflush(stdout); >> g_source_remove(t3); >> return FALSE; >> } >> int main(int argc, char** argv){ >> GMainLoop *m; >> clock_t ret; >> struct tms buff; >> gint64 t; >> m = g_main_new(FALSE); >> t1 = g_timeout_add(1000, timer_func1, NULL); >> t2 = g_timeout_add(6, timer_func2, NULL); >> t3 = g_timeout_add(5000, timer_func3, NULL); >> ret = times(&buff); >> printf("START! %d\n", (int)ret); >> g_main_run(m); >> } >> >> }}} >> * Result >> RHEL6.5(glib2-2.22.5-7.el6) >> [root@snmp1 ~]# ./test2 >> START! 429576012 >> TIMER EXPIRE!1 429576112 >> TIMER EXPIRE 3! >> remove timer1! >> remove timer2! >> remove timer3! >> >> Ubuntu14.04(libglib2.0-0:amd64 2.40.0-2) >> root@a1be102:~# ./test2 >> START! 1718163089 >> TIMER EXPIRE!1 1718163189 >> TIMER EXPIRE 3! >> remove timer1! >> >> (process:1410): GLib-CRITICAL **: Source ID 1 was not found when attempting > to remove it >> remove timer2! >> remove timer3! >> >> >> These problems seem to be due to a correction of next glib somehow or > other. >> * > https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c > > The glib behaviour on unbuntu seems reasonable, removing a source multiple > times > IS a valid error. > I need the stack trace to know where/how this situation can occur in > pacemaker. 
> >> >> In g_source_remove() until before change, the deletion of the timer which > practice completed is possible, but g_source_remove() after the change causes > an > error. >> >> Under this influence, we get the following crit error in the environment of > Pacemaker using a new version of glib. >> >> lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to >> record non-fatal assert at logging.c:73 : Source ID 51 was not found when >> attempting to remove it >> lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found >> when attempting to remove it >> >> It seems that some kind of coping is necessary in Pacemaker when I think > about next. >> * Distribution using a new version of glib including Ubuntu. >> * Version up of future glib of RHEL. >> >> A similar problem is reported in the ML. >> * http://www.gossamer-threads.com/lists/linuxha/pacemaker/91333#91333 >> * http://www.gossamer-threads.com/lists/linuxha/pacemaker/92408 >> >> Best Regards, >> Hideo Yamauchi. >> >> ___ >> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.
[Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi All,

When I run the following sample on RHEL6.5 (glib2-2.22.5-7.el6) and Ubuntu14.04 (libglib2.0-0:amd64 2.40.0-2), the behaviour differs.

* Sample : test2.c
{{{
#include <glib.h>
#include <stdio.h>
#include <sys/times.h>
#include <time.h>

guint t1, t2, t3;

gboolean timer_func2(gpointer data){
    printf("TIMER EXPIRE!2\n");
    fflush(stdout);
    return FALSE;
}

gboolean timer_func1(gpointer data){
    clock_t ret;
    struct tms buff;

    ret = times(&buff);
    printf("TIMER EXPIRE!1 %d\n", (int)ret);
    fflush(stdout);
    return FALSE;
}

gboolean timer_func3(gpointer data){
    printf("TIMER EXPIRE 3!\n");
    fflush(stdout);
    printf("remove timer1!\n");
    fflush(stdout);
    g_source_remove(t1);
    printf("remove timer2!\n");
    fflush(stdout);
    g_source_remove(t2);
    printf("remove timer3!\n");
    fflush(stdout);
    g_source_remove(t3);
    return FALSE;
}

int main(int argc, char** argv){
    GMainLoop *m;
    clock_t ret;
    struct tms buff;
    gint64 t;

    m = g_main_new(FALSE);
    t1 = g_timeout_add(1000, timer_func1, NULL);
    t2 = g_timeout_add(6, timer_func2, NULL);
    t3 = g_timeout_add(5000, timer_func3, NULL);
    ret = times(&buff);
    printf("START! %d\n", (int)ret);
    g_main_run(m);
}
}}}

* Result
RHEL6.5(glib2-2.22.5-7.el6)
[root@snmp1 ~]# ./test2
START! 429576012
TIMER EXPIRE!1 429576112
TIMER EXPIRE 3!
remove timer1!
remove timer2!
remove timer3!

Ubuntu14.04(libglib2.0-0:amd64 2.40.0-2)
root@a1be102:~# ./test2
START! 1718163089
TIMER EXPIRE!1 1718163189
TIMER EXPIRE 3!
remove timer1!

(process:1410): GLib-CRITICAL **: Source ID 1 was not found when attempting to remove it
remove timer2!
remove timer3!

These problems seem to be due to the following glib change:
* https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c

Before that change, g_source_remove() on a timer whose callback had already completed was silently allowed, but after the change it causes an error.

Under this influence, we get the following crit error in environments where Pacemaker uses a new version of glib.
lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to record non-fatal assert at logging.c:73 : Source ID 51 was not found when attempting to remove it lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found when attempting to remove it It seems that some kind of coping is necessary in Pacemaker when I think about next. * Distribution using a new version of glib including Ubuntu. * Version up of future glib of RHEL. A similar problem is reported in the ML. * http://www.gossamer-threads.com/lists/linuxha/pacemaker/91333#91333 * http://www.gossamer-threads.com/lists/linuxha/pacemaker/92408 Best Regards, Hideo Yamauchi. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Lot of errors after update
Hi Andrew, >> lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to > record non-fatal assert at logging.c:73 : Source ID 51 was not found when > attempting to remove it >> lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found > when attempting to remove it > > stack trace of child 1840? No. I don't get it. But, I have a simple method to confirm a problem of glib. I register a problem with Bugzilla by the end of today and contact you. Best Regards, Hideo Yamauchi. - Original Message - > From: Andrew Beekhof > To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > > Cc: > Date: 2014/10/6, Mon 10:40 > Subject: Re: [Pacemaker] Lot of errors after update > > > On 3 Oct 2014, at 11:18 am, renayama19661...@ybb.ne.jp wrote: > >> Hi Andrew, >> >> About a similar problem, we confirmed it in Pacemaker1.1.12. >> The problem occurs in (glib2.40.0) in Ubuntu14.04. >> >> lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to > record non-fatal assert at logging.c:73 : Source ID 51 was not found when > attempting to remove it >> lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found > when attempting to remove it > > stack trace of child 1840? > >> >> >> This problem does not happen in RHEL6. >> >> >> The cause of the version of glib seem to be different. >> >> >> When g_source_remove does timer processing to return FALSE, it becomes the > error in glib2.40.0.(Probably as for the subsequent version too) >> >> It seems to be necessary to revise Pacemaker to solve a problem. >> >> Best Regards, >> Hideo Yamauchi. 
>> >> >> - Original Message - >>> From: Andrew Beekhof >>> To: The Pacemaker cluster resource manager > >>> Cc: >>> Date: 2014/10/3, Fri 08:06 >>> Subject: Re: [Pacemaker] Lot of errors after update >>> >>> >>> On 3 Oct 2014, at 12:10 am, Riccardo Bicelli > wrote: >>> I'm running pacemaker-1.0.10 >>> >>> well and truly time to get off the 1.0.x series >>> and glib-2.40.0-r1:2 on gentoo Il 30/09/2014 23:23, Andrew Beekhof ha scritto: > On 30 Sep 2014, at 11:36 pm, Riccardo Bicelli >>> > wrote: > > >> Hello, >> I've just updated my cluster nodes and now I see lot of > these >>> errors in syslog: >> >> Sep 30 15:32:43 localhost cib: [2870]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 28573 to record non-fatal assert at > utils.c:449 : >>> Source ID 128394 was not found when attempting to remove it >> Sep 30 15:32:55 localhost cib: [2870]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 28753 to record non-fatal assert at > utils.c:449 : >>> Source ID 128395 was not found when attempting to remove it >> Sep 30 15:32:55 localhost attrd: [2872]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 28756 to record non-fatal assert at > utils.c:449 : >>> Source ID 58434 was not found when attempting to remove it >> Sep 30 15:32:55 localhost cib: [2870]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 28757 to record non-fatal assert at > utils.c:449 : >>> Source ID 128396 was not found when attempting to remove it >> Sep 30 15:33:04 localhost cib: [2870]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 28876 to record non-fatal assert at > utils.c:449 : >>> Source ID 128397 was not found when attempting to remove it >> Sep 30 15:33:04 localhost attrd: [2872]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 28877 to record non-fatal assert at > utils.c:449 : >>> Source ID 58435 was not found when attempting to remove it >> Sep 30 15:33:04 localhost cib: [2870]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 28878 to record non-fatal assert 
at > utils.c:449 : >>> Source ID 128398 was not found when attempting to remove it >> Sep 30 15:33:11 localhost cib: [2870]: 29010 to record > non-fatal >>> assert at utils.c:449 : Source ID 128399 was not found when attempting > to remove >>> it >> Sep 30 15:33:11 localhost attrd: [2872]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 29011 to record non-fatal assert at > utils.c:449 : >>> Source ID 58436 was not found when attempting to remove it >> Sep 30 15:33:11 localhost cib: [2870]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 29012 to record non-fatal assert at > utils.c:449 : >>> Source ID 128400 was not found when attempting to remove it >> Sep 30 15:33:14 localhost cib: [2870]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 29060 to record non-fatal assert at > utils.c:449 : >>> Source ID 128401 was not found when attempting to remove it >> Sep 30 15:33:14 localhost attrd: [2872]: ERROR: crm_abort: >>> crm_glib_handler: Forked child 29061 to record non-fatal assert at > utils.c:449 : >>> Source ID 58437 was not found when attempting to remove it >> >> I do
Re: [Pacemaker] Lot of errors after update
Hi Andrew, About a similar problem, we confirmed it in Pacemaker1.1.12. The problem occurs in (glib2.40.0) in Ubuntu14.04. lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to record non-fatal assert at logging.c:73 : Source ID 51 was not found when attempting to remove it lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found when attempting to remove it This problem does not happen in RHEL6. The cause of the version of glib seem to be different. When g_source_remove does timer processing to return FALSE, it becomes the error in glib2.40.0.(Probably as for the subsequent version too) It seems to be necessary to revise Pacemaker to solve a problem. Best Regards, Hideo Yamauchi. - Original Message - > From: Andrew Beekhof > To: The Pacemaker cluster resource manager > Cc: > Date: 2014/10/3, Fri 08:06 > Subject: Re: [Pacemaker] Lot of errors after update > > > On 3 Oct 2014, at 12:10 am, Riccardo Bicelli wrote: > >> I'm running pacemaker-1.0.10 > > well and truly time to get off the 1.0.x series > >> and glib-2.40.0-r1:2 on gentoo >> >> Il 30/09/2014 23:23, Andrew Beekhof ha scritto: >>> On 30 Sep 2014, at 11:36 pm, Riccardo Bicelli > >>> wrote: >>> >>> Hello, I've just updated my cluster nodes and now I see lot of these > errors in syslog: Sep 30 15:32:43 localhost cib: [2870]: ERROR: crm_abort: > crm_glib_handler: Forked child 28573 to record non-fatal assert at > utils.c:449 : > Source ID 128394 was not found when attempting to remove it Sep 30 15:32:55 localhost cib: [2870]: ERROR: crm_abort: > crm_glib_handler: Forked child 28753 to record non-fatal assert at > utils.c:449 : > Source ID 128395 was not found when attempting to remove it Sep 30 15:32:55 localhost attrd: [2872]: ERROR: crm_abort: > crm_glib_handler: Forked child 28756 to record non-fatal assert at > utils.c:449 : > Source ID 58434 was not found when attempting to remove it Sep 30 15:32:55 localhost cib: [2870]: ERROR: crm_abort: > crm_glib_handler: Forked child 28757 
to record non-fatal assert at > utils.c:449 : > Source ID 128396 was not found when attempting to remove it Sep 30 15:33:04 localhost cib: [2870]: ERROR: crm_abort: > crm_glib_handler: Forked child 28876 to record non-fatal assert at > utils.c:449 : > Source ID 128397 was not found when attempting to remove it Sep 30 15:33:04 localhost attrd: [2872]: ERROR: crm_abort: > crm_glib_handler: Forked child 28877 to record non-fatal assert at > utils.c:449 : > Source ID 58435 was not found when attempting to remove it Sep 30 15:33:04 localhost cib: [2870]: ERROR: crm_abort: > crm_glib_handler: Forked child 28878 to record non-fatal assert at > utils.c:449 : > Source ID 128398 was not found when attempting to remove it Sep 30 15:33:11 localhost cib: [2870]: 29010 to record non-fatal > assert at utils.c:449 : Source ID 128399 was not found when attempting to > remove > it Sep 30 15:33:11 localhost attrd: [2872]: ERROR: crm_abort: > crm_glib_handler: Forked child 29011 to record non-fatal assert at > utils.c:449 : > Source ID 58436 was not found when attempting to remove it Sep 30 15:33:11 localhost cib: [2870]: ERROR: crm_abort: > crm_glib_handler: Forked child 29012 to record non-fatal assert at > utils.c:449 : > Source ID 128400 was not found when attempting to remove it Sep 30 15:33:14 localhost cib: [2870]: ERROR: crm_abort: > crm_glib_handler: Forked child 29060 to record non-fatal assert at > utils.c:449 : > Source ID 128401 was not found when attempting to remove it Sep 30 15:33:14 localhost attrd: [2872]: ERROR: crm_abort: > crm_glib_handler: Forked child 29061 to record non-fatal assert at > utils.c:449 : > Source ID 58437 was not found when attempting to remove it I don't understand what does it mean. >>> It means glib is bitching about something it didn't used to. >>> >>> What version of pacemaker did you update to? 
I'm reasonably > confident they're fixed in 1.1.12 >>> >>> >>> >>> ___ >>> Pacemaker mailing list: >>> Pacemaker@oss.clusterlabs.org >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker >>> >>> >>> Project Home: >>> http://www.clusterlabs.org >>> >>> Getting started: >>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >>> >>> Bugs: >>> http://bugs.clusterlabs.org >> >> ___ >> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >> Bugs: http://bugs.clusterlabs.org > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] query ?
Hi Alex,

Because the cluster-recheck timer fires every 15 minutes by default, the pengine recalculates the state transition each time it pops.

-
{ XML_CONFIG_ATTR_RECHECK, "cluster_recheck_interval", "time",
  "Zero disables polling. Positive values are an interval in seconds (unless other SI units are specified. eg. 5min)",
  "15min", &check_timer,
  "Polling interval for time based changes to options, resource parameters and constraints.",
  "The Cluster is primarily event driven, however the configuration can have elements that change based on time."
  " To ensure these changes take effect, we can optionally poll the cluster's status for changes."
},
{ "load-threshold", NULL, "percentage", NULL,
  "80%", &check_utilization,
  "The maximum amount of system resources that should be used by nodes in the cluster",
  "The cluster will slow down its recovery process when the amount of system resources used"
  " (currently CPU) approaches this limit",
},
-

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: Alex Samad - Yieldbroker
> To: "pacemaker@oss.clusterlabs.org"
> Cc:
> Date: 2014/9/29, Mon 10:56
> Subject: [Pacemaker] query ?
>
> Hi
>
> Is this normal logging ?
> > Not sure if I need to investigate any thing > > Sep 29 11:35:15 gsdmz1 crmd[2481]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED > origin=crm_timer_popped ] > Sep 29 11:35:15 gsdmz1 pengine[2480]: notice: unpack_config: On loss of CCM > Quorum: Ignore > Sep 29 11:35:15 gsdmz1 pengine[2480]: notice: process_pe_message: > Calculated > Transition 196: /var/lib/pacer/pengine/pe-input-247.bz2 > Sep 29 11:35:15 gsdmz1 crmd[2481]: notice: run_graph: Transition 196 > (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacer/pengine/pe-input-247.bz2): Complete > Sep 29 11:35:15 gsdmz1 crmd[2481]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Sep 29 11:50:15 gsdmz1 crmd[2481]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED > origin=crm_timer_popped ] > Sep 29 11:50:15 gsdmz1 pengine[2480]: notice: unpack_config: On loss of CCM > Quorum: Ignore > Sep 29 11:50:15 gsdmz1 pengine[2480]: notice: process_pe_message: > Calculated > Transition 197: /var/lib/pacer/pengine/pe-input-247.bz2 > Sep 29 11:50:15 gsdmz1 crmd[2481]: notice: run_graph: Transition 197 > (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacer/pengine/pe-input-247.bz2): Complete > Sep 29 11:50:15 gsdmz1 crmd[2481]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
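The polling described in the option text above is configurable. A possible crm shell invocation follows; the interval values are only illustrative, and note that the CIB property name uses dashes:

```
# poll for time-based changes every 30 minutes instead of the 15-minute default
crm configure property cluster-recheck-interval="30min"

# zero disables the periodic poll entirely (event-driven only)
crm configure property cluster-recheck-interval="0"
```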
Re: [Pacemaker] About a process name to output in log.
Hi Andrew, >> Is my understanding wrong? > > Without the above commit, the lrmd logs as 'paceamker_remoted' and > pacemaker_remoted logs as 'lrmd'. > We just needed to swap the two cases. Which is what the commit achieves. Okay! We use it with the thing which invalidated the commit mentioned above for a while. Many Thanks, Hideo Yamauchi. - Original Message - > From: Andrew Beekhof > To: renayama19661...@ybb.ne.jp > Cc: The Pacemaker cluster resource manager > Date: 2014/9/22, Mon 10:05 > Subject: Re: [Pacemaker] About a process name to output in log. > > > On 22 Sep 2014, at 10:54 am, renayama19661...@ybb.ne.jp wrote: > >> Hi Andrew, >> >> Thank you for comments. >> In the log of latest Pacemaker, the name of the lrmd process is > output by >>> the name of the pacemaker_remoted process. We like that log is output by default as lrmd. >>> >>> I think you just need: > https://github.com/beekhof/pacemaker/commit/ad083a8 >> >> >> But, I did not understand the meaning of your answer. >> >> >> It is SUPPORT_REMOTE macro to cause the macro of ENABLE_PCMK_REMOTE. >> If I cannot change SUPPORT_REMOTE by "configure" command, it is > necessary to change Makefile.am every time. >> >> Is my understanding wrong? > > Without the above commit, the lrmd logs as 'paceamker_remoted' and > pacemaker_remoted logs as 'lrmd'. > We just needed to swap the two cases. Which is what the commit achieves. > >> >> >> Best Regards, >> Hideo Yamauchi. >> >> >> >> - Original Message - >>> From: Andrew Beekhof >>> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > >>> Cc: >>> Date: 2014/9/19, Fri 20:50 >>> Subject: Re: [Pacemaker] About a process name to output in log. >>> >>> >>> On 19 Sep 2014, at 5:04 pm, renayama19661...@ybb.ne.jp wrote: >>> Hi All, In the log of latest Pacemaker, the name of the lrmd process is > output by >>> the name of the pacemaker_remoted process. We like that log is output by default as lrmd. 
>>> >>> I think you just need: > https://github.com/beekhof/pacemaker/commit/ad083a8 >>> These names seem to be changed on a macro. However, the option which even "configure" command > changes this >>> macro to does not seem to exist. * lrmd/Makefile.am (snip) pacemaker_remoted_CFLAGS= -DSUPPORT_REMOTE (snip) * lrmd/main.c (snip) #if defined(HAVE_GNUTLS_GNUTLS_H) && > defined(SUPPORT_REMOTE) # define ENABLE_PCMK_REMOTE #endif (snip) #ifndef ENABLE_PCMK_REMOTE crm_log_preinit("lrmd", argc, argv); crm_set_options(NULL, "[options]", long_options, "Daemon for controlling services > confirming to >>> different standards"); #else crm_log_preinit("pacemaker_remoted", argc, argv); crm_set_options(NULL, "[options]", long_options, "Pacemaker Remote daemon for extending > pacemaker >>> functionality to remote nodes."); #endif (snip) Please examine the option of the "configure" command to > give >>> macro. Best Regards, Hideo Yamauchi. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org >>> > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] About a process name to output in log.
Hi Andrew,

Thank you for comments.

>>> In the log of the latest Pacemaker, the lrmd process is logged under the
>>> name of the pacemaker_remoted process.
>>> We would like the log to use the name lrmd by default.
>
> I think you just need: https://github.com/beekhof/pacemaker/commit/ad083a8

But I did not understand the meaning of your answer.

It is the SUPPORT_REMOTE macro that causes ENABLE_PCMK_REMOTE to be defined.
If I cannot change SUPPORT_REMOTE via the "configure" command, it is necessary to change Makefile.am every time.

Is my understanding wrong?

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: Andrew Beekhof
> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager
> Cc:
> Date: 2014/9/19, Fri 20:50
> Subject: Re: [Pacemaker] About a process name to output in log.
>
> On 19 Sep 2014, at 5:04 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi All,
>>
>> In the log of the latest Pacemaker, the lrmd process is logged under the
>> name of the pacemaker_remoted process.
>> We would like the log to use the name lrmd by default.
>
> I think you just need: https://github.com/beekhof/pacemaker/commit/ad083a8
>
>> These names seem to be selected by a macro.
>> However, there does not seem to be any "configure" option
>> that changes this macro.
>> >> * lrmd/Makefile.am >> (snip) >> pacemaker_remoted_CFLAGS= -DSUPPORT_REMOTE >> (snip) >> * lrmd/main.c >> (snip) >> #if defined(HAVE_GNUTLS_GNUTLS_H) && defined(SUPPORT_REMOTE) >> # define ENABLE_PCMK_REMOTE >> #endif >> (snip) >> #ifndef ENABLE_PCMK_REMOTE >> crm_log_preinit("lrmd", argc, argv); >> crm_set_options(NULL, "[options]", long_options, >> "Daemon for controlling services confirming to > different standards"); >> #else >> crm_log_preinit("pacemaker_remoted", argc, argv); >> crm_set_options(NULL, "[options]", long_options, >> "Pacemaker Remote daemon for extending pacemaker > functionality to remote nodes."); >> #endif >> (snip) >> >> >> Please examine the option of the "configure" command to give > macro. >> >> Best Regards, >> Hideo Yamauchi. >> >> >> ___ >> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >> Bugs: http://bugs.clusterlabs.org > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] About a process name to output in log.
Hi All,

In the log of the latest Pacemaker, the lrmd process is logged under the name of the pacemaker_remoted process.
We would like the log to use the name lrmd by default.

These names seem to be selected by a macro. However, there does not seem to be any "configure" option that changes this macro.

* lrmd/Makefile.am
(snip)
pacemaker_remoted_CFLAGS = -DSUPPORT_REMOTE
(snip)

* lrmd/main.c
(snip)
#if defined(HAVE_GNUTLS_GNUTLS_H) && defined(SUPPORT_REMOTE)
#  define ENABLE_PCMK_REMOTE
#endif
(snip)
#ifndef ENABLE_PCMK_REMOTE
    crm_log_preinit("lrmd", argc, argv);
    crm_set_options(NULL, "[options]", long_options,
                    "Daemon for controlling services confirming to different standards");
#else
    crm_log_preinit("pacemaker_remoted", argc, argv);
    crm_set_options(NULL, "[options]", long_options,
                    "Pacemaker Remote daemon for extending pacemaker functionality to remote nodes.");
#endif
(snip)

Please consider adding a "configure" option that sets this macro.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

Thank you for comments.

> I'll file a bug against glib on RHEL6 so that it gets fixed there.
> Can you send me your simple reproducer program?

When the system time is changed while timer_func2() is running, the timeout of timer_func() completes before the scheduled time.

-
#include <stdio.h>
#include <stdlib.h>
#include <sys/times.h>
#include <glib.h>

gboolean timer_func(gpointer data){
    printf("TIMER EXPIRE!\n");
    fflush(stdout);
    exit(1);
    // return FALSE;
}

gboolean timer_func2(gpointer data){
    clock_t ret;
    struct tms buff;
    ret = times(&buff);
    printf("TIMER2 EXPIRE! %d\n", ret);
    fflush(stdout);
    return TRUE;
}

int main(int argc, char** argv){
    GMainLoop *m;
    clock_t ret;
    struct tms buff;
    gint64 t;

    // t = g_get_monotonic_time();
    m = g_main_new(FALSE);
    g_timeout_add(5000, timer_func2, NULL);
    g_timeout_add(6, timer_func, NULL);
    ret = times(&buff);
    printf("START! %d\n", ret);
    g_main_run(m);
}
-

Many Thanks,
Hideo Yamauchi.

- Original Message -
> From: Andrew Beekhof
> To: renayama19661...@ybb.ne.jp
> Cc: The Pacemaker cluster resource manager
> Date: 2014/9/10, Wed 13:56
> Subject: Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision
> of the system time.
>
> On 10 Sep 2014, at 2:48 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi Andrew,
>>
>> I confirmed it in various ways.
>>
>> The behavior differs depending on the version of glib.
>> * The problem occurs on RHEL6.x.
>> * The problem does not occur on RHEL7.0.
>>
>> This problem is fixed in newer versions of glib.
>>
>> The following glib change appears to be the fix:
>> * https://github.com/GNOME/glib/commit/91113a8aeea40cc2d7dda65b09537980bb602a06#diff-fc9b4bb280a13f8e51c51b434e7d26fd
>>
>> Many users still expect correct behavior with the old glib,
>> * at least until they move to RHEL7...
>>
>> Would you consider a modification in Pacemaker to support the old version?
>> * Modeled on the old G_() function.
>
> I'll file a bug against glib on RHEL6 so that it gets fixed there.
> Can you send me your simple reproducer program?
> >> >> Best Regards, >> Hideo Yamauchi. >> >> >> >> - Original Message - >>> From: Andrew Beekhof >>> To: renayama19661...@ybb.ne.jp >>> Cc: The Pacemaker cluster resource manager > >>> Date: 2014/9/8, Mon 19:55 >>> Subject: Re: [Pacemaker] [Problem] lrmd detects monitor time-out by > revision of the system time. >>> >>> >>> On 8 Sep 2014, at 7:12 pm, renayama19661...@ybb.ne.jp wrote: >>> Hi Andrew, I confirmed some problems, but seem to be caused by > the >>> fact that > an event >>> occurs somehow or other in g_main_loop of lrmd in the > period >>> when it is > shorter >>> than a monitor. >>> >>> So if you create a trivial program with g_main_loop and > a >>> timer, and > then change >>> the system time, does the timer expire early? >> >> Yes. > > That sounds like a glib bug. Ideally we'd get it fixed > there rather >>> than > work-around it in pacemaker. > Have you spoken to them at all? > No. I investigate glib library a little more. And I talk with community of glib. I may talk again afterwards. >>> >>> Cool. I somewhat expect them to say "working as designed". >>> Which would be unfortunate, but it shouldn't be too hard to work > around. >>> > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

I confirmed it in various ways.

The behavior differs depending on the version of glib:
* The problem occurs on RHEL6.x.
* The problem does not occur on RHEL7.0.

This problem is fixed in newer versions of glib. The following glib change appears to be the fix:
* https://github.com/GNOME/glib/commit/91113a8aeea40cc2d7dda65b09537980bb602a06#diff-fc9b4bb280a13f8e51c51b434e7d26fd

Many users still expect correct behavior with the old glib, at least until they move to RHEL7.
Would you consider a modification in Pacemaker to support the old version, modeled on the old G_() function?

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: Andrew Beekhof
> To: renayama19661...@ybb.ne.jp
> Cc: The Pacemaker cluster resource manager
> Date: 2014/9/8, Mon 19:55
> Subject: Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision
> of the system time.
>
> On 8 Sep 2014, at 7:12 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi Andrew,
>>
>>>>> I confirmed some problems, but they seem to be caused by the fact that
>>>>> an event occurs in the g_main_loop of lrmd in a period shorter than the
>>>>> monitor interval.
>>>>
>>>> So if you create a trivial program with g_main_loop and a timer, and then
>>>> change the system time, does the timer expire early?
>>>
>>> Yes.
>>
>> That sounds like a glib bug. Ideally we'd get it fixed there rather than
>> work-around it in pacemaker.
>> Have you spoken to them at all?
>>
>> No.
>> I will investigate the glib library a little more,
>> and talk with the glib community.
>> I may follow up afterwards.
>
> Cool. I somewhat expect them to say "working as designed".
> Which would be unfortunate, but it shouldn't be too hard to work around.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

>>>> I confirmed some problems, but they seem to be caused by the fact that
>>>> an event occurs in the g_main_loop of lrmd in a period shorter than the
>>>> monitor interval.
>>>
>>> So if you create a trivial program with g_main_loop and a timer, and then
>>> change the system time, does the timer expire early?
>>
>> Yes.
>
> That sounds like a glib bug. Ideally we'd get it fixed there rather than
> work-around it in pacemaker.
> Have you spoken to them at all?

No.
I will investigate the glib library a little more, and discuss it with the glib community.
I may follow up afterwards.

Many Thanks,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

Thank you for comments.

>> I confirmed some problems, but they seem to be caused by the fact that an
>> event occurs in the g_main_loop of lrmd in a period shorter than the
>> monitor interval.
>
> So if you create a trivial program with g_main_loop and a timer, and then
> change the system time, does the timer expire early?

Yes.

>> This problem does not seem to happen in the lrmd of PM1.0.
>
> cluster-glue was probably using custom timeout code.

I looked at the implementation in glue as well.
It seems the timeout handling of the new lrmd needs an implementation similar to glue's.

Best Regards,
Hideo Yamauchi.
[Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi All,

We confirmed that lrmd caused a time-out of the monitor when the system time was changed.
This is a serious problem when the system time is adjusted, for example by ntpd.

We can reproduce this problem with the following procedure.

Step 1) Start Pacemaker on a single node.

[root@snmp1 ~]# start pacemaker.combined
pacemaker.combined start/running, process 11382

Step 2) Load a simple crm file.

trac2915-3.crm
--
primitive prmDummyA ocf:pacemaker:Dummy1 \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="30s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"

group grpA prmDummyA

location rsc_location-grpA-1 grpA \
        rule $id="rsc_location-grpA-1-rule" 200: #uname eq snmp1 \
        rule $id="rsc_location-grpA-1-rule-0" 100: #uname eq snmp2

property $id="cib-bootstrap-options" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        crmd-transition-delay="2s"

rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"
--

[root@snmp1 ~]# crm configure load update trac2915-3.crm
WARNING: rsc_location-grpA-1: referenced node snmp2 does not exist

[root@snmp1 ~]# crm_mon -1 -Af
Last updated: Fri Sep  5 13:09:45 2014
Last change: Fri Sep  5 13:09:13 2014
Stack: corosync
Current DC: snmp1 (3232238180) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
1 Nodes configured
1 Resources configured

Online: [ snmp1 ]

 Resource Group: grpA
     prmDummyA  (ocf::pacemaker:Dummy1):        Started snmp1

Node Attributes:
* Node snmp1:

Migration summary:
* Node snmp1:

Step 3) Just after the monitor of the resource starts, advance the time past the monitor timeout (timeout="30s").

[root@snmp1 ~]# date -s +40sec
Fri Sep  5 13:11:04 JST 2014

Step 4) The time-out of the monitor occurs.
[root@snmp1 ~]# crm_mon -1 -Af
Last updated: Fri Sep  5 13:11:24 2014
Last change: Fri Sep  5 13:09:13 2014
Stack: corosync
Current DC: snmp1 (3232238180) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
1 Nodes configured
1 Resources configured

Online: [ snmp1 ]

Node Attributes:
* Node snmp1:

Migration summary:
* Node snmp1:
   prmDummyA: migration-threshold=1 fail-count=1 last-failure='Fri Sep  5 13:11:04 2014'

Failed actions:
    prmDummyA_monitor_1 on snmp1 'unknown error' (1): call=7, status=Timed Out,
    last-rc-change='Fri Sep  5 13:11:04 2014', queued=0ms, exec=0ms

From what I confirmed, the problem seems to be caused by an event firing in the g_main_loop of lrmd in a period shorter than the monitor interval.
This problem does not seem to happen in the lrmd of PM1.0.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Question] About snmp trap of crm_mon.
Hi Andrew,

>> Perhaps someone feels like testing this:
>> https://github.com/beekhof/pacemaker/commit/3df6aff
>>
>> Otherwise I'll do it on monday

I confirmed the output of the SNMP trap for resources and the SNMP trap for STONITH.
With your fix, the crm_mon command now sends traps.
Please merge the fix into the master repository.

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: "renayama19661...@ybb.ne.jp"
> To: Andrew Beekhof; The Pacemaker cluster resource manager
> Date: 2014/7/25, Fri 14:21
> Subject: Re: [Pacemaker] [Question] About snmp trap of crm_mon.
>
> Hi Andrew,
>
>> Perhaps someone feels like testing this:
>> https://github.com/beekhof/pacemaker/commit/3df6aff
>>
>> Otherwise I'll do it on monday
>
> Thank you for the quick fix.
> I will confirm the snmp behavior by the end of Monday.
>
> Many Thanks!
> Hideo Yamauchi.
>
> - Original Message -
>> From: Andrew Beekhof
>> To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager
>> Cc:
>> Date: 2014/7/25, Fri 14:02
>> Subject: Re: [Pacemaker] [Question] About snmp trap of crm_mon.
>>
>> On 24 Jul 2014, at 6:32 pm, Andrew Beekhof wrote:
>>
>>> On 24 Jul 2014, at 11:54 am, renayama19661...@ybb.ne.jp wrote:
>>>
>>>> Hi All,
>>>>
>>>> We were going to confirm the snmptrap function in crm_mon of Pacemaker 1.1.12.
>>>> However, crm_mon does not seem to support messages for the new cib diff format.
>>>
>>> dammit :(
>>
>> Perhaps someone feels like testing this:
>> https://github.com/beekhof/pacemaker/commit/3df6aff
>>
>> Otherwise I'll do it on monday
>>
>>>> void
>>>> crm_diff_update(const char *event, xmlNode * msg)
>>>> {
>>>>     int rc = -1;
>>>>     long now = time(NULL);
>>>>     (snip)
>>>>     if (crm_mail_to || snmp_target || external_agent) {
>>>>         /* Process operation updates */
>>>>         xmlXPathObject *xpathObj = xpath_search(msg,
>>>>                                                 "//" F_CIB_UPDATE_RESULT
>>>>                                                 "//" XML_TAG_DIFF_ADDED
>>>>                                                 "//" XML_LRM_TAG_RSC_OP);
>>>>         int lpc = 0, max = numXpathResults(xpathObj);
>>>>     (snip)
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
Re: [Pacemaker] [Question] About snmp trap of crm_mon.
Hi Andrew, > Perhaps someone feels like testing this: > https://github.com/beekhof/pacemaker/commit/3df6aff > > Otherwise I'll do it on monday An immediate correction, thank you. I confirm snmp by the end of Monday. Many Thanks! Hideo Yamauchi. - Original Message - > From: Andrew Beekhof > To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager > > Cc: > Date: 2014/7/25, Fri 14:02 > Subject: Re: [Pacemaker] [Question] About snmp trap of crm_mon. > > > On 24 Jul 2014, at 6:32 pm, Andrew Beekhof wrote: > >> >> On 24 Jul 2014, at 11:54 am, renayama19661...@ybb.ne.jp wrote: >> >>> Hi All, >>> >>> We were going to confirm snmptrap function in crm_mon of > Pacemaker1.1.12. >>> However, crm_mon does not seem to support a message for a new > difference of cib. >> >> dammit :( > > Perhaps someone feels like testing this: > https://github.com/beekhof/pacemaker/commit/3df6aff > > Otherwise I'll do it on monday > >> >>> >>> >>> void >>> crm_diff_update(const char *event, xmlNode * msg) >>> { >>> int rc = -1; >>> long now = time(NULL); >>> (snip) >>> if (crm_mail_to || snmp_target || external_agent) { >>> /* Process operation updates */ >>> xmlXPathObject *xpathObj = xpath_search(msg, >>> "//" > F_CIB_UPDATE_RESULT "//" XML_TAG_DIFF_ADDED >>> "//" > XML_LRM_TAG_RSC_OP); >>> int lpc = 0, max = numXpathResults(xpathObj); >>> (snip) >>> >>> Best Regards, >>> Hideo Yamauch. >>> >>> >>> ___ >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker >>> >>> Project Home: http://www.clusterlabs.org >>> Getting started: > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >>> Bugs: http://bugs.clusterlabs.org >> > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Question] About snmp trap of crm_mon.
Hi All,

We were going to confirm the snmptrap function in crm_mon of Pacemaker 1.1.12.
However, crm_mon does not seem to support messages for the new cib diff format.

void
crm_diff_update(const char *event, xmlNode * msg)
{
    int rc = -1;
    long now = time(NULL);
    (snip)
    if (crm_mail_to || snmp_target || external_agent) {
        /* Process operation updates */
        xmlXPathObject *xpathObj = xpath_search(msg,
                                                "//" F_CIB_UPDATE_RESULT
                                                "//" XML_TAG_DIFF_ADDED
                                                "//" XML_LRM_TAG_RSC_OP);
        int lpc = 0, max = numXpathResults(xpathObj);
    (snip)

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Enhancement] When attrd reboots, the attribute disappears.
Hi Andrew,

Thank you for comments.

> Please use bugs.clusterlabs.org in future.
> I'll follow up in bugzilla

Okay!

Best Regards,
Hideo Yamauchi.

--- On Tue, 2014/6/10, Andrew Beekhof wrote:

> On 9 Jun 2014, at 12:01 pm, renayama19661...@ybb.ne.jp wrote:
>
> > Hi All,
> >
> > I submitted a problem to the following bugzilla in the past.
> > * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2501
>
> Please use bugs.clusterlabs.org in future.
> I'll follow up in bugzilla
>
> > A similar phenomenon occurs in attrd of the latest Pacemaker.
> >
> > Step 1) Configure the cluster as follows.
> > export PCMK_fail_fast=no
> >
> > Step 2) Start the cluster.
> >
> > Step 3) Cause a resource failure and raise the failure count (fail-count).
> >
> > [root@srv01 ~]# crm_mon -1 -Af
> > (snip)
> > Online: [ srv01 ]
> >
> > before-dummy (ocf::heartbeat:Dummy): Started srv01
> > vip-master (ocf::heartbeat:Dummy2): Started srv01
> >
> > Migration summary:
> > * Node srv01:
> >    before-dummy: migration-threshold=10 fail-count=1 last-failure='Mon Jun  9 19:21:07 2014'
> >
> > Failed actions:
> >     before-dummy_monitor_1 on srv01 'not running' (7): call=11,
> >     status=complete, last-rc-change='Mon Jun  9 19:21:07 2014', queued=0ms, exec=0ms
> >
> > Step 4) Restart attrd with kill. (Assume attrd crashed and was restarted.)
> >
> > Step 5) Cause the same resource failure as in step 3 again.
> > * The failure count (fail-count) returns to 1.
> > [root@srv01 ~]# crm_mon -1 -Af
> > (snip)
> > Online: [ srv01 ]
> >
> > before-dummy (ocf::heartbeat:Dummy): Started srv01
> > vip-master (ocf::heartbeat:Dummy2): Started srv01
> >
> > Migration summary:
> > * Node srv01:
> >    before-dummy: migration-threshold=10 fail-count=1 last-failure='Mon Jun  9 19:22:47 2014'
> >
> > Failed actions:
> >     before-dummy_monitor_1 on srv01 'not running' (7): call=17,
> >     status=complete, last-rc-change='Mon Jun  9 19:22:47 2014', queued=0ms, exec=0ms
> >
> > Even if attrd is restarted, I think attrd needs to be improved so that attributes are preserved correctly.
> >
> > Best Regards,
> > Hideo Yamauchi.
[Pacemaker] [Enhancement] When attrd reboots, the attribute disappears.
Hi All,

I submitted a problem to the following bugzilla in the past.
* https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2501

A similar phenomenon occurs in attrd of the latest Pacemaker.

Step 1) Configure the cluster as follows.
export PCMK_fail_fast=no

Step 2) Start the cluster.

Step 3) Cause a resource failure and raise the failure count (fail-count).

[root@srv01 ~]# crm_mon -1 -Af
(snip)
Online: [ srv01 ]

before-dummy (ocf::heartbeat:Dummy): Started srv01
vip-master (ocf::heartbeat:Dummy2): Started srv01

Migration summary:
* Node srv01:
   before-dummy: migration-threshold=10 fail-count=1 last-failure='Mon Jun  9 19:21:07 2014'

Failed actions:
    before-dummy_monitor_1 on srv01 'not running' (7): call=11, status=complete,
    last-rc-change='Mon Jun  9 19:21:07 2014', queued=0ms, exec=0ms

Step 4) Restart attrd with kill. (Assume attrd crashed and was restarted.)

Step 5) Cause the same resource failure as in step 3 again.
* The failure count (fail-count) returns to 1.

[root@srv01 ~]# crm_mon -1 -Af
(snip)
Online: [ srv01 ]

before-dummy (ocf::heartbeat:Dummy): Started srv01
vip-master (ocf::heartbeat:Dummy2): Started srv01

Migration summary:
* Node srv01:
   before-dummy: migration-threshold=10 fail-count=1 last-failure='Mon Jun  9 19:22:47 2014'

Failed actions:
    before-dummy_monitor_1 on srv01 'not running' (7): call=17, status=complete,
    last-rc-change='Mon Jun  9 19:22:47 2014', queued=0ms, exec=0ms

Even if attrd is restarted, I think attrd needs to be improved so that attributes are preserved correctly.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] The "dampen" parameter of the attrd_updater command is ignored, and an attribute is updated.
Hi Andrew, I confirmed movement at once. Your patch solves a problem. Many Thanks! Hideo Yamauchi. --- On Wed, 2014/5/28, renayama19661...@ybb.ne.jp wrote: > Hi Andrew, > > > Perhaps try: > > > > diff --git a/attrd/commands.c b/attrd/commands.c > > index 7f1b4b0..7342e23 100644 > > --- a/attrd/commands.c > > +++ b/attrd/commands.c > > @@ -464,6 +464,15 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > > filter) > > > > a->changed |= changed; > > > > + if(changed) { > > + if(a->timer) { > > + crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > > a->id); > > + mainloop_timer_start(a->timer); > > + } else { > > + write_or_elect_attribute(a); > > + } > > + } > > + > > /* this only involves cluster nodes. */ > > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) > > == 0) { > > @@ -476,15 +485,6 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > > filter) > > } > > } > > } > > - > > - if(changed) { > > - if(a->timer) { > > - crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > > a->id); > > - mainloop_timer_start(a->timer); > > - } else { > > - write_or_elect_attribute(a); > > - } > > - } > > } > > > > void > > Okay! > I confirm movement. > > Many Thanks! > Hideo Yamauchi. > > --- On Wed, 2014/5/28, Andrew Beekhof wrote: > > > > > On 28 May 2014, at 4:10 pm, Andrew Beekhof wrote: > > > > > > > > On 28 May 2014, at 3:04 pm, renayama19661...@ybb.ne.jp wrote: > > > > > >> Hi Andrew, > > >> > > I'd expect that block to hit this clause though: > > > > } else if(mainloop_timer_running(a->timer)) { > > crm_info("Write out of '%s' delayed: timer is running", a->id); > > return; > > >>> > > >>> Which point of the source code does the suggested code mentioned above > > >>> revise? > > >>> (Which line of the source code is it?) > > >> > > >> Is it the next cord that you pointed? 
> > > > > > right > > > > > >> > > >> void > > >> write_attribute(attribute_t *a) > > >> { > > >> int updates = 0; > > >> (snip) > > >> } else if(mainloop_timer_running(a->timer)) { > > >> crm_info("Write out of '%s' delayed: timer is running", a->id); > > >> return; > > >> } > > >> (snip) > > >> > > >> At the time of phenomenon of the problem, the timer does not yet block > > >> it by this processing because it does not start. > > > > > > Thats the curious part > > > > Perhaps try: > > > > diff --git a/attrd/commands.c b/attrd/commands.c > > index 7f1b4b0..7342e23 100644 > > --- a/attrd/commands.c > > +++ b/attrd/commands.c > > @@ -464,6 +464,15 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > > filter) > > > > a->changed |= changed; > > > > + if(changed) { > > + if(a->timer) { > > + crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > > a->id); > > + mainloop_timer_start(a->timer); > > + } else { > > + write_or_elect_attribute(a); > > + } > > + } > > + > > /* this only involves cluster nodes. 
*/ > > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) > > == 0) { > > @@ -476,15 +485,6 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > > filter) > > } > > } > > } > > - > > - if(changed) { > > - if(a->timer) { > > - crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > > a->id); > > - mainloop_timer_start(a->timer); > > - } else { > > - write_or_elect_attribute(a); > > - } > > - } > > } > > > > void > > > > > > > > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem] The "dampen" parameter of the attrd_updater command is ignored, and an attribute is updated.
Hi Andrew, > Perhaps try: > > diff --git a/attrd/commands.c b/attrd/commands.c > index 7f1b4b0..7342e23 100644 > --- a/attrd/commands.c > +++ b/attrd/commands.c > @@ -464,6 +464,15 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > filter) > > a->changed |= changed; > > + if(changed) { > + if(a->timer) { > + crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > a->id); > + mainloop_timer_start(a->timer); > + } else { > + write_or_elect_attribute(a); > + } > + } > + > /* this only involves cluster nodes. */ > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) == > 0) { > @@ -476,15 +485,6 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > filter) > } > } > } > - > - if(changed) { > - if(a->timer) { > - crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > a->id); > - mainloop_timer_start(a->timer); > - } else { > - write_or_elect_attribute(a); > - } > - } > } > > void Okay! I confirm movement. Many Thanks! Hideo Yamauchi. --- On Wed, 2014/5/28, Andrew Beekhof wrote: > > On 28 May 2014, at 4:10 pm, Andrew Beekhof wrote: > > > > > On 28 May 2014, at 3:04 pm, renayama19661...@ybb.ne.jp wrote: > > > >> Hi Andrew, > >> > I'd expect that block to hit this clause though: > > } else if(mainloop_timer_running(a->timer)) { > crm_info("Write out of '%s' delayed: timer is running", a->id); > return; > >>> > >>> Which point of the source code does the suggested code mentioned above > >>> revise? > >>> (Which line of the source code is it?) > >> > >> Is it the next cord that you pointed? > > > > right > > > >> > >> void > >> write_attribute(attribute_t *a) > >> { > >> int updates = 0; > >> (snip) > >> } else if(mainloop_timer_running(a->timer)) { > >> crm_info("Write out of '%s' delayed: timer is running", a->id); > >> return; > >> } > >> (snip) > >> > >> At the time of phenomenon of the problem, the timer does not yet block it > >> by this processing because it does not start. 
> > > > Thats the curious part > > Perhaps try: > > diff --git a/attrd/commands.c b/attrd/commands.c > index 7f1b4b0..7342e23 100644 > --- a/attrd/commands.c > +++ b/attrd/commands.c > @@ -464,6 +464,15 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > filter) > > a->changed |= changed; > > + if(changed) { > + if(a->timer) { > + crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > a->id); > + mainloop_timer_start(a->timer); > + } else { > + write_or_elect_attribute(a); > + } > + } > + > /* this only involves cluster nodes. */ > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) == > 0) { > @@ -476,15 +485,6 @@ attrd_peer_update(crm_node_t *peer, xmlNode *xml, bool > filter) > } > } > } > - > - if(changed) { > - if(a->timer) { > - crm_trace("Delayed write out (%dms) for %s", a->timeout_ms, > a->id); > - mainloop_timer_start(a->timer); > - } else { > - write_or_elect_attribute(a); > - } > - } > } > > void > > > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
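The effect of the patch quoted above is purely an ordering change, and it can be sketched outside of pacemaker. The following is a hypothetical, simplified simulation in plain Python (the real attrd code is C and uses glib main-loop timers; all names here are stand-ins): because the dampen timer is now started before the election write-out path runs, the later write_attribute() call finds the timer running and backs off instead of writing immediately.

```python
# Hypothetical simulation of the control flow, NOT real pacemaker code:
# the names mirror attrd/commands.c, but everything here is a stand-in.

class Attribute:
    def __init__(self, dampen_ms):
        self.dampen_ms = dampen_ms    # the attrd_updater -d / "dampen" value
        self.timer_running = False    # stands in for mainloop_timer_running()
        self.written = False          # stands in for the actual CIB write

def write_attribute(attr):
    # Mirrors the clause quoted in the thread: a running dampen timer
    # delays the write ("Write out of '%s' delayed: timer is running").
    if attr.timer_running:
        return "delayed"
    attr.written = True
    return "written"

def peer_update(attr, changed, election_won):
    # Patched ordering: handle the dampen timer FIRST ...
    if changed:
        if attr.dampen_ms > 0:
            attr.timer_running = True   # mainloop_timer_start(a->timer)
        else:
            write_attribute(attr)       # no dampening: write immediately
    # ... so the election write-out below now sees the running timer
    # and backs off, instead of writing straight away (the reported bug).
    if election_won:
        write_attribute(attr)

a = Attribute(dampen_ms=3000)
peer_update(a, changed=True, election_won=True)
print(a.written)  # False: the write is deferred until the timer fires
```

Before the patch, the election branch ran first and returned after writing, so the configured dampen timer never got a chance to start; moving the changed/timer block earlier is what makes the dampen interval effective.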
Re: [Pacemaker] [Problem] The "dampen" parameter of the attrd_updater command is ignored, and an attribute is updated.
Hi Andrew, > > I'd expect that block to hit this clause though: > > > > } else if(mainloop_timer_running(a->timer)) { > > crm_info("Write out of '%s' delayed: timer is running", a->id); > > return; > > Which point of the source code does the suggested code mentioned above revise? > (Which line of the source code is it?) Is it the next cord that you pointed? void write_attribute(attribute_t *a) { int updates = 0; (snip) } else if(mainloop_timer_running(a->timer)) { crm_info("Write out of '%s' delayed: timer is running", a->id); return; } (snip) At the time of phenomenon of the problem, the timer does not yet block it by this processing because it does not start. Best Regards, Hideo Yamauchi. --- On Wed, 2014/5/28, renayama19661...@ybb.ne.jp wrote: > Hi Andrew, > > Thank you for comment. > > > > --- attrd/command.c - > > > (snip) > > > /* this only involves cluster nodes. */ > > > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > > > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) > > >== 0) { > > > /* Create the name/id association */ > > > crm_node_t *peer = crm_get_peer(v->nodeid, host); > > > crm_trace("We know %s's node id now: %s", peer->uname, > > >peer->uuid); > > > if(election_state(writer) == election_won) { > > > write_attributes(FALSE, TRUE); > > > return; > > > } > > > } > > > } > > > > This is for 5194 right? > > No. > I listed the same thing in 5194, but this does not seem to be 5194 basic > problems. > > I try the reproduction of 5194 problems, but have not been able to yet > reappear. > Possibly 5194 problems may not happen in PM1.1.12-rc1. > * As for 5194 matters, please give me time a little more. > > > > > I'd expect that block to hit this clause though: > > > > } else if(mainloop_timer_running(a->timer)) { > > crm_info("Write out of '%s' delayed: timer is running", a->id); > > return; > > Which point of the source code does the suggested code mentioned above revise? > (Which line of the source code is it?) 
> > Best Regards, > Hideo Yamauchi. > > --- On Wed, 2014/5/28, Andrew Beekhof wrote: > > > > > On 27 May 2014, at 12:13 pm, renayama19661...@ybb.ne.jp wrote: > > > > > Hi All, > > > > > > The attrd_updater command ignores the "dampen" parameter and updates an > > > attribute. > > > > > > Step1) Start one node. > > > [root@srv01 ~]# crm_mon -1 -Af > > > Last updated: Tue May 27 19:36:35 2014 > > > Last change: Tue May 27 19:34:59 2014 > > > Stack: corosync > > > Current DC: srv01 (3232238180) - partition WITHOUT quorum > > > Version: 1.1.11-f0f09b8 > > > 1 Nodes configured > > > 0 Resources configured > > > > > > > > > Online: [ srv01 ] > > > > > > > > > Node Attributes: > > > * Node srv01: > > > > > > Migration summary: > > > * Node srv01: > > > > > > Step2) Update an attribute by attrd_updater command. > > > [root@srv01 ~]# attrd_updater -n default_ping_set -U 500 -d 3000 > > > > > > Step3) The attribute is updated without waiting for the time of the > > > "dampen" parameter. > > > [root@srv01 ~]# cibadmin -Q | grep ping_set > > > > >name="default_ping_set" value="500"/> > > > > > > The next code seems to have a problem somehow or other. > > > > > > --- attrd/command.c - > > > (snip) > > > /* this only involves cluster nodes. */ > > > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > > > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) > > >== 0) { > > > /* Create the name/id association */ > > > crm_node_t *peer = crm_get_peer(v->nodeid, host); > > > crm_trace("We know %s's node id now: %s", peer->uname, > > >peer->uuid); > > > if(election_state(writer) == election_won) { > > > write_attributes(FALSE, TRUE); > > > return; > > > } > > > } > > > } > > > > This is for 5194 right? 
> > > > I'd expect that block to hit this clause though: > > > > } else if(mainloop_timer_running(a->timer)) { > > crm_info("Write out of '%s' delayed: timer is running", a->id); > > return; > > > > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem] The "dampen" parameter of the attrd_updater command is ignored, and an attribute is updated.
Hi Andrew, Thank you for comment. > > --- attrd/command.c - > > (snip) > > /* this only involves cluster nodes. */ > > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) == > >0) { > > /* Create the name/id association */ > > crm_node_t *peer = crm_get_peer(v->nodeid, host); > > crm_trace("We know %s's node id now: %s", peer->uname, > >peer->uuid); > > if(election_state(writer) == election_won) { > > write_attributes(FALSE, TRUE); > > return; > > } > > } > > } > > This is for 5194 right? No. I listed the same thing in 5194, but this does not seem to be 5194 basic problems. I try the reproduction of 5194 problems, but have not been able to yet reappear. Possibly 5194 problems may not happen in PM1.1.12-rc1. * As for 5194 matters, please give me time a little more. > > I'd expect that block to hit this clause though: > > } else if(mainloop_timer_running(a->timer)) { > crm_info("Write out of '%s' delayed: timer is running", a->id); > return; Which point of the source code does the suggested code mentioned above revise? (Which line of the source code is it?) Best Regards, Hideo Yamauchi. --- On Wed, 2014/5/28, Andrew Beekhof wrote: > > On 27 May 2014, at 12:13 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > The attrd_updater command ignores the "dampen" parameter and updates an > > attribute. > > > > Step1) Start one node. > > [root@srv01 ~]# crm_mon -1 -Af > > Last updated: Tue May 27 19:36:35 2014 > > Last change: Tue May 27 19:34:59 2014 > > Stack: corosync > > Current DC: srv01 (3232238180) - partition WITHOUT quorum > > Version: 1.1.11-f0f09b8 > > 1 Nodes configured > > 0 Resources configured > > > > > > Online: [ srv01 ] > > > > > > Node Attributes: > > * Node srv01: > > > > Migration summary: > > * Node srv01: > > > > Step2) Update an attribute by attrd_updater command. 
> > [root@srv01 ~]# attrd_updater -n default_ping_set -U 500 -d 3000 > > > > Step3) The attribute is updated without waiting for the time of the > > "dampen" parameter. > > [root@srv01 ~]# cibadmin -Q | grep ping_set > > >name="default_ping_set" value="500"/> > > > > The next code seems to have a problem somehow or other. > > > > --- attrd/command.c - > > (snip) > > /* this only involves cluster nodes. */ > > if(v->nodeid == 0 && (v->is_remote == FALSE)) { > > if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) == > >0) { > > /* Create the name/id association */ > > crm_node_t *peer = crm_get_peer(v->nodeid, host); > > crm_trace("We know %s's node id now: %s", peer->uname, > >peer->uuid); > > if(election_state(writer) == election_won) { > > write_attributes(FALSE, TRUE); > > return; > > } > > } > > } > > This is for 5194 right? > > I'd expect that block to hit this clause though: > > } else if(mainloop_timer_running(a->timer)) { > crm_info("Write out of '%s' delayed: timer is running", a->id); > return; > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Problem] The "dampen" parameter of the attrd_updater command is ignored, and an attribute is updated.
Hi All,

The attrd_updater command ignores the "dampen" parameter and updates the attribute immediately.

Step1) Start one node.

[root@srv01 ~]# crm_mon -1 -Af
Last updated: Tue May 27 19:36:35 2014
Last change: Tue May 27 19:34:59 2014
Stack: corosync
Current DC: srv01 (3232238180) - partition WITHOUT quorum
Version: 1.1.11-f0f09b8
1 Nodes configured
0 Resources configured


Online: [ srv01 ]


Node Attributes:
* Node srv01:

Migration summary:
* Node srv01:

Step2) Update an attribute with the attrd_updater command.

[root@srv01 ~]# attrd_updater -n default_ping_set -U 500 -d 3000

Step3) The attribute is updated without waiting for the "dampen" interval.

[root@srv01 ~]# cibadmin -Q | grep ping_set
        ... name="default_ping_set" value="500"/>

The following code seems to be the cause:

--- attrd/command.c ---
(snip)
        /* this only involves cluster nodes. */
        if(v->nodeid == 0 && (v->is_remote == FALSE)) {
            if(crm_element_value_int(xml, F_ATTRD_HOST_ID, (int*)&v->nodeid) == 0) {
                /* Create the name/id association */
                crm_node_t *peer = crm_get_peer(v->nodeid, host);
                crm_trace("We know %s's node id now: %s", peer->uname, peer->uuid);
                if(election_state(writer) == election_won) {
                    write_attributes(FALSE, TRUE);
                    return;
                }
            }
        }

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Question] About control of colocation.(master-slave with primitive)
Hi Andrew, I registered a problem in Bugzilla. And I attached a file of crm_report. * http://bugs.clusterlabs.org/show_bug.cgi?id=5213 Best Regards, Hideo Yamauchi. --- On Thu, 2014/5/15, renayama19661...@ybb.ne.jp wrote: > Hi Andrew, > > > >> Your config looks reasonable... almost certainly a bug in the PE. > > >> Do you happen to have the relevant pengine input file available? > > > > > > Really? > > > > I would expect that: > > > > colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master A-master > > > > would only promote msPostgresql on a node where A-master was running. > > > > Is that not what you were wanting? > > Yes. I wanted it. > > However, this colocation does not come to be applied by handling of PE. > This is because role of msPostgresql is not decided when it calculates > placement of A-MASTER. > * In this case colocation seems to affect only the priority of the > Master/Slave resource. > > I think that this problem disappears if this calculation of the PE is revised. > > Best Regards, > Hideo Yamauchi. > > --- On Thu, 2014/5/15, Andrew Beekhof wrote: > > > > > On 15 May 2014, at 9:57 am, renayama19661...@ybb.ne.jp wrote: > > > > > Hi Andrew, > > > > > > Thank you for comments. > > > > > >>> We do not want to be promoted to Master in the node that primitive > > >>> resource does not start. > > >>> Is there the setting of colocation and order which are not promoted to > > >>> Master of the Master node? > > >> > > >> Your config looks reasonable... almost certainly a bug in the PE. > > >> Do you happen to have the relevant pengine input file available? > > > > > > Really? > > > > I would expect that: > > > > colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master A-master > > > > would only promote msPostgresql on a node where A-master was running. > > > > Is that not what you were wanting? > > > > > > > It was like right handling of PE as far as I confirmed a source code of > > > PM1.1. 
> > > I register this problem with Bugzilla and contact you. > > > > > > Best Regards, > > > Hideo Yamauchi. > > > > > > --- On Wed, 2014/5/14, Andrew Beekhof wrote: > > > > > >> > > >> On 13 May 2014, at 3:14 pm, renayama19661...@ybb.ne.jp wrote: > > >> > > >>> Hi All, > > >>> > > >>> We assume special resource constitution. > > >>> Master of master-slave depends on primitive resource for the > > >>> constitution. > > >>> > > >>> We performed the setting that Master stopped becoming it in Slave node > > >>> experimentally. > > >>> > > >>> > > >>> location rsc_location-msStateful-1 msPostgresql \ > > >>> rule $role="master" 200: #uname eq srv01 \ > > >>> rule $role="master" -INFINITY: #uname eq srv02 > > >>> > > >>> The Master resource depends on the primitive resource. > > >>> > > >>> colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master > > >>>A-master > > >>> > > >>> > > >>> Step1) Start Slave node. > > >>> --- > > >>> [root@srv02 ~]# crm_mon -1 -Af > > >>> Last updated: Tue May 13 22:28:12 2014 > > >>> Last change: Tue May 13 22:28:07 2014 > > >>> Stack: corosync > > >>> Current DC: srv02 (3232238190) - partition WITHOUT quorum > > >>> Version: 1.1.11-f0f09b8 > > >>> 1 Nodes configured > > >>> 3 Resources configured > > >>> > > >>> > > >>> Online: [ srv02 ] > > >>> > > >>> A-master (ocf::heartbeat:Dummy): Started srv02 > > >>> Master/Slave Set: msPostgresql [pgsql] > > >>> Slaves: [ srv02 ] > > >>> > > >>> Node Attributes: > > >>> * Node srv02: > > >>> + master-pgsql : 5 > > >>> > > >>> Migration summary: > > >>> * Node srv02: > > >>> --- > > >>> > > >>> Step2) Start Master node. 
> > >>> --- > > >>> [root@srv02 ~]# crm_mon -1 -Af > > >>> Last updated: Tue May 13 22:33:39 2014 > > >>> Last change: Tue May 13 22:28:07 2014 > > >>> Stack: corosync > > >>> Current DC: srv02 (3232238190) - partition with quorum > > >>> Version: 1.1.11-f0f09b8 > > >>> 2 Nodes configured > > >>> 3 Resources configured > > >>> > > >>> > > >>> Online: [ srv01 srv02 ] > > >>> > > >>> A-master (ocf::heartbeat:Dummy): Started srv02 > > >>> Master/Slave Set: msPostgresql [pgsql] > > >>> Masters: [ srv01 ] > > >>> Slaves: [ srv02 ] > > >>> > > >>> Node Attributes: > > >>> * Node srv01: > > >>> + master-pgsql : 10 > > >>> * Node srv02: > > >>> + master-pgsql : 5 > > >>> > > >>> Migration summary: > > >>> * Node srv02: > > >>> * Node srv01: > > >>> --- > > >>> > > >>> * The Master node that primitive node does not start becomes Master. > > >>> > > >>> > > >>> We do not want to be promoted to Master in the node that primitive > > >>> resource does not start. > > >>> Is
Re: [Pacemaker] [Problem][pacemaker1.0] The "probe" may not be carried out by difference in cib information of "probe".
Hi Andrew,

> Here we go:
>
> https://github.com/ClusterLabs/pacemaker-1.0/blob/master/README.md
>
> If any additional bugs are found in 1.0, we should create a new entry at
> bugs.clusterlabs.org, add it to the above README and as long as 1.1 is
> unaffected: close the bug as WONTFIX.

All right!

Many Thanks!
Hideo Yamauchi.

--- On Thu, 2014/5/15, Andrew Beekhof wrote:

>
> On 15 May 2014, at 9:54 am, renayama19661...@ybb.ne.jp wrote:
>
> > Hi Andrew,
> >
> >>> It is not necessary to revise it for Pacemaker 1.0 at all.
> >>
> >> Maybe we need to add KnownIssues.md to the repo for anyone thats slow to update.
> >> Are there any 1.0 bugs that really really need fixing or shall we move
> >> them all to the KnownIssues file?
> >
> > That's a good idea.
> > It will be a big help for users who are late in moving to PM1.1.
>
> Here we go:
>
> https://github.com/ClusterLabs/pacemaker-1.0/blob/master/README.md
>
> If any additional bugs are found in 1.0, we should create a new entry at
> bugs.clusterlabs.org, add it to the above README and as long as 1.1 is
> unaffected: close the bug as WONTFIX.
Re: [Pacemaker] [Question] About control of colocation.(master-slave with primitive)
Hi Andrew, > >> Your config looks reasonable... almost certainly a bug in the PE. > >> Do you happen to have the relevant pengine input file available? > > > > Really? > > I would expect that: > > colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master A-master > > would only promote msPostgresql on a node where A-master was running. > > Is that not what you were wanting? Yes. I wanted it. However, this colocation does not come to be applied by handling of PE. This is because role of msPostgresql is not decided when it calculates placement of A-MASTER. * In this case colocation seems to affect only the priority of the Master/Slave resource. I think that this problem disappears if this calculation of the PE is revised. Best Regards, Hideo Yamauchi. --- On Thu, 2014/5/15, Andrew Beekhof wrote: > > On 15 May 2014, at 9:57 am, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comments. > > > >>> We do not want to be promoted to Master in the node that primitive > >>> resource does not start. > >>> Is there the setting of colocation and order which are not promoted to > >>> Master of the Master node? > >> > >> Your config looks reasonable... almost certainly a bug in the PE. > >> Do you happen to have the relevant pengine input file available? > > > > Really? > > I would expect that: > > colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master A-master > > would only promote msPostgresql on a node where A-master was running. > > Is that not what you were wanting? > > > > It was like right handling of PE as far as I confirmed a source code of > > PM1.1. > > I register this problem with Bugzilla and contact you. > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Wed, 2014/5/14, Andrew Beekhof wrote: > > > >> > >> On 13 May 2014, at 3:14 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi All, > >>> > >>> We assume special resource constitution. > >>> Master of master-slave depends on primitive resource for the constitution. 
> >>> > >>> We performed the setting that Master stopped becoming it in Slave node > >>> experimentally. > >>> > >>> > >>> location rsc_location-msStateful-1 msPostgresql \ > >>> rule $role="master" 200: #uname eq srv01 \ > >>> rule $role="master" -INFINITY: #uname eq srv02 > >>> > >>> The Master resource depends on the primitive resource. > >>> > >>> colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master > >>>A-master > >>> > >>> > >>> Step1) Start Slave node. > >>> --- > >>> [root@srv02 ~]# crm_mon -1 -Af > >>> Last updated: Tue May 13 22:28:12 2014 > >>> Last change: Tue May 13 22:28:07 2014 > >>> Stack: corosync > >>> Current DC: srv02 (3232238190) - partition WITHOUT quorum > >>> Version: 1.1.11-f0f09b8 > >>> 1 Nodes configured > >>> 3 Resources configured > >>> > >>> > >>> Online: [ srv02 ] > >>> > >>> A-master (ocf::heartbeat:Dummy): Started srv02 > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Slaves: [ srv02 ] > >>> > >>> Node Attributes: > >>> * Node srv02: > >>> + master-pgsql : 5 > >>> > >>> Migration summary: > >>> * Node srv02: > >>> --- > >>> > >>> Step2) Start Master node. > >>> --- > >>> [root@srv02 ~]# crm_mon -1 -Af > >>> Last updated: Tue May 13 22:33:39 2014 > >>> Last change: Tue May 13 22:28:07 2014 > >>> Stack: corosync > >>> Current DC: srv02 (3232238190) - partition with quorum > >>> Version: 1.1.11-f0f09b8 > >>> 2 Nodes configured > >>> 3 Resources configured > >>> > >>> > >>> Online: [ srv01 srv02 ] > >>> > >>> A-master (ocf::heartbeat:Dummy): Started srv02 > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 ] > >>> Slaves: [ srv02 ] > >>> > >>> Node Attributes: > >>> * Node srv01: > >>> + master-pgsql : 10 > >>> * Node srv02: > >>> + master-pgsql : 5 > >>> > >>> Migration summary: > >>> * Node srv02: > >>> * Node srv01: > >>> --- > >>> > >>> * The Master node that primitive node does not start becomes Master. 
> >>> > >>> > >>> We do not want to be promoted to Master in the node that primitive > >>> resource does not start. > >>> Is there the setting of colocation and order which are not promoted to > >>> Master of the Master node? > >> > >> Your config looks reasonable... almost certainly a bug in the PE. > >> Do you happen to have the relevant pengine input file available? > >> > >>> > >>> I think that one method includes the next method. > >>> * I handle it to update an attribute when primitive resource starts. > >>> * I write an attribute in the condition to be promoted to Master. > >>> > >>> > >>> In addition, we are often confused about control
Re: [Pacemaker] [Question] About control of colocation.(master-slave with primitive)
Hi Andrew, Thank you for comments. > > We do not want to be promoted to Master in the node that primitive resource > > does not start. > > Is there the setting of colocation and order which are not promoted to > > Master of the Master node? > > Your config looks reasonable... almost certainly a bug in the PE. > Do you happen to have the relevant pengine input file available? Really? It was like right handling of PE as far as I confirmed a source code of PM1.1. I register this problem with Bugzilla and contact you. Best Regards, Hideo Yamauchi. --- On Wed, 2014/5/14, Andrew Beekhof wrote: > > On 13 May 2014, at 3:14 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > We assume special resource constitution. > > Master of master-slave depends on primitive resource for the constitution. > > > > We performed the setting that Master stopped becoming it in Slave node > > experimentally. > > > > > > location rsc_location-msStateful-1 msPostgresql \ > > rule $role="master" 200: #uname eq srv01 \ > > rule $role="master" -INFINITY: #uname eq srv02 > > > > The Master resource depends on the primitive resource. > > > > colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master A-master > > > > > > Step1) Start Slave node. > > --- > > [root@srv02 ~]# crm_mon -1 -Af > > Last updated: Tue May 13 22:28:12 2014 > > Last change: Tue May 13 22:28:07 2014 > > Stack: corosync > > Current DC: srv02 (3232238190) - partition WITHOUT quorum > > Version: 1.1.11-f0f09b8 > > 1 Nodes configured > > 3 Resources configured > > > > > > Online: [ srv02 ] > > > > A-master (ocf::heartbeat:Dummy): Started srv02 > > Master/Slave Set: msPostgresql [pgsql] > > Slaves: [ srv02 ] > > > > Node Attributes: > > * Node srv02: > > + master-pgsql : 5 > > > > Migration summary: > > * Node srv02: > > --- > > > > Step2) Start Master node. 
> > --- > > [root@srv02 ~]# crm_mon -1 -Af > > Last updated: Tue May 13 22:33:39 2014 > > Last change: Tue May 13 22:28:07 2014 > > Stack: corosync > > Current DC: srv02 (3232238190) - partition with quorum > > Version: 1.1.11-f0f09b8 > > 2 Nodes configured > > 3 Resources configured > > > > > > Online: [ srv01 srv02 ] > > > > A-master (ocf::heartbeat:Dummy): Started srv02 > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 ] > > Slaves: [ srv02 ] > > > > Node Attributes: > > * Node srv01: > > + master-pgsql : 10 > > * Node srv02: > > + master-pgsql : 5 > > > > Migration summary: > > * Node srv02: > > * Node srv01: > > --- > > > > * The Master node that primitive node does not start becomes Master. > > > > > > We do not want to be promoted to Master in the node that primitive resource > > does not start. > > Is there the setting of colocation and order which are not promoted to > > Master of the Master node? > > Your config looks reasonable... almost certainly a bug in the PE. > Do you happen to have the relevant pengine input file available? > > > > > I think that one method includes the next method. > > * I handle it to update an attribute when primitive resource starts. > > * I write an attribute in the condition to be promoted to Master. > > > > > > In addition, we are often confused about control of colotaion and order. > > It is in particular the control between primitive/group resource and > > clone/master-slave resources. > > Will you describe detailed contents in a document? > > > > > > Best Regards, > > Hideo Yamauchi. 
> > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem][pacemaker1.0] The "probe" may not be carried out by difference in cib information of "probe".
Hi Andrew,

> > It is not necessary to revise it for Pacemaker 1.0 at all.
>
> Maybe we need to add KnownIssues.md to the repo for anyone thats slow to update.
> Are there any 1.0 bugs that really really need fixing or shall we move them
> all to the KnownIssues file?

That's a good idea.
It will be a big help for users who are late in moving to PM1.1.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem][pacemaker1.0] The "probe" may not be carried out by difference in cib information of "probe".
Hi Andrew,

Thank you for your comments.

> Do you guys have any timeframe for moving away from 1.0.x?
> The 1.1 series is over 4 years old now and quite usable :-)
>
> There is really a (low) limit to how much effort I can put into support for it.

We are gradually moving from Pacemaker 1.0 to Pacemaker 1.1 as well.
I simply wanted to record that this problem exists in Pacemaker 1.0, so I registered and reported it.
(Users who are late in moving to Pacemaker 1.1 may encounter the same problem.)
It is not necessary to revise it for Pacemaker 1.0 at all.

Best Regards,
Hideo Yamauchi.
[Pacemaker] [Question] About control of colocation.(master-slave with primitive)
Hi All,

We have an unusual resource configuration: the Master role of a master/slave resource depends on a primitive resource.

As an experiment, we configured the cluster so that the Slave node cannot become Master:

location rsc_location-msStateful-1 msPostgresql \
    rule $role="master" 200: #uname eq srv01 \
    rule $role="master" -INFINITY: #uname eq srv02

The Master resource depends on the primitive resource:

colocation rsc_colocation-master-1 INFINITY: msPostgresql:Master A-master

Step1) Start the Slave node.
---
[root@srv02 ~]# crm_mon -1 -Af
Last updated: Tue May 13 22:28:12 2014
Last change: Tue May 13 22:28:07 2014
Stack: corosync
Current DC: srv02 (3232238190) - partition WITHOUT quorum
Version: 1.1.11-f0f09b8
1 Nodes configured
3 Resources configured


Online: [ srv02 ]

A-master (ocf::heartbeat:Dummy): Started srv02
Master/Slave Set: msPostgresql [pgsql]
    Slaves: [ srv02 ]

Node Attributes:
* Node srv02:
    + master-pgsql : 5

Migration summary:
* Node srv02:
---

Step2) Start the Master node.
---
[root@srv02 ~]# crm_mon -1 -Af
Last updated: Tue May 13 22:33:39 2014
Last change: Tue May 13 22:28:07 2014
Stack: corosync
Current DC: srv02 (3232238190) - partition with quorum
Version: 1.1.11-f0f09b8
2 Nodes configured
3 Resources configured


Online: [ srv01 srv02 ]

A-master (ocf::heartbeat:Dummy): Started srv02
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ srv01 ]
    Slaves: [ srv02 ]

Node Attributes:
* Node srv01:
    + master-pgsql : 10
* Node srv02:
    + master-pgsql : 5

Migration summary:
* Node srv02:
* Node srv01:
---

* The node whose primitive resource has not started becomes Master.

We do not want a node to be promoted to Master while the primitive resource is not running there.
Is there a colocation/order setting that prevents promotion to Master on such a node?

One possible workaround would be:
* Update a node attribute when the primitive resource starts.
* Make that attribute a condition for promotion to Master.
In addition, we are often confused about the behavior of colocation and order constraints, in particular between primitive/group resources and clone/master-slave resources.
Could you describe this in more detail in the documentation?

Best Regards,
Hideo Yamauchi.
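The attribute-based workaround suggested above (set a node attribute while the primitive is running, and gate promotion on it) could be expressed in crm shell roughly as follows. This is an untested sketch: the resource name "A-master-up" and attribute name "a_master_up" are hypothetical, and ocf:pacemaker:attribute (an agent that sets a node attribute while the resource is active) may not be available in older Pacemaker releases.

```
# Untested sketch; names are illustrative, not from this thread.
# Keep a flag attribute alive only where A-master runs.
primitive A-master-up ocf:pacemaker:attribute \
    params name="a_master_up" active_value="1" inactive_value="0"
colocation keep-flag-with-A INFINITY: A-master-up A-master
order A-then-flag INFINITY: A-master A-master-up
# Forbid promotion on nodes where the flag is absent or not "1".
location promote-only-with-A msPostgresql \
    rule $role="Master" -INFINITY: not_defined a_master_up or a_master_up ne 1
```

The not_defined clause matters: on a node where the attribute has never been set, a plain "a_master_up ne 1" test would not match, and the -INFINITY rule would not apply.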
Re: [Pacemaker] [Question] About "quorum-policy=freeze" and "promote".
Hi Andrew, > > Okay. > > I wish this problem is revised by the next release. > > crm_report? I confirmed a problem again in PM1.2-rc1 and registered in Bugzilla. * http://bugs.clusterlabs.org/show_bug.cgi?id=5212 Towards Bugzilla, I attached the crm_report file. Best Regards, Hideo Yamauchi. --- On Fri, 2014/5/9, Andrew Beekhof wrote: > > On 9 May 2014, at 2:05 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comment. > > > >>> Is it responsibility of the resource agent side to prevent a state of > >>> these plural Master? > >> > >> No. > >> > >> In this scenario, no nodes have quorum and therefor no additional > >> instances should have been promoted. Thats the definition of "freeze" :) > >> Even if one partition DID have quorum, no instances should have been > >> promoted without fencing occurring first. > > > > Okay. > > I wish this problem is revised by the next release. > > crm_report? > > > > > Many Thanks! > > Hideo Yamauchi. > > > > --- On Fri, 2014/5/9, Andrew Beekhof wrote: > > > >> > >> On 8 May 2014, at 1:37 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi All, > >>> > >>> I composed Master/Slave resource of three nodes that set > >>> quorum-policy="freeze". > >>> (I use Stateful in Master/Slave resource.) 
> >>> > >>> - > >>> Current DC: srv01 (3232238280) - partition with quorum > >>> Version: 1.1.11-830af67 > >>> 3 Nodes configured > >>> 9 Resources configured > >>> > >>> > >>> Online: [ srv01 srv02 srv03 ] > >>> > >>> Resource Group: grpStonith1 > >>> prmStonith1-1 (stonith:external/ssh): Started srv02 > >>> Resource Group: grpStonith2 > >>> prmStonith2-1 (stonith:external/ssh): Started srv01 > >>> Resource Group: grpStonith3 > >>> prmStonith3-1 (stonith:external/ssh): Started srv01 > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 ] > >>> Slaves: [ srv02 srv03 ] > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 srv02 srv03 ] > >>> - > >>> > >>> > >>> Master resource starts in all nodes when I interrupt the internal > >>> communication of all nodes. > >>> > >>> - > >>> Node srv02 (3232238290): UNCLEAN (offline) > >>> Node srv03 (3232238300): UNCLEAN (offline) > >>> Online: [ srv01 ] > >>> > >>> Resource Group: grpStonith1 > >>> prmStonith1-1 (stonith:external/ssh): Started srv02 > >>> Resource Group: grpStonith2 > >>> prmStonith2-1 (stonith:external/ssh): Started srv01 > >>> Resource Group: grpStonith3 > >>> prmStonith3-1 (stonith:external/ssh): Started srv01 > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 ] > >>> Slaves: [ srv02 srv03 ] > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 srv02 srv03 ] > >>> (snip) > >>> Node srv01 (3232238280): UNCLEAN (offline) > >>> Node srv03 (3232238300): UNCLEAN (offline) > >>> Online: [ srv02 ] > >>> > >>> Resource Group: grpStonith1 > >>> prmStonith1-1 (stonith:external/ssh): Started srv02 > >>> Resource Group: grpStonith2 > >>> prmStonith2-1 (stonith:external/ssh): Started srv01 > >>> Resource Group: grpStonith3 > >>> prmStonith3-1 (stonith:external/ssh): Started srv01 > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 srv02 ] > >>> Slaves: [ srv03 ] > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 srv02 srv03 ] > >>> (snip) > >>> Node 
srv01 (3232238280): UNCLEAN (offline) > >>> Node srv02 (3232238290): UNCLEAN (offline) > >>> Online: [ srv03 ] > >>> > >>> Resource Group: grpStonith1 > >>> prmStonith1-1 (stonith:external/ssh): Started srv02 > >>> Resource Group: grpStonith2 > >>> prmStonith2-1 (stonith:external/ssh): Started srv01 > >>> Resource Group: grpStonith3 > >>> prmStonith3-1 (stonith:external/ssh): Started srv01 > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 srv03 ] > >>> Slaves: [ srv02 ] > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 srv02 srv03 ] > >>> - > >>> > >>> I think even if the cluster loses Quorum, being "promote" the Master / > >>> Slave resource that's specification of Pacemaker. > >>> > >>> Is it responsibility of the resource agent side to prevent a state of > >>> these plural Master? > >> > >> No. > >> > >> In this scenario, no nodes have quorum and therefor no additional > >> instances should have been promoted. Thats the definition of "freeze" :) > >> Even if one partition DID have quorum, no instances should have been > >> promoted without fencing occurring first. > >> > >>> * I think that drbd-RA has those functions. > >>> * But, there is no function in Stateful-RA. > >>> * As an example, I think that the mechanism such as drbd is necessary by > >>> all means when I make a resource of Master/Slave newly. > >>
Re: [Pacemaker] [Question] About "quorum-policy=freeze" and "promote".
Hi Andrew, Thank you for comment. > > Is it responsibility of the resource agent side to prevent a state of these > > plural Master? > > No. > > In this scenario, no nodes have quorum and therefor no additional instances > should have been promoted. Thats the definition of "freeze" :) > Even if one partition DID have quorum, no instances should have been promoted > without fencing occurring first. Okay. I wish this problem is revised by the next release. Many Thanks! Hideo Yamauchi. --- On Fri, 2014/5/9, Andrew Beekhof wrote: > > On 8 May 2014, at 1:37 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > I composed Master/Slave resource of three nodes that set > > quorum-policy="freeze". > > (I use Stateful in Master/Slave resource.) > > > > - > > Current DC: srv01 (3232238280) - partition with quorum > > Version: 1.1.11-830af67 > > 3 Nodes configured > > 9 Resources configured > > > > > > Online: [ srv01 srv02 srv03 ] > > > > Resource Group: grpStonith1 > > prmStonith1-1 (stonith:external/ssh): Started srv02 > > Resource Group: grpStonith2 > > prmStonith2-1 (stonith:external/ssh): Started srv01 > > Resource Group: grpStonith3 > > prmStonith3-1 (stonith:external/ssh): Started srv01 > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 ] > > Slaves: [ srv02 srv03 ] > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 srv02 srv03 ] > > - > > > > > > Master resource starts in all nodes when I interrupt the internal > > communication of all nodes. 
> > > > - > > Node srv02 (3232238290): UNCLEAN (offline) > > Node srv03 (3232238300): UNCLEAN (offline) > > Online: [ srv01 ] > > > > Resource Group: grpStonith1 > > prmStonith1-1 (stonith:external/ssh): Started srv02 > > Resource Group: grpStonith2 > > prmStonith2-1 (stonith:external/ssh): Started srv01 > > Resource Group: grpStonith3 > > prmStonith3-1 (stonith:external/ssh): Started srv01 > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 ] > > Slaves: [ srv02 srv03 ] > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 srv02 srv03 ] > > (snip) > > Node srv01 (3232238280): UNCLEAN (offline) > > Node srv03 (3232238300): UNCLEAN (offline) > > Online: [ srv02 ] > > > > Resource Group: grpStonith1 > > prmStonith1-1 (stonith:external/ssh): Started srv02 > > Resource Group: grpStonith2 > > prmStonith2-1 (stonith:external/ssh): Started srv01 > > Resource Group: grpStonith3 > > prmStonith3-1 (stonith:external/ssh): Started srv01 > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 srv02 ] > > Slaves: [ srv03 ] > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 srv02 srv03 ] > > (snip) > > Node srv01 (3232238280): UNCLEAN (offline) > > Node srv02 (3232238290): UNCLEAN (offline) > > Online: [ srv03 ] > > > > Resource Group: grpStonith1 > > prmStonith1-1 (stonith:external/ssh): Started srv02 > > Resource Group: grpStonith2 > > prmStonith2-1 (stonith:external/ssh): Started srv01 > > Resource Group: grpStonith3 > > prmStonith3-1 (stonith:external/ssh): Started srv01 > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 srv03 ] > > Slaves: [ srv02 ] > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 srv02 srv03 ] > > - > > > > I think even if the cluster loses Quorum, being "promote" the Master / > > Slave resource that's specification of Pacemaker. > > > > Is it responsibility of the resource agent side to prevent a state of these > > plural Master? > > No. 
> > In this scenario, no nodes have quorum and therefor no additional instances > should have been promoted. Thats the definition of "freeze" :) > Even if one partition DID have quorum, no instances should have been promoted > without fencing occurring first. > > > * I think that drbd-RA has those functions. > > * But, there is no function in Stateful-RA. > > * As an example, I think that the mechanism such as drbd is necessary by > > all means when I make a resource of Master/Slave newly. > > > > Will my understanding be wrong? > > > > Best Regards, > > Hideo Yamauchi. > > > > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Problem][pacemaker1.0] The "probe" may not be carried out due to a difference in the "probe" cib information.
Hi All,

We found a problem when performing "clean up" of a Master/Slave resource in Pacemaker 1.0. When this problem occurs, the "probe" processing is not carried out.

I registered the problem in Bugzilla.

* http://bugs.clusterlabs.org/show_bug.cgi?id=5211

In addition, I described a "clean up" method that avoids the problem in Bugzilla, but this method may not be usable depending on the combination of resources.

I would like to request that the community fix this problem in Pacemaker 1.0.
* This problem appears to be fixed in Pacemaker 1.1 and does not seem to occur there.

Best Regards,
Hideo Yamauchi.
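For readers hitting the same symptom, the usual way to re-trigger a probe after a cleanup looks like the sketch below; the resource name is an assumption, and whether the probe actually runs afterwards is exactly what this bug affects on Pacemaker 1.0:

```
# clean up the failed Master/Slave resource; the cluster should then re-probe it
crm resource cleanup msPostgresql

# a successful probe appears as a monitor with interval=0 ("monitor_0") in the log
grep "monitor_0" /var/log/messages
```

According to the report above, Pacemaker 1.1 re-issues the probe as expected.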
Re: [Pacemaker] [Question] About "quorum-policy=freeze" and "promote".
Hi Emmanuel, > Why are you using ssh as stonith? i don't think the fencing is working > because your nodes are in unclean state No, STONITH is not carried out because all nodes lose quorum. This is right movement of Pacemaker. It is an example to use STONITH of ssh. Best Regards, Hideo Yamauchi. --- On Thu, 2014/5/8, emmanuel segura wrote: > > Why are you using ssh as stonith? i don't think the fencing is working > because your nodes are in unclean state > > > > > 2014-05-08 5:37 GMT+02:00 : > Hi All, > > I composed Master/Slave resource of three nodes that set > quorum-policy="freeze". > (I use Stateful in Master/Slave resource.) > > - > Current DC: srv01 (3232238280) - partition with quorum > Version: 1.1.11-830af67 > 3 Nodes configured > 9 Resources configured > > > Online: [ srv01 srv02 srv03 ] > > Resource Group: grpStonith1 > prmStonith1-1 (stonith:external/ssh): Started srv02 > Resource Group: grpStonith2 > prmStonith2-1 (stonith:external/ssh): Started srv01 > Resource Group: grpStonith3 > prmStonith3-1 (stonith:external/ssh): Started srv01 > Master/Slave Set: msPostgresql [pgsql] > Masters: [ srv01 ] > Slaves: [ srv02 srv03 ] > Clone Set: clnPingd [prmPingd] > Started: [ srv01 srv02 srv03 ] > - > > > Master resource starts in all nodes when I interrupt the internal > communication of all nodes. 
> > - > Node srv02 (3232238290): UNCLEAN (offline) > Node srv03 (3232238300): UNCLEAN (offline) > Online: [ srv01 ] > > Resource Group: grpStonith1 > prmStonith1-1 (stonith:external/ssh): Started srv02 > Resource Group: grpStonith2 > prmStonith2-1 (stonith:external/ssh): Started srv01 > Resource Group: grpStonith3 > prmStonith3-1 (stonith:external/ssh): Started srv01 > Master/Slave Set: msPostgresql [pgsql] > Masters: [ srv01 ] > Slaves: [ srv02 srv03 ] > Clone Set: clnPingd [prmPingd] > Started: [ srv01 srv02 srv03 ] > (snip) > Node srv01 (3232238280): UNCLEAN (offline) > Node srv03 (3232238300): UNCLEAN (offline) > Online: [ srv02 ] > > Resource Group: grpStonith1 > prmStonith1-1 (stonith:external/ssh): Started srv02 > Resource Group: grpStonith2 > prmStonith2-1 (stonith:external/ssh): Started srv01 > Resource Group: grpStonith3 > prmStonith3-1 (stonith:external/ssh): Started srv01 > Master/Slave Set: msPostgresql [pgsql] > Masters: [ srv01 srv02 ] > Slaves: [ srv03 ] > Clone Set: clnPingd [prmPingd] > Started: [ srv01 srv02 srv03 ] > (snip) > Node srv01 (3232238280): UNCLEAN (offline) > Node srv02 (3232238290): UNCLEAN (offline) > Online: [ srv03 ] > > Resource Group: grpStonith1 > prmStonith1-1 (stonith:external/ssh): Started srv02 > Resource Group: grpStonith2 > prmStonith2-1 (stonith:external/ssh): Started srv01 > Resource Group: grpStonith3 > prmStonith3-1 (stonith:external/ssh): Started srv01 > Master/Slave Set: msPostgresql [pgsql] > Masters: [ srv01 srv03 ] > Slaves: [ srv02 ] > Clone Set: clnPingd [prmPingd] > Started: [ srv01 srv02 srv03 ] > - > > I think even if the cluster loses Quorum, being "promote" the Master / Slave > resource that's specification of Pacemaker. > > Is it responsibility of the resource agent side to prevent a state of these > plural Master? > * I think that drbd-RA has those functions. > * But, there is no function in Stateful-RA. 
> * As an example, I think that the mechanism such as drbd is necessary by all > means when I make a resource of Master/Slave newly. > > Will my understanding be wrong? > > Best Regards, > Hideo Yamauchi. > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > > > > -- > esta es mi vida e me la vivo hasta que dios quiera ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
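Andrew's point is that with quorum-policy="freeze" and working fencing, no additional instance should be promoted before fencing completes. A minimal crmsh sketch of the relevant settings follows; the IPMI device and its parameters are assumptions, shown only because external/ssh is a test-only agent and not suitable for production fencing:

```
property no-quorum-policy="freeze" \
    stonith-enabled="true"

# replace external/ssh with a real power-based device, e.g. external/ipmi
primitive prmStonith1 stonith:external/ipmi \
    params hostname="srv01" ipaddr="192.168.0.101" userid="admin" passwd="secret" \
    op monitor interval="3600s" timeout="60s"
location loc-prmStonith1 prmStonith1 \
    rule -INFINITY: #uname eq srv01
```

With a device like this, a node that becomes UNCLEAN should be fenced before any surviving partition considers promotion.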
[Pacemaker] [Question] About "quorum-policy=freeze" and "promote".
Hi All,

I built a Master/Slave resource across three nodes with quorum-policy="freeze".
(The Master/Slave resource uses the Stateful RA.)

-
Current DC: srv01 (3232238280) - partition with quorum
Version: 1.1.11-830af67
3 Nodes configured
9 Resources configured

Online: [ srv01 srv02 srv03 ]

 Resource Group: grpStonith1
     prmStonith1-1 (stonith:external/ssh): Started srv02
 Resource Group: grpStonith2
     prmStonith2-1 (stonith:external/ssh): Started srv01
 Resource Group: grpStonith3
     prmStonith3-1 (stonith:external/ssh): Started srv01
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 ]
     Slaves: [ srv02 srv03 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 srv03 ]
-

A Master resource starts on every node when I interrupt the interconnect communication between all nodes.

-
Node srv02 (3232238290): UNCLEAN (offline)
Node srv03 (3232238300): UNCLEAN (offline)
Online: [ srv01 ]

 Resource Group: grpStonith1
     prmStonith1-1 (stonith:external/ssh): Started srv02
 Resource Group: grpStonith2
     prmStonith2-1 (stonith:external/ssh): Started srv01
 Resource Group: grpStonith3
     prmStonith3-1 (stonith:external/ssh): Started srv01
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 ]
     Slaves: [ srv02 srv03 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 srv03 ]
(snip)
Node srv01 (3232238280): UNCLEAN (offline)
Node srv03 (3232238300): UNCLEAN (offline)
Online: [ srv02 ]

 Resource Group: grpStonith1
     prmStonith1-1 (stonith:external/ssh): Started srv02
 Resource Group: grpStonith2
     prmStonith2-1 (stonith:external/ssh): Started srv01
 Resource Group: grpStonith3
     prmStonith3-1 (stonith:external/ssh): Started srv01
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 srv02 ]
     Slaves: [ srv03 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 srv03 ]
(snip)
Node srv01 (3232238280): UNCLEAN (offline)
Node srv02 (3232238290): UNCLEAN (offline)
Online: [ srv03 ]

 Resource Group: grpStonith1
     prmStonith1-1 (stonith:external/ssh): Started srv02
 Resource Group: grpStonith2
     prmStonith2-1
(stonith:external/ssh): Started srv01
 Resource Group: grpStonith3
     prmStonith3-1 (stonith:external/ssh): Started srv01
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 srv03 ]
     Slaves: [ srv02 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 srv03 ]
-

I think that promoting the Master/Slave resource even when the cluster has lost quorum is the specified behavior of Pacemaker.

Is it the responsibility of the resource agent to prevent this state with multiple Masters?
* I think the drbd RA has such a safeguard.
* But there is no such function in the Stateful RA.
* As an example, I think a mechanism like drbd's is necessary whenever a new Master/Slave resource agent is written.

Is my understanding wrong?

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] A timer that was not stopped is discarded.
Hi All, I made a patch. Please confirm the contents of the patch. If there is not a problem, please reflect it in github. Best Regards, Hideo Yamauchi. --- On Thu, 2014/2/20, renayama19661...@ybb.ne.jp wrote: > Hi All, > > The timer which is not stopped at the time of the stop of the monitor of the > master slave resource of the local node runs. > Therefore, warning to cancel outputs a timer when crmd handles the transition > that is in a new state. > > I confirm it in the next procedure. > > Step1) Constitute a cluster. > > [root@srv01 ~]# crm_mon -1 -Af > Last updated: Thu Feb 20 22:57:09 2014 > Last change: Thu Feb 20 22:56:32 2014 via cibadmin on srv01 > Stack: corosync > Current DC: srv01 (3232238180) - partition with quorum > Version: 1.1.10-c1a326d > 2 Nodes configured > 6 Resources configured > > > Online: [ srv01 srv02 ] > > vip-master (ocf::heartbeat:Dummy2): Started srv01 > vip-rep (ocf::heartbeat:Dummy): Started srv01 > Master/Slave Set: msPostgresql [pgsql] > Masters: [ srv01 ] > Slaves: [ srv02 ] > Clone Set: clnPingd [prmPingd] > Started: [ srv01 srv02 ] > > Node Attributes: > * Node srv01: > + default_ping_set : 100 > + master-pgsql : 10 > * Node srv02: > + default_ping_set : 100 > + master-pgsql : 5 > > Migration summary: > * Node srv01: > * Node srv02: > > Step2) Cause trouble. > [root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state > > Step3) Warning is displayed by log. 
> (snip) > Feb 20 22:57:46 srv01 crmd[12107]: notice: te_rsc_command: Initiating > action 5: cancel pgsql_cancel_9000 on srv01 (local) > Feb 20 22:57:46 srv01 lrmd[12104]: info: cancel_recurring_action: > Cancelling operation pgsql_monitor_9000 > Feb 20 22:57:46 srv01 crmd[12107]: info: match_graph_event: Action > pgsql_monitor_9000 (5) confirmed on srv01 (rc=0) > (snip) > Feb 20 22:57:46 srv01 pengine[12106]: info: LogActions: Leave > prmPingd:1#011(Started srv02)Feb 20 22:57:46 srv01 crmd[12107]: info: > do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE > [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] > Feb 20 22:57:46 srv01 crmd[12107]: warning: destroy_action: Cancelling timer > for action 5 (src=139) > (snip) > > The time-out monitoring with the timer thinks like an unnecessary at the time > of the stop of the monitor of the master slave resource of the local node. > > I registered these contents with Bugzilla. > > * http://bugs.clusterlabs.org/show_bug.cgi?id=5199 > > Best Regards, > Hideo Yamauchi. > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > trac2766.patch Description: Binary data ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Problem] A timer that was not stopped is discarded.
Hi All,

When the monitor of a master/slave resource on the local node is stopped, the timer for it is not stopped and keeps running.
As a result, crmd outputs a warning about cancelling the timer when it handles the transition to the new state.

I confirmed it with the following procedure.

Step1) Construct a cluster.

[root@srv01 ~]# crm_mon -1 -Af
Last updated: Thu Feb 20 22:57:09 2014
Last change: Thu Feb 20 22:56:32 2014 via cibadmin on srv01
Stack: corosync
Current DC: srv01 (3232238180) - partition with quorum
Version: 1.1.10-c1a326d
2 Nodes configured
6 Resources configured

Online: [ srv01 srv02 ]

 vip-master (ocf::heartbeat:Dummy2): Started srv01
 vip-rep (ocf::heartbeat:Dummy): Started srv01
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ srv01 ]
     Slaves: [ srv02 ]
 Clone Set: clnPingd [prmPingd]
     Started: [ srv01 srv02 ]

Node Attributes:
* Node srv01:
    + default_ping_set : 100
    + master-pgsql : 10
* Node srv02:
    + default_ping_set : 100
    + master-pgsql : 5

Migration summary:
* Node srv01:
* Node srv02:

Step2) Cause a failure.

[root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state

Step3) A warning is displayed in the log.
(snip)
Feb 20 22:57:46 srv01 crmd[12107]: notice: te_rsc_command: Initiating action 5: cancel pgsql_cancel_9000 on srv01 (local)
Feb 20 22:57:46 srv01 lrmd[12104]: info: cancel_recurring_action: Cancelling operation pgsql_monitor_9000
Feb 20 22:57:46 srv01 crmd[12107]: info: match_graph_event: Action pgsql_monitor_9000 (5) confirmed on srv01 (rc=0)
(snip)
Feb 20 22:57:46 srv01 pengine[12106]: info: LogActions: Leave prmPingd:1#011(Started srv02)
Feb 20 22:57:46 srv01 crmd[12107]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 20 22:57:46 srv01 crmd[12107]: warning: destroy_action: Cancelling timer for action 5 (src=139)
(snip)

I think the timeout-monitoring timer is unnecessary once the monitor of the local node's master/slave resource has been stopped.

I registered these details in Bugzilla.

* http://bugs.clusterlabs.org/show_bug.cgi?id=5199

Best Regards,
Hideo Yamauchi.
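The fault injection in Step2 works because ocf:heartbeat:Dummy "monitors" by checking for its state file; removing the file makes the next monitor return rc=7 (not running), which is what triggers the cancel/recovery sequence in the log. A simplified sketch of that logic (illustrative only; the real agent ships in the resource-agents package, and STATE_DIR here is a stand-in for /var/run/resource-agents):

```shell
# simplified sketch of the Dummy agent's monitor logic
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
STATE_DIR="${STATE_DIR:-/tmp}"
OCF_RESOURCE_INSTANCE="${OCF_RESOURCE_INSTANCE:-vip-master}"

dummy_monitor() {
    # the resource is "running" exactly when its state file exists
    if [ -f "${STATE_DIR}/Dummy-${OCF_RESOURCE_INSTANCE}.state" ]; then
        return "$OCF_SUCCESS"
    fi
    # state file removed -> monitor failure (rc=7), recovery begins
    return "$OCF_NOT_RUNNING"
}
```

Removing the state file by hand, as in Step2, is therefore a convenient way to simulate a monitor failure without touching a real service.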
Re: [Pacemaker] [Patch] Information of "Connectivity is lost" is not displayed
Hi Andrew, > > I wish it is displayed as follows. > > > > > > * Node srv01: > > + default_ping_set : 0 : Connectivity is > >lost > > Ah! https://github.com/beekhof/pacemaker/commit/5d51930 It was displayed definitely. Many Thanks! Hideo Yamauchi. --- On Wed, 2014/2/19, Andrew Beekhof wrote: > > On 19 Feb 2014, at 2:55 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comments. > > > >> So I'm confused as to what the problem is. > >> What are you expecting crm_mon to show? > > > > I wish it is displayed as follows. > > > > > > * Node srv01: > > + default_ping_set : 0 : Connectivity is > >lost > > Ah! https://github.com/beekhof/pacemaker/commit/5d51930 > > > > > Best Regards, > > Hideo Yamauchi. > > --- On Wed, 2014/2/19, Andrew Beekhof wrote: > > > >> > >> On 18 Feb 2014, at 2:38 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi Andrew, > >>> > >>> I attach the result of the cibadmin -Q command. > >> > >> > >> So I see: > >> > >> >>type="ping"> > >> > >> >>id="prmPingd-instance_attributes-name"/> > >> >>id="prmPingd-instance_attributes-host_list"/> > >> >>id="prmPingd-instance_attributes-multiplier"/> > >> >>id="prmPingd-instance_attributes-attempts"/> > >> >>id="prmPingd-instance_attributes-timeout"/> > >> > >> > >> The correct way to query those is as parameters, not as meta attributes. > >> Which is what lars' patch achieves. > >> > >> In your email I see: > >> > > * Node srv01: > > + default_ping_set : 0 > >> > >> > >> Which looks correct based on: > >> > >> >>name="default_ping_set" value="0"/> > >> > >> So I'm confused as to what the problem is. > >> What are you expecting crm_mon to show? > >> > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >>> > >>> --- On Tue, 2014/2/18, Andrew Beekhof wrote: > >>> > > On 18 Feb 2014, at 1:45 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comments. > > > >> can I see the config of yours that crm_mon is not displaying correctly? 
> > > > It is displayed as follows. > > I mean the raw xml. Can you attach it? > > > - > > [root@srv01 tmp]# crm_mon -1 -Af > > Last updated: Tue Feb 18 19:51:04 2014 > > Last change: Tue Feb 18 19:48:55 2014 via cibadmin on srv01 > > Stack: corosync > > Current DC: srv01 (3232238180) - partition WITHOUT quorum > > Version: 1.1.10-9d39a6b > > 1 Nodes configured > > 5 Resources configured > > > > > > Online: [ srv01 ] > > > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 ] > > > > Node Attributes: > > * Node srv01: > > + default_ping_set : 0 > > > > Migration summary: > > * Node srv01: > > > > - > > > > I uploaded log in the next place.(trac2781.zip) > > > > * > > https://skydrive.live.com/?cid=3A14D57622C66876&id=3A14D57622C66876%21117 > > > > Best Regards, > > Hideo Yamauchi. > > > > > > --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > > >> > >> On 18 Feb 2014, at 12:19 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi Andrew, > >>> > I'm confused... that patch seems to be the reverse of yours. > Are you saying that we need to undo Lars' one? > >>> > >>> No, I do not understand the meaning of the correction of Mr. Lars. > >> > >> > >> name, multiplier and host_list are all resource parameters, not meta > >> attributes. > >> so lars' patch should be correct. > >> > >> can I see the config of yours that crm_mon is not displaying correctly? > >> > >>> > >>> However, as now, crm_mon does not display a right attribute. > >>> Possibly did you not discuss the correction to put meta data in > >>> rsc-parameters with Mr. Lars? Or Mr. David? > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >>> --- On Tue, 2014/2/18, Andrew Beekhof wrote: > >>> > > On 17 Feb 2014, at 5:43 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > The next change was accomplished by Mr. Lars. > > > > https://github.com/ClusterLabs/pacemaker/commit/6a17c003b0167de9fe51d5330fb6e4f1b4ffe64c > > I'm confused... that patch seems to be the reverse of yours. 
> Are you saying that we need to undo Lars' one? > > > > > I may lack the correction of other parts which are not the patch > > which I sent. > > > > Best Regards, > > Hideo Yamau
Re: [Pacemaker] [Patch] Information of "Connectivity is lost" is not displayed
Hi Andrew, Thank you for comments. > So I'm confused as to what the problem is. > What are you expecting crm_mon to show? I wish it is displayed as follows. * Node srv01: + default_ping_set : 0 : Connectivity is lost Best Regards, Hideo Yamauchi. --- On Wed, 2014/2/19, Andrew Beekhof wrote: > > On 18 Feb 2014, at 2:38 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > I attach the result of the cibadmin -Q command. > > > So I see: > > > > id="prmPingd-instance_attributes-name"/> > id="prmPingd-instance_attributes-host_list"/> > id="prmPingd-instance_attributes-multiplier"/> > id="prmPingd-instance_attributes-attempts"/> > id="prmPingd-instance_attributes-timeout"/> > > > The correct way to query those is as parameters, not as meta attributes. > Which is what lars' patch achieves. > > In your email I see: > > >>> * Node srv01: > >>> + default_ping_set : 0 > > > Which looks correct based on: > > name="default_ping_set" value="0"/> > > So I'm confused as to what the problem is. > What are you expecting crm_mon to show? > > > > > Best Regards, > > Hideo Yamauchi. > > > > > > --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > > >> > >> On 18 Feb 2014, at 1:45 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi Andrew, > >>> > >>> Thank you for comments. > >>> > can I see the config of yours that crm_mon is not displaying correctly? > >>> > >>> It is displayed as follows. > >> > >> I mean the raw xml. Can you attach it? 
> >> > >>> - > >>> [root@srv01 tmp]# crm_mon -1 -Af > >>> Last updated: Tue Feb 18 19:51:04 2014 > >>> Last change: Tue Feb 18 19:48:55 2014 via cibadmin on srv01 > >>> Stack: corosync > >>> Current DC: srv01 (3232238180) - partition WITHOUT quorum > >>> Version: 1.1.10-9d39a6b > >>> 1 Nodes configured > >>> 5 Resources configured > >>> > >>> > >>> Online: [ srv01 ] > >>> > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 ] > >>> > >>> Node Attributes: > >>> * Node srv01: > >>> + default_ping_set : 0 > >>> > >>> Migration summary: > >>> * Node srv01: > >>> > >>> - > >>> > >>> I uploaded log in the next place.(trac2781.zip) > >>> > >>> * > >>> https://skydrive.live.com/?cid=3A14D57622C66876&id=3A14D57622C66876%21117 > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >>> > >>> --- On Tue, 2014/2/18, Andrew Beekhof wrote: > >>> > > On 18 Feb 2014, at 12:19 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > >> I'm confused... that patch seems to be the reverse of yours. > >> Are you saying that we need to undo Lars' one? > > > > No, I do not understand the meaning of the correction of Mr. Lars. > > > name, multiplier and host_list are all resource parameters, not meta > attributes. > so lars' patch should be correct. > > can I see the config of yours that crm_mon is not displaying correctly? > > > > > However, as now, crm_mon does not display a right attribute. > > Possibly did you not discuss the correction to put meta data in > > rsc-parameters with Mr. Lars? Or Mr. David? > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > > >> > >> On 17 Feb 2014, at 5:43 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi All, > >>> > >>> The next change was accomplished by Mr. Lars. > >>> > >>> https://github.com/ClusterLabs/pacemaker/commit/6a17c003b0167de9fe51d5330fb6e4f1b4ffe64c > >> > >> I'm confused... that patch seems to be the reverse of yours. > >> Are you saying that we need to undo Lars' one? 
> >> > >>> > >>> I may lack the correction of other parts which are not the patch > >>> which I sent. > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >>> --- On Mon, 2014/2/17, renayama19661...@ybb.ne.jp > >>> wrote: > >>> > Hi All, > > The crm_mon tool which is attached to Pacemaker1.1 seems to have a > problem. > I send a patch. > > Best Regards, > Hideo Yamauchi. > >>> > >>> ___ > >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >>> > >>> Project Home: http://www.clusterlabs.org > >>> Getting started: > >>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >>> Bugs: http://bugs.clusterlabs.org > >> > >> > > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org >
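As Andrew points out in this thread, name, host_list and multiplier are instance parameters of the ping resource, not meta attributes, so they must be given under params. A minimal crmsh sketch (the host_list address and intervals are assumptions):

```
primitive prmPingd ocf:pacemaker:ping \
    params name="default_ping_set" host_list="192.168.0.254" multiplier="100" \
    op monitor interval="10s" timeout="60s"
clone clnPingd prmPingd
```

With the values configured as parameters, crm_mon -Af can interpret the default_ping_set attribute and annotate it (e.g. "Connectivity is lost" when the value drops to 0), which is the display Hideo expected.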
Re: [Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)
Hi Andrew, > I'll follow up on the bug. Thanks! Hideo Yamauch. --- On Wed, 2014/2/19, Andrew Beekhof wrote: > I'll follow up on the bug. > > On 19 Feb 2014, at 10:55 am, renayama19661...@ybb.ne.jp wrote: > > > Hi David, > > > > Thank you for comments. > > > >> You have resource-stickiness=INFINITY, this is what is preventing the > >> failover from occurring. Set resource-stickiness=1 or 0 and the failover > >> should occur. > >> > > > > However, the resource moves by a calculation of the next state transition. > > By a calculation of the first trouble, can it not travel the resource? > > > > In addition, the resource moves when the resource deletes next colocation. > > > > colocation rsc_colocation-master-3 INFINITY: vip-rep msPostgresql:Master > > > > There is the problem with handling of colocation of some Pacemaker? > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Wed, 2014/2/19, David Vossel wrote: > > > >> > >> - Original Message - > >>> From: renayama19661...@ybb.ne.jp > >>> To: "PaceMaker-ML" > >>> Sent: Monday, February 17, 2014 7:06:53 PM > >>> Subject: [Pacemaker] [Problem] Fail-over is delayed.(State transition is > >>> not calculated.) > >>> > >>> Hi All, > >>> > >>> I confirmed movement at the time of the trouble in one of Master/Slave in > >>> Pacemaker1.1.11. > >>> > >>> - > >>> > >>> Step1) Constitute a cluster. 
> >>> > >>> [root@srv01 ~]# crm_mon -1 -Af > >>> Last updated: Tue Feb 18 18:07:24 2014 > >>> Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 > >>> Stack: corosync > >>> Current DC: srv01 (3232238180) - partition with quorum > >>> Version: 1.1.10-9d39a6b > >>> 2 Nodes configured > >>> 6 Resources configured > >>> > >>> > >>> Online: [ srv01 srv02 ] > >>> > >>> vip-master (ocf::heartbeat:Dummy): Started srv01 > >>> vip-rep (ocf::heartbeat:Dummy): Started srv01 > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 ] > >>> Slaves: [ srv02 ] > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 srv02 ] > >>> > >>> Node Attributes: > >>> * Node srv01: > >>> + default_ping_set : 100 > >>> + master-pgsql : 10 > >>> * Node srv02: > >>> + default_ping_set : 100 > >>> + master-pgsql : 5 > >>> > >>> Migration summary: > >>> * Node srv01: > >>> * Node srv02: > >>> > >>> Step2) Monitor error in vip-master. > >>> > >>> [root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state > >>> > >>> [root@srv01 ~]# crm_mon -1 -Af > >>> Last updated: Tue Feb 18 18:07:58 2014 > >>> Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 > >>> Stack: corosync > >>> Current DC: srv01 (3232238180) - partition with quorum > >>> Version: 1.1.10-9d39a6b > >>> 2 Nodes configured > >>> 6 Resources configured > >>> > >>> > >>> Online: [ srv01 srv02 ] > >>> > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 ] > >>> Slaves: [ srv02 ] > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 srv02 ] > >>> > >>> Node Attributes: > >>> * Node srv01: > >>> + default_ping_set : 100 > >>> + master-pgsql : 10 > >>> * Node srv02: > >>> + default_ping_set : 100 > >>> + master-pgsql : 5 > >>> > >>> Migration summary: > >>> * Node srv01: > >>> vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb > >>>18 > >>> 18:07:50 2014' > >>> * Node srv02: > >>> > >>> Failed actions: > >>> vip-master_monitor_1 on srv01 'not running' (7): call=30, > >>> 
status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', > >>>queued=0ms, > >>> exec=0ms > >>> - > >>> > >>> However, the resource does not fail-over. > >>> > >>> But, fail-over is calculated when I check cib in crm_simulate at this > >>> point > >>> in time. > >>> > >>> - > >>> [root@srv01 ~]# crm_simulate -L -s > >>> > >>> Current cluster status: > >>> Online: [ srv01 srv02 ] > >>> > >>> vip-master (ocf::heartbeat:Dummy): Stopped > >>> vip-rep (ocf::heartbeat:Dummy): Stopped > >>> Master/Slave Set: msPostgresql [pgsql] > >>> Masters: [ srv01 ] > >>> Slaves: [ srv02 ] > >>> Clone Set: clnPingd [prmPingd] > >>> Started: [ srv01 srv02 ] > >>> > >>> Allocation scores: > >>> clone_color: clnPingd allocation score on srv01: 0 > >>> clone_color: clnPingd allocation score on srv02: 0 > >>> clone_color: prmPingd:0 allocation score on srv01: INFINITY > >>> clone_color: prmPingd:0 allocation score on srv02: 0 > >>> clone_color: prmPingd:1 allocation score on srv01: 0 > >>> clone_color: prmPingd:1 allocation score on srv02: INFINITY > >>> native_color: prmPingd:0 allocation score on srv01: INFINITY > >>> native_color: prmPingd:0 allocation scor
Re: [Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)
Hi David, Thank you for your comments. > You have resource-stickiness=INFINITY, this is what is preventing the > failover from occurring. Set resource-stickiness=1 or 0 and the failover > should occur. > However, the resource does move when the next state transition is calculated. Can the resource not be moved by the calculation triggered by the first failure? In addition, the resource moves when I delete the following colocation. colocation rsc_colocation-master-3 INFINITY: vip-rep msPostgresql:Master Is there a problem with Pacemaker's handling of this colocation? Best Regards, Hideo Yamauchi. --- On Wed, 2014/2/19, David Vossel wrote: > > - Original Message - > > From: renayama19661...@ybb.ne.jp > > To: "PaceMaker-ML" > > Sent: Monday, February 17, 2014 7:06:53 PM > > Subject: [Pacemaker] [Problem] Fail-over is delayed.(State transition is > > not calculated.) > > > > Hi All, > > > > I confirmed movement at the time of the trouble in one of Master/Slave in > > Pacemaker1.1.11. > > > > - > > > > Step1) Constitute a cluster. > > > > [root@srv01 ~]# crm_mon -1 -Af > > Last updated: Tue Feb 18 18:07:24 2014 > > Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 > > Stack: corosync > > Current DC: srv01 (3232238180) - partition with quorum > > Version: 1.1.10-9d39a6b > > 2 Nodes configured > > 6 Resources configured > > > > > > Online: [ srv01 srv02 ] > > > > vip-master (ocf::heartbeat:Dummy): Started srv01 > > vip-rep (ocf::heartbeat:Dummy): Started srv01 > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 ] > > Slaves: [ srv02 ] > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 srv02 ] > > > > Node Attributes: > > * Node srv01: > > + default_ping_set : 100 > > + master-pgsql : 10 > > * Node srv02: > > + default_ping_set : 100 > > + master-pgsql : 5 > > > > Migration summary: > > * Node srv01: > > * Node srv02: > > > > Step2) Monitor error in vip-master. 
> > > > [root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state > > > > [root@srv01 ~]# crm_mon -1 -Af > > Last updated: Tue Feb 18 18:07:58 2014 > > Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 > > Stack: corosync > > Current DC: srv01 (3232238180) - partition with quorum > > Version: 1.1.10-9d39a6b > > 2 Nodes configured > > 6 Resources configured > > > > > > Online: [ srv01 srv02 ] > > > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 ] > > Slaves: [ srv02 ] > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 srv02 ] > > > > Node Attributes: > > * Node srv01: > > + default_ping_set : 100 > > + master-pgsql : 10 > > * Node srv02: > > + default_ping_set : 100 > > + master-pgsql : 5 > > > > Migration summary: > > * Node srv01: > > vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18 > > 18:07:50 2014' > > * Node srv02: > > > > Failed actions: > > vip-master_monitor_1 on srv01 'not running' (7): call=30, > > status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms, > > exec=0ms > > - > > > > However, the resource does not fail-over. > > > > But, fail-over is calculated when I check cib in crm_simulate at this point > > in time. 
> > > > - > > [root@srv01 ~]# crm_simulate -L -s > > > > Current cluster status: > > Online: [ srv01 srv02 ] > > > > vip-master (ocf::heartbeat:Dummy): Stopped > > vip-rep (ocf::heartbeat:Dummy): Stopped > > Master/Slave Set: msPostgresql [pgsql] > > Masters: [ srv01 ] > > Slaves: [ srv02 ] > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 srv02 ] > > > > Allocation scores: > > clone_color: clnPingd allocation score on srv01: 0 > > clone_color: clnPingd allocation score on srv02: 0 > > clone_color: prmPingd:0 allocation score on srv01: INFINITY > > clone_color: prmPingd:0 allocation score on srv02: 0 > > clone_color: prmPingd:1 allocation score on srv01: 0 > > clone_color: prmPingd:1 allocation score on srv02: INFINITY > > native_color: prmPingd:0 allocation score on srv01: INFINITY > > native_color: prmPingd:0 allocation score on srv02: 0 > > native_color: prmPingd:1 allocation score on srv01: -INFINITY > > native_color: prmPingd:1 allocation score on srv02: INFINITY > > clone_color: msPostgresql allocation score on srv01: 0 > > clone_color: msPostgresql allocation score on srv02: 0 > > clone_color: pgsql:0 allocation score on srv01: INFINITY > > clone_color: pgsql:0 allocation score on srv02: 0 > > clone_color: pgsql:1 allocation score on srv01: 0 > > clone_color: pgsql:1 allocation score on srv02: INFINITY > > native_color: pgsql:0 allocation score on srv01: INFINITY > > native_color: pgsql:0 allocation s
Re: [Pacemaker] [Patch] Information of "Connectivity is lost" is not displayed
Hi Andrew, I attach the result of the cibadmin -Q command. Best Regards, Hideo Yamauchi. --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > On 18 Feb 2014, at 1:45 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comments. > > > >> can I see the config of yours that crm_mon is not displaying correctly? > > > > It is displayed as follows. > > I mean the raw xml. Can you attach it? > > > - > > [root@srv01 tmp]# crm_mon -1 -Af > > Last updated: Tue Feb 18 19:51:04 2014 > > Last change: Tue Feb 18 19:48:55 2014 via cibadmin on srv01 > > Stack: corosync > > Current DC: srv01 (3232238180) - partition WITHOUT quorum > > Version: 1.1.10-9d39a6b > > 1 Nodes configured > > 5 Resources configured > > > > > > Online: [ srv01 ] > > > > Clone Set: clnPingd [prmPingd] > > Started: [ srv01 ] > > > > Node Attributes: > > * Node srv01: > > + default_ping_set : 0 > > > > Migration summary: > > * Node srv01: > > > > - > > > > I uploaded log in the next place.(trac2781.zip) > > > > * https://skydrive.live.com/?cid=3A14D57622C66876&id=3A14D57622C66876%21117 > > > > Best Regards, > > Hideo Yamauchi. > > > > > > --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > > >> > >> On 18 Feb 2014, at 12:19 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi Andrew, > >>> > I'm confused... that patch seems to be the reverse of yours. > Are you saying that we need to undo Lars' one? > >>> > >>> No, I do not understand the meaning of the correction of Mr. Lars. > >> > >> > >> name, multiplier and host_list are all resource parameters, not meta > >> attributes. > >> so lars' patch should be correct. > >> > >> can I see the config of yours that crm_mon is not displaying correctly? > >> > >>> > >>> However, as now, crm_mon does not display a right attribute. > >>> Possibly did you not discuss the correction to put meta data in > >>> rsc-parameters with Mr. Lars? Or Mr. David? > >>> > >>> Best Regards, > >>> Hideo Yamauchi. 
> >>> > >>> --- On Tue, 2014/2/18, Andrew Beekhof wrote: > >>> > > On 17 Feb 2014, at 5:43 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > The next change was accomplished by Mr. Lars. > > > > https://github.com/ClusterLabs/pacemaker/commit/6a17c003b0167de9fe51d5330fb6e4f1b4ffe64c > > I'm confused... that patch seems to be the reverse of yours. > Are you saying that we need to undo Lars' one? > > > > > I may lack the correction of other parts which are not the patch which > > I sent. > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Mon, 2014/2/17, renayama19661...@ybb.ne.jp > > wrote: > > > >> Hi All, > >> > >> The crm_mon tool which is attached to Pacemaker1.1 seems to have a > >> problem. > >> I send a patch. > >> > >> Best Regards, > >> Hideo Yamauchi. > > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > > >>> > >>> ___ > >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >>> > >>> Project Home: http://www.clusterlabs.org > >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >>> Bugs: http://bugs.clusterlabs.org > >> > >> > > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > >
Re: [Pacemaker] [Patch] Information of "Connectivity is lost" is not displayed
Hi Andrew, Thank you for comments. > can I see the config of yours that crm_mon is not displaying correctly? It is displayed as follows. - [root@srv01 tmp]# crm_mon -1 -Af Last updated: Tue Feb 18 19:51:04 2014 Last change: Tue Feb 18 19:48:55 2014 via cibadmin on srv01 Stack: corosync Current DC: srv01 (3232238180) - partition WITHOUT quorum Version: 1.1.10-9d39a6b 1 Nodes configured 5 Resources configured Online: [ srv01 ] Clone Set: clnPingd [prmPingd] Started: [ srv01 ] Node Attributes: * Node srv01: + default_ping_set : 0 Migration summary: * Node srv01: - I uploaded log in the next place.(trac2781.zip) * https://skydrive.live.com/?cid=3A14D57622C66876&id=3A14D57622C66876%21117 Best Regards, Hideo Yamauchi. --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > On 18 Feb 2014, at 12:19 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > >> I'm confused... that patch seems to be the reverse of yours. > >> Are you saying that we need to undo Lars' one? > > > > No, I do not understand the meaning of the correction of Mr. Lars. > > > name, multiplier and host_list are all resource parameters, not meta > attributes. > so lars' patch should be correct. > > can I see the config of yours that crm_mon is not displaying correctly? > > > > > However, as now, crm_mon does not display a right attribute. > > Possibly did you not discuss the correction to put meta data in > > rsc-parameters with Mr. Lars? Or Mr. David? > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > > >> > >> On 17 Feb 2014, at 5:43 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi All, > >>> > >>> The next change was accomplished by Mr. Lars. > >>> > >>> https://github.com/ClusterLabs/pacemaker/commit/6a17c003b0167de9fe51d5330fb6e4f1b4ffe64c > >> > >> I'm confused... that patch seems to be the reverse of yours. > >> Are you saying that we need to undo Lars' one? 
> >> > >>> > >>> I may lack the correction of other parts which are not the patch which I > >>> sent. > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >>> --- On Mon, 2014/2/17, renayama19661...@ybb.ne.jp > >>> wrote: > >>> > Hi All, > > The crm_mon tool which is attached to Pacemaker1.1 seems to have a > problem. > I send a patch. > > Best Regards, > Hideo Yamauchi. > >>> > >>> ___ > >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >>> > >>> Project Home: http://www.clusterlabs.org > >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >>> Bugs: http://bugs.clusterlabs.org > >> > >> > > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Patch] Information of "Connectivity is lost" is not displayed
Hi Andrew, > I'm confused... that patch seems to be the reverse of yours. > Are you saying that we need to undo Lars' one? No, I do not understand the meaning of the correction of Mr. Lars. However, as now, crm_mon does not display a right attribute. Possibly did you not discuss the correction to put meta data in rsc-parameters with Mr. Lars? Or Mr. David? Best Regards, Hideo Yamauchi. --- On Tue, 2014/2/18, Andrew Beekhof wrote: > > On 17 Feb 2014, at 5:43 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > The next change was accomplished by Mr. Lars. > > > > https://github.com/ClusterLabs/pacemaker/commit/6a17c003b0167de9fe51d5330fb6e4f1b4ffe64c > > I'm confused... that patch seems to be the reverse of yours. > Are you saying that we need to undo Lars' one? > > > > > I may lack the correction of other parts which are not the patch which I > > sent. > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Mon, 2014/2/17, renayama19661...@ybb.ne.jp > > wrote: > > > >> Hi All, > >> > >> The crm_mon tool which is attached to Pacemaker1.1 seems to have a problem. > >> I send a patch. > >> > >> Best Regards, > >> Hideo Yamauchi. > > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Problem] Fail-over is delayed. (State transition is not calculated.)
Hi All, I confirmed the behavior when a failure occurs on one side of a Master/Slave resource in Pacemaker 1.1.11. - Step1) Constitute a cluster. [root@srv01 ~]# crm_mon -1 -Af Last updated: Tue Feb 18 18:07:24 2014 Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 Stack: corosync Current DC: srv01 (3232238180) - partition with quorum Version: 1.1.10-9d39a6b 2 Nodes configured 6 Resources configured Online: [ srv01 srv02 ] vip-master (ocf::heartbeat:Dummy): Started srv01 vip-rep (ocf::heartbeat:Dummy): Started srv01 Master/Slave Set: msPostgresql [pgsql] Masters: [ srv01 ] Slaves: [ srv02 ] Clone Set: clnPingd [prmPingd] Started: [ srv01 srv02 ] Node Attributes: * Node srv01: + default_ping_set : 100 + master-pgsql : 10 * Node srv02: + default_ping_set : 100 + master-pgsql : 5 Migration summary: * Node srv01: * Node srv02: Step2) Monitor error in vip-master. [root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state [root@srv01 ~]# crm_mon -1 -Af Last updated: Tue Feb 18 18:07:58 2014 Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 Stack: corosync Current DC: srv01 (3232238180) - partition with quorum Version: 1.1.10-9d39a6b 2 Nodes configured 6 Resources configured Online: [ srv01 srv02 ] Master/Slave Set: msPostgresql [pgsql] Masters: [ srv01 ] Slaves: [ srv02 ] Clone Set: clnPingd [prmPingd] Started: [ srv01 srv02 ] Node Attributes: * Node srv01: + default_ping_set : 100 + master-pgsql : 10 * Node srv02: + default_ping_set : 100 + master-pgsql : 5 Migration summary: * Node srv01: vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18 18:07:50 2014' * Node srv02: Failed actions: vip-master_monitor_1 on srv01 'not running' (7): call=30, status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms, exec=0ms - However, the resource does not fail over. But a fail-over is calculated when I check the cib with crm_simulate at this point in time. 
- [root@srv01 ~]# crm_simulate -L -s Current cluster status: Online: [ srv01 srv02 ] vip-master (ocf::heartbeat:Dummy): Stopped vip-rep (ocf::heartbeat:Dummy): Stopped Master/Slave Set: msPostgresql [pgsql] Masters: [ srv01 ] Slaves: [ srv02 ] Clone Set: clnPingd [prmPingd] Started: [ srv01 srv02 ] Allocation scores: clone_color: clnPingd allocation score on srv01: 0 clone_color: clnPingd allocation score on srv02: 0 clone_color: prmPingd:0 allocation score on srv01: INFINITY clone_color: prmPingd:0 allocation score on srv02: 0 clone_color: prmPingd:1 allocation score on srv01: 0 clone_color: prmPingd:1 allocation score on srv02: INFINITY native_color: prmPingd:0 allocation score on srv01: INFINITY native_color: prmPingd:0 allocation score on srv02: 0 native_color: prmPingd:1 allocation score on srv01: -INFINITY native_color: prmPingd:1 allocation score on srv02: INFINITY clone_color: msPostgresql allocation score on srv01: 0 clone_color: msPostgresql allocation score on srv02: 0 clone_color: pgsql:0 allocation score on srv01: INFINITY clone_color: pgsql:0 allocation score on srv02: 0 clone_color: pgsql:1 allocation score on srv01: 0 clone_color: pgsql:1 allocation score on srv02: INFINITY native_color: pgsql:0 allocation score on srv01: INFINITY native_color: pgsql:0 allocation score on srv02: 0 native_color: pgsql:1 allocation score on srv01: -INFINITY native_color: pgsql:1 allocation score on srv02: INFINITY pgsql:1 promotion score on srv02: 5 pgsql:0 promotion score on srv01: 1 native_color: vip-master allocation score on srv01: -INFINITY native_color: vip-master allocation score on srv02: INFINITY native_color: vip-rep allocation score on srv01: -INFINITY native_color: vip-rep allocation score on srv02: INFINITY Transition Summary: * Start vip-master (srv02) * Start vip-rep (srv02) * Demote pgsql:0 (Master -> Slave srv01) * Promote pgsql:1 (Slave -> Master srv02) - In addition, the fail-over is calculated even when the "cluster-recheck-interval" timer fires. 
The fail-over is carried out even if I run cibadmin -B. - [root@srv01 ~]# cibadmin -B [root@srv01 ~]# crm_mon -1 -Af Last updated: Tue Feb 18 18:21:15 2014 Last change: Tue Feb 18 18:21:00 2014 via cibadmin on srv01 Stack: corosync Current DC: srv01 (3232238180) - partition with quorum Version: 1.1.10-9d39a6b 2 Nodes configured 6 Resources configured Online: [ srv01 srv02 ] vip-master (ocf::heartbeat:Dummy): Started srv02 vip-rep (ocf::heartbeat:Dummy): Started srv02 Master/Slave Set:
Re: [Pacemaker] [Patch] Information of "Connectivity is lost" is not displayed
Hi All, The following change was made by Mr. Lars. https://github.com/ClusterLabs/pacemaker/commit/6a17c003b0167de9fe51d5330fb6e4f1b4ffe64c The patch I sent may be missing corrections to other parts. Best Regards, Hideo Yamauchi. --- On Mon, 2014/2/17, renayama19661...@ybb.ne.jp wrote: > Hi All, > > The crm_mon tool which is attached to Pacemaker1.1 seems to have a problem. > I send a patch. > > Best Regards, > Hideo Yamauchi. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Patch] Information of "Connectivity is lost" is not displayed
Hi All, The crm_mon tool shipped with Pacemaker 1.1 seems to have a problem. I am sending a patch. Best Regards, Hideo Yamauchi. trac2781.patch Description: Binary data ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Question] About replacing in resource_set of the order limitation.
Hi Andrew, > >> Is this related to your email about symmetrical not being defaulted > >> consistently between colocate_rsc_sets() and unpack_colocation_set()? > > > > Yes. > > I think that a default is not handled well. > > I will not have any problem when "sequential" attribute is set in cib by > > all means. > > > > I think that I should revise processing when "sequential" attribute is not > > set. > > agreed. I've changed some occurrences locally but there may be more. All right! Many Thanks! Hideo Yamauchi. --- On Mon, 2014/2/17, Andrew Beekhof wrote: > > On 17 Feb 2014, at 12:47 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comments. > > > >> Is this related to your email about symmetrical not being defaulted > >> consistently between colocate_rsc_sets() and unpack_colocation_set()? > > > > Yes. > > I think that a default is not handled well. > > I will not have any problem when "sequential" attribute is set in cib by > > all means. > > > > I think that I should revise processing when "sequential" attribute is not > > set. > > agreed. I've changed some occurrences locally but there may be more. > > > > > Best Regards, > > Hideo Yamauchi. > > > >> > >> On 22 Jan 2014, at 3:05 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi All, > >>> > >>> My test seemed to include a mistake. > >>> It seems to be replaced by two limitation. > >>> > However, I think that symmetircal="false" is applied to all order > limitation in this. > (snip) > symmetrical="false"> > > > > > > ... > > > > (snip) > >>> > >>> > >>> >>>then="prmEx" symmetrical="false"> > >>> > >>> >>>symmetrical="true"> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> If my understanding includes a mistake, please point it out. > >>> > >>> Best Reagards, > >>> Hideo Yamauchi. > >>> > >>> --- On Fri, 2014/1/17, renayama19661...@ybb.ne.jp > >>> wrote: > >>> > Hi All, > > We confirm a function of resource_set. 
> > There were the resource of the group and the resource of the clone. > > (snip) > Stack: corosync > Current DC: srv01 (3232238180) - partition WITHOUT quorum > Version: 1.1.10-f2d0cbc > 1 Nodes configured > 7 Resources configured > > > Online: [ srv01 ] > > Resource Group: grpPg > A (ocf::heartbeat:Dummy): Started srv01 > B (ocf::heartbeat:Dummy): Started srv01 > C (ocf::heartbeat:Dummy): Started srv01 > D (ocf::heartbeat:Dummy): Started srv01 > E (ocf::heartbeat:Dummy): Started srv01 > F (ocf::heartbeat:Dummy): Started srv01 > Clone Set: clnPing [prmPing] > Started: [ srv01 ] > > Node Attributes: > * Node srv01: > + default_ping_set : 100 > > Migration summary: > * Node srv01: > > (snip) > > These have limitation showing next. > > (snip) > score="INFINITY" rsc="grpPg" with-rsc="clnPing"> > > then="grpPg" symmetrical="false"> > > (snip) > > > We tried that we rearranged a group in resource_set. > I think that I can rearrange the limitation of "colocation" as follows. > > (snip) > score="INFINITY"> > > > > ... > > > > (snip) > > How should I rearrange the limitation of "order" in resource_set? > > I thought that it was necessary to list two of the next, but a method to > express well was not found. > > * "symmetirical=true" is necessary between the resources that were a > group(A to F). > * "symmetirical=false" is necessary between the resource that was a > group(A to F) and the clone resources. > > I wrote it as follows. > However, I think that symmetircal="false" is applied to all order > limitation in this. > (snip) > symmetrical="false"> > > > > > > ... > > > > (snip) > > Best Reards, > Hideo Yamauchi. > > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started
Re: [Pacemaker] About the difference in handling of "sequential".
Hi Andrew, I found your correction. https://github.com/beekhof/pacemaker/commit/37ff51a0edba208e6240e812936717fffc941a41 Many Thanks! Hideo Yamauchi. --- On Wed, 2014/2/12, renayama19661...@ybb.ne.jp wrote: > Hi All, > > There is a difference between the two handlings of "sequential" in a colocation "resource_set". > > Is one of them a mistake? > > > static gboolean > unpack_colocation_set(xmlNode * set, int score, pe_working_set_t * data_set) > { > xmlNode *xml_rsc = NULL; > resource_t *with = NULL; > resource_t *resource = NULL; > const char *set_id = ID(set); > const char *role = crm_element_value(set, "role"); > const char *sequential = crm_element_value(set, "sequential"); > int local_score = score; > > const char *score_s = crm_element_value(set, XML_RULE_ATTR_SCORE); > > if (score_s) { > local_score = char2score(score_s); > } > > /* When "sequential" is not set, "sequential" is treated as TRUE. */ > > if (sequential != NULL && crm_is_true(sequential) == FALSE) { > return TRUE; > (snip) > static gboolean > colocate_rsc_sets(const char *id, xmlNode * set1, xmlNode * set2, int score, > pe_working_set_t * data_set) > { > xmlNode *xml_rsc = NULL; > resource_t *rsc_1 = NULL; > resource_t *rsc_2 = NULL; > > const char *role_1 = crm_element_value(set1, "role"); > const char *role_2 = crm_element_value(set2, "role"); > > const char *sequential_1 = crm_element_value(set1, "sequential"); > const char *sequential_2 = crm_element_value(set2, "sequential"); > > /* When "sequential" is not set, "sequential" is treated as FALSE. */ > > if (crm_is_true(sequential_1)) { > /* get the first one */ > for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = > __xml_next(xml_rsc)) { > if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, > TRUE)) { > EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); > break; > } > } > } > > if (crm_is_true(sequential_2)) { > /* get the last one */ > (snip) > > > > Best Regards, > Hideo Yamauchi. 
> > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Question] About replacing in resource_set of the order limitation.
Hi Andrew, Thank you for comments. > Is this related to your email about symmetrical not being defaulted > consistently between colocate_rsc_sets() and unpack_colocation_set()? Yes. I think that a default is not handled well. I will not have any problem when "sequential" attribute is set in cib by all means. I think that I should revise processing when "sequential" attribute is not set. Best Regards, Hideo Yamauchi. > > On 22 Jan 2014, at 3:05 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > My test seemed to include a mistake. > > It seems to be replaced by two limitation. > > > >> However, I think that symmetircal="false" is applied to all order > >> limitation in this. > >> (snip) > >> >>symmetrical="false"> > >> > >> > >> > >> > >> > >> ... > >> > >> > >> > >> (snip) > > > > > > >then="prmEx" symmetrical="false"> > > > > > > > > > > > > > > > > > > > > > > > > > > If my understanding includes a mistake, please point it out. > > > > Best Reagards, > > Hideo Yamauchi. > > > > --- On Fri, 2014/1/17, renayama19661...@ybb.ne.jp > > wrote: > > > >> Hi All, > >> > >> We confirm a function of resource_set. > >> > >> There were the resource of the group and the resource of the clone. > >> > >> (snip) > >> Stack: corosync > >> Current DC: srv01 (3232238180) - partition WITHOUT quorum > >> Version: 1.1.10-f2d0cbc > >> 1 Nodes configured > >> 7 Resources configured > >> > >> > >> Online: [ srv01 ] > >> > >> Resource Group: grpPg > >> A (ocf::heartbeat:Dummy): Started srv01 > >> B (ocf::heartbeat:Dummy): Started srv01 > >> C (ocf::heartbeat:Dummy): Started srv01 > >> D (ocf::heartbeat:Dummy): Started srv01 > >> E (ocf::heartbeat:Dummy): Started srv01 > >> F (ocf::heartbeat:Dummy): Started srv01 > >> Clone Set: clnPing [prmPing] > >> Started: [ srv01 ] > >> > >> Node Attributes: > >> * Node srv01: > >> + default_ping_set : 100 > >> > >> Migration summary: > >> * Node srv01: > >> > >> (snip) > >> > >> These have limitation showing next. 
> >> > >> (snip) > >> >>rsc="grpPg" with-rsc="clnPing"> > >> > >> >>then="grpPg" symmetrical="false"> > >> > >> (snip) > >> > >> > >> We tried that we rearranged a group in resource_set. > >> I think that I can rearrange the limitation of "colocation" as follows. > >> > >> (snip) > >> > >> > >> > >> > >> ... > >> > >> > >> > >> (snip) > >> > >> How should I rearrange the limitation of "order" in resource_set? > >> > >> I thought that it was necessary to list two of the next, but a method to > >> express well was not found. > >> > >> * "symmetirical=true" is necessary between the resources that were a > >> group(A to F). > >> * "symmetirical=false" is necessary between the resource that was a > >> group(A to F) and the clone resources. > >> > >> I wrote it as follows. > >> However, I think that symmetircal="false" is applied to all order > >> limitation in this. > >> (snip) > >> >>symmetrical="false"> > >> > >> > >> > >> > >> > >> ... > >> > >> > >> > >> (snip) > >> > >> Best Reards, > >> Hideo Yamauchi. > >> > >> > >> > >> ___ > >> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >> > >> Project Home: http://www.clusterlabs.org > >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >> Bugs: http://bugs.clusterlabs.org > >> > > > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Question:crmsh] About a setting method of sequential=true in crmsh.
Hi Kristoffer, Thank you for comments. > > But the next information appeared when I put crm. > > Does this last message not have any problem? > > > > --- > > [root@srv01 ~]# crm configure load update > > db2-resource_set_0207.crm WARNING: pgsql: action monitor not > > advertised in meta-data, it may not be supported by the RA WARNING: > > pgsql: action notify not advertised in meta-data, it may not be > > supported by the RA WARNING: pgsql: action demote not advertised in > > meta-data, it may not be supported by the RA WARNING: pgsql: action > > promote not advertised in meta-data, it may not be supported by the > > RA INFO: object rsc_colocation-master cannot be represented in the > > CLI notation --- > > It seems that there is some problem parsing the pgsql RA, which > I suspect is the underlying cause that makes crmsh display the > constraint as XML. > > Which version of resource-agents is installed? On my test system, I > have version resource-agents-3.9.5-70.2.x86_64. However, the version > installed by centOS 6.5 seems to be a very old version, > resource-agents-3.9.2-40.el6_5.6.x86_64. I use resource-agent3.9.5, too. I send it to savannah.nongnu.org if I still have a problem. > It would be very helpful if you could file an issue for this problem at > > https://savannah.nongnu.org/bugs/?group=crmsh > > and also attach your configuration and an hb_report or crm_report, > thank you. > > I have also implemented a fix for the problem discovered by Mr. Inoue. > His original email was unfortunately missed at the time. > The fix is in changeset 71841e4559cf. > > Thank you, I told the uptake of the patch to Mr Inoue. I confirmed movement in latest crmsh-364c59ee0612. Most problems were solved somehow or other. However, the problem that colocation is displayed in xml seems to still remain after having sent crm file. 
(snip) colocation rsc_colocation-master INFINITY: [ vip-master vip-rep sequential=true ] [ msPostgresql:Master sequential=true ] (snip) #crm(live)configure# show (snip) xml \ \ \ \ \ \ \ \ (snip) I send this problem to savannah.nongnu.org. Many Thanks! Hideo Yamauchi. > > -- > // Kristoffer Grönlund > // kgronl...@suse.com > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Question:crmsh] About a setting method of sequential=true in crmsh.
Hi Kristoffer,

Thank you for your comments.

With crmsh-7f620e736895.tar.gz, "make install" completed successfully, and I can now set the sequential attribute correctly. The sequential attribute does become true.

---
(snip)
colocation rsc_colocation-master INFINITY: [ vip-master vip-rep sequential=true ] [ msPostgresql:Master sequential=true ]
(snip)
(snip)
---

But the following message appeared when I ran crm. Is this last message a problem?

---
[root@srv01 ~]# crm configure load update db2-resource_set_0207.crm
WARNING: pgsql: action monitor not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action notify not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action demote not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action promote not advertised in meta-data, it may not be supported by the RA
INFO: object rsc_colocation-master cannot be represented in the CLI notation
---

In addition, when I check the configuration with "crm configure show", the colocation is displayed as XML and the sequential attribute disappears. Is this result also a problem?

---
crm configure show
(snip)
xml \ \ \ \ \ \ \ \
(snip)
---

In addition, specifying false for the symmetrical attribute of an order constraint causes an error.
---
(snip)
### Resource Order ###
order rsc_order-clnPingd-msPostgresql-1 0: clnPingd msPostgresql symmetrical=false
order test-order-1 0: ( vip-master vip-rep ) symmetrical=false
order test-order-2 INFINITY: msPostgresql:promote vip-master:start symmetrical=false
order test-order-3 INFINITY: msPostgresql:promote vip-rep:start symmetrical=false
order test-order-4 0: msPostgresql:demote vip-master:stop symmetrical=false
order test-order-5 0: msPostgresql:demote vip-rep:stop symmetrical=false
(snip)

[root@srv01 ~]# crm configure load update db2-resource_set_0207.crm
Traceback (most recent call last):
  File "/usr/sbin/crm", line 56, in <module>
    rc = main.run()
  File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 433, in run
    return do_work(context, user_args)
  File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 272, in do_work
    if context.run(' '.join(l)):
  File "/usr/lib64/python2.6/site-packages/crmsh/ui_context.py", line 87, in run
    rv = self.execute_command() is not False
  File "/usr/lib64/python2.6/site-packages/crmsh/ui_context.py", line 244, in execute_command
    rv = self.command_info.function(*arglist)
  File "/usr/lib64/python2.6/site-packages/crmsh/ui_configure.py", line 432, in do_load
    return set_obj.import_file(method, url)
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 314, in import_file
    return self.save(s, no_remove=True, method=method)
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 529, in save
    upd_type="cli", method=method)
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 3178, in set_update
    if not self._set_update(edit_d, mk_set, upd_set, del_set, upd_type, method):
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 3170, in _set_update
    return self._cli_set_update(edit_d, mk_set, upd_set, del_set, method)
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 3118, in _cli_set_update
    obj = self.create_from_cli(cli)
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 3045, in create_from_cli
    node = obj.cli2node(cli_list)
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 1014, in cli2node
    node = self._cli_list2node(cli_list, oldnode)
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 1832, in _cli_list2node
    headnode = mkxmlsimple(head, oldnode, '')
  File "/usr/lib64/python2.6/site-packages/crmsh/cibconfig.py", line 662, in mkxmlsimple
    node.set(n, v)
  File "lxml.etree.pyx", line 634, in lxml.etree._Element.set (src/lxml/lxml.etree.c:31548)
  File "apihelpers.pxi", line 487, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:13896)
  File "apihelpers.pxi", line 1240, in lxml.etree._utf8 (src/lxml/lxml.etree.c:19826)
TypeError: Argument must be string or unicode.
---

This error seems to go away if I apply the patch that Mr. Inoue contributed earlier:

* http://www.gossamer-threads.com/lists/linuxha/dev/88660

-        v = v.lower()
     if n.startswith('$'):
         n = n.lstrip('$')
+    if isinstance(v, bool):
+        v = "true" if v else "false"
     if (type(v) != type('') and type(v) != type(u'')) \
             or v:  # skip empty strings
         node.set(n, v)

Best Regards,
Hideo Yamauchi.

--- On Thu, 2014/2/13, Kristoffer Grönlund wrote:

> On
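The TypeError in the traceback comes from lxml: `_Element.set()` accepts only string values, but crmsh handed it the boolean parsed from `symmetrical=false`. Below is a minimal Python sketch of the coercion that Mr. Inoue's patch adds; the helper name is mine, not crmsh's.

```python
def normalize_attr_value(v):
    """Coerce a parsed CLI attribute value for lxml's Element.set().

    lxml raises "TypeError: Argument must be string or unicode." for
    non-string attribute values, so a boolean parsed from a setting
    such as symmetrical=false has to be turned back into the string
    form ("true"/"false") that Pacemaker expects in the CIB.
    """
    if isinstance(v, bool):
        return "true" if v else "false"
    return v
```

With a coercion like this applied before `node.set(n, v)`, boolean attribute values no longer reach lxml and the load succeeds.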
Re: [Pacemaker] [Question:crmsh] About a setting method of sequential=true in crmsh.
Hi Kristoffer,

Thank you for your comments.

> Could you try with the latest changeset 337654e0cdc4?

I tried it; however, the problem still occurs.

[root@srv01 crmsh-337654e0cdc4]# make install
Making install in doc
make[1]: Entering directory `/opt/crmsh-337654e0cdc4/doc'
a2x -f manpage crm.8.txt
WARNING: crm.8.txt: line 621: missing [[cmdhelp_._status] section
WARNING: crm.8.txt: line 3936: missing [[cmdhelp_._report] section
./crm.8.xml:3137: element refsect1: validity error : Element refsect1 content does not follow the DTD, expecting (refsect1info? , (title , subtitle? , titleabbrev?) , (((calloutlist | glosslist | bibliolist | itemizedlist | orderedlist | segmentedlist | simplelist | variablelist | caution | important | note | tip | warning | literallayout | programlisting | programlistingco | screen | screenco | screenshot | synopsis | cmdsynopsis | funcsynopsis | classsynopsis | fieldsynopsis | constructorsynopsis | destructorsynopsis | methodsynopsis | formalpara | para | simpara | address | blockquote | graphic | graphicco | mediaobject | mediaobjectco | informalequation | informalexample | informalfigure | informaltable | equation | example | figure | table | msgset | procedure | sidebar | qandaset | task | anchor | bridgehead | remark | highlights | abstract | authorblurb | epigraph | indexterm | beginpage)+ , refsect2*) | refsect2+)), got (title simpara simpara simpara simpara literallayout refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 simpara simpara simpara literallayout simpara literallayout refsect2 refsect2 refsect2 )
^
a2x: failed: xmllint --nonet --noout --valid "./crm.8.xml"
make[1]: *** [crm.8] Error 1
make[1]: Leaving directory `/opt/crmsh-337654e0cdc4/doc'
make: *** [install-recursive] Error 1

Best Regards,

--- On Wed, 2014/2/12, Kristoffer Grönlund wrote:

> On Wed, 12 Feb 2014 09:12:08 +0900 (JST)
> renayama19661...@ybb.ne.jp wrote:
>
> > Hi Kristoffer,
> >
> > Thank you for comments.
> >
> > I tested it.
> > However, the problem seems to still occur.
>
> Hello,
>
> Could you try with the latest changeset 337654e0cdc4?
>
> Thank you,
>
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com
[Pacemaker] About the difference in handling of "sequential".
Hi All,

There is a difference between the two functions below in how "sequential" of a colocation "resource_set" is handled. Is one of them a mistake?

static gboolean
unpack_colocation_set(xmlNode * set, int score, pe_working_set_t * data_set)
{
    xmlNode *xml_rsc = NULL;
    resource_t *with = NULL;
    resource_t *resource = NULL;
    const char *set_id = ID(set);
    const char *role = crm_element_value(set, "role");
    const char *sequential = crm_element_value(set, "sequential");
    int local_score = score;

    const char *score_s = crm_element_value(set, XML_RULE_ATTR_SCORE);

    if (score_s) {
        local_score = char2score(score_s);
    }

    /* When "sequential" is not set, "sequential" is treated as TRUE. */
    if (sequential != NULL && crm_is_true(sequential) == FALSE) {
        return TRUE;
(snip)

static gboolean
colocate_rsc_sets(const char *id, xmlNode * set1, xmlNode * set2, int score,
                  pe_working_set_t * data_set)
{
    xmlNode *xml_rsc = NULL;
    resource_t *rsc_1 = NULL;
    resource_t *rsc_2 = NULL;

    const char *role_1 = crm_element_value(set1, "role");
    const char *role_2 = crm_element_value(set2, "role");

    const char *sequential_1 = crm_element_value(set1, "sequential");
    const char *sequential_2 = crm_element_value(set2, "sequential");

    /* When "sequential" is not set, "sequential" is treated as FALSE. */
    if (crm_is_true(sequential_1)) {
        /* get the first one */
        for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next(xml_rsc)) {
            if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) {
                EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc));
                break;
            }
        }
    }

    if (crm_is_true(sequential_2)) {
        /* get the last one */
(snip)

Best Regards,
Hideo Yamauchi.
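To make the discrepancy concrete, here is a small Python model of the two C code paths above. The function names are mine, and crm_is_true is only an approximation of Pacemaker's behavior.

```python
def crm_is_true(value):
    # Rough analogue of Pacemaker's crm_is_true(): NULL/None and
    # unrecognized strings count as false.
    return str(value).lower() in ("true", "on", "yes", "y", "1")

def effective_sequential_unpack(set_attrs):
    # unpack_colocation_set(): the guard bails out only when the
    # attribute is present AND false, so an unset attribute behaves
    # as sequential=true.
    seq = set_attrs.get("sequential")
    return not (seq is not None and not crm_is_true(seq))

def effective_sequential_colocate(set_attrs):
    # colocate_rsc_sets(): crm_is_true() is applied directly to the
    # (possibly missing) value, so an unset attribute behaves as
    # sequential=false.
    return crm_is_true(set_attrs.get("sequential"))
```

An unset attribute therefore defaults to true in one function and to false in the other, which is exactly the inconsistency being reported.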
Re: [Pacemaker] [Question:crmsh] About a setting method of sequential=true in crmsh.
Hi Kristoffer,

Thank you for your comments.

I tested it. However, the problem still seems to occur.

---
[root@srv01 crmsh-8d984b138fc4]# pwd
/opt/crmsh-8d984b138fc4
[root@srv01 crmsh-8d984b138fc4]# ./autogen.sh
autoconf: autoconf (GNU Autoconf) 2.63
automake: automake (GNU automake) 1.11.1
(snip)
[root@srv01 crmsh-8d984b138fc4]# ./configure --sysconfdir=/etc --localstatedir=/var
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
(snip)
Prefix = /usr
Executables = /usr/sbin
Man pages = /usr/share/man
Libraries = /usr/lib64
Header files = ${prefix}/include
Arch-independent files = /usr/share
State information = /var
System configuration = /etc
[root@srv01 crmsh-8d984b138fc4]# make install
Making install in doc
make[1]: Entering directory `/opt/crmsh-8d984b138fc4/doc'
a2x -f manpage crm.8.txt
WARNING: crm.8.txt: line 621: missing [[cmdhelp_._status] section
WARNING: crm.8.txt: line 3936: missing [[cmdhelp_._report] section
./crm.8.xml:3137: element refsect1: validity error : Element refsect1 content does not follow the DTD, expecting (refsect1info? , (title , subtitle? , titleabbrev?) , (((calloutlist | glosslist | bibliolist | itemizedlist | orderedlist | segmentedlist | simplelist | variablelist | caution | important | note | tip | warning | literallayout | programlisting | programlistingco | screen | screenco | screenshot | synopsis | cmdsynopsis | funcsynopsis | classsynopsis | fieldsynopsis | constructorsynopsis | destructorsynopsis | methodsynopsis | formalpara | para | simpara | address | blockquote | graphic | graphicco | mediaobject | mediaobjectco | informalequation | informalexample | informalfigure | informaltable | equation | example | figure | table | msgset | procedure | sidebar | qandaset | task | anchor | bridgehead | remark | highlights | abstract | authorblurb | epigraph | indexterm | beginpage)+ , refsect2*) | refsect2+)), got (title simpara simpara simpara simpara literallayout refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 simpara simpara simpara literallayout simpara literallayout refsect2 refsect2 refsect2 )
^
a2x: failed: xmllint --nonet --noout --valid "./crm.8.xml"
make[1]: *** [crm.8] Error 1
make[1]: Leaving directory `/opt/crmsh-8d984b138fc4/doc'
make: *** [install-recursive] Error 1
---

Best Regards,
Hideo Yamauchi.

--- On Mon, 2014/2/10, Kristoffer Grönlund wrote:

> On Fri, 7 Feb 2014 09:21:12 +0900 (JST)
> renayama19661...@ybb.ne.jp wrote:
>
> > Hi Kristoffer,
> >
> > In RHEL6.4, crmsh-c8f214020b2c gives the next error and cannot
> > install it.
> >
> > Does a procedure of any installation have a problem?
>
> Hello,
>
> It seems that docbook validation is stricter on RHEL 6.4 than on other
> systems I use to test. I have pushed a fix for this problem, please
> test again with changeset 8d984b138fc4.
>
> Thank you,
>
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com
Re: [Pacemaker] [Question:crmsh] About a setting method of sequential=true in crmsh.
Hi Kristoffer,

On RHEL 6.4, crmsh-c8f214020b2c fails to install with the following error. Is there a problem with my installation procedure?

---
[root@srv01 crmsh-c8f214020b2c]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.4 (Santiago)
[root@srv01 crmsh-c8f214020b2c]# ./autogen.sh
(snip)
[root@srv01 crmsh-c8f214020b2c]# ./configure --sysconfdir=/etc --localstatedir=/var
(snip)
[root@srv01 crmsh-c8f214020b2c]# make install
Making install in doc
make[1]: Entering directory `/opt/crmsh-c8f214020b2c/doc'
a2x -f manpage crm.8.txt
WARNING: crm.8.txt: line 621: missing [[cmdhelp_._status] section
WARNING: crm.8.txt: line 3935: missing [[cmdhelp_._report] section
./crm.8.xml:3137: element refsect1: validity error : Element refsect1 content does not follow the DTD, expecting (refsect1info? , (title , subtitle? , titleabbrev?) , (((calloutlist | glosslist | bibliolist | itemizedlist | orderedlist | segmentedlist | simplelist | variablelist | caution | important | note | tip | warning | literallayout | programlisting | programlistingco | screen | screenco | screenshot | synopsis | cmdsynopsis | funcsynopsis | classsynopsis | fieldsynopsis | constructorsynopsis | destructorsynopsis | methodsynopsis | formalpara | para | simpara | address | blockquote | graphic | graphicco | mediaobject | mediaobjectco | informalequation | informalexample | informalfigure | informaltable | equation | example | figure | table | msgset | procedure | sidebar | qandaset | task | anchor | bridgehead | remark | highlights | abstract | authorblurb | epigraph | indexterm | beginpage)+ , refsect2*) | refsect2+)), got (title simpara simpara simpara simpara literallayout refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 refsect2 simpara simpara simpara literallayout simpara literallayout refsect2 refsect2 refsect2 )
^
a2x: failed: xmllint --nonet --noout --valid "./crm.8.xml"
make[1]: *** [crm.8] Error 1
make[1]: Leaving directory `/opt/crmsh-c8f214020b2c/doc'
make: *** [install-recursive] Error 1
---

The same error occurs with the latest crmsh-cc52dc69ceb1.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2014/2/6, Kristoffer Grönlund wrote:

> On Wed, 5 Feb 2014 23:17:36 +0900 (JST)
> renayama19661...@ybb.ne.jp wrote:
>
> > Hi Kristoffer.
> >
> > Thank you for comments.
> > We wait for a correction.
> >
> > Many Thanks!
> > Hideo Yamauchi.
>
> Hello,
>
> A fix for this issue has now been committed as changeset c8f214020b2c,
> please let me know if it solves the problem for you.
>
> This construction should now generate the correct XML:
>
> colocation rsc_colocation-grpPg-clnPing INFINITY: \
>     [ msPostgresql:Master sequential=true ] \
>     [ vip-master vip-rep sequential=true ]
>
> Thank you,
>
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com
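For reference, Pacemaker expresses such a constraint with resource_set elements in the CIB; the construction Kristoffer shows should map to XML of roughly this shape (the resource_set ids here are illustrative placeholders, not values from the thread):

```xml
<rsc_colocation id="rsc_colocation-grpPg-clnPing" score="INFINITY">
  <resource_set id="set-1" role="Master" sequential="true">
    <resource_ref id="msPostgresql"/>
  </resource_set>
  <resource_set id="set-2" sequential="true">
    <resource_ref id="vip-master"/>
    <resource_ref id="vip-rep"/>
  </resource_set>
</rsc_colocation>
```

The bug under discussion was that crmsh would never emit sequential="true" explicitly, so round-tripping a configuration like this through the CLI lost the attribute.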
Re: [Pacemaker] [Question:crmsh] About a setting method of sequential=true in crmsh.
Hi Kristoffer,

Thank you for your comments. We will wait for the correction.

Many Thanks!
Hideo Yamauchi.

--- On Wed, 2014/2/5, Kristoffer Grönlund wrote:

> On Wed, 5 Feb 2014 15:55:42 +0900 (JST)
> renayama19661...@ybb.ne.jp wrote:
>
> > Hi All,
> >
> > We tried to set the sequential attribute of a colocation resource_set
> > to true in crmsh.
> >
> > We tried the following methods, but could not set it to true.
> >
> > [snipped]
> >
> > How can I set the sequential attribute to true in crmsh?
>
> Hello,
>
> Unfortunately, the parsing of resource sets in crmsh is incorrect in
> this case. crmsh will never explicitly set sequential to true, only to
> false. This is a bug in both the 2.0 development branch and in all
> previous versions.
>
> A fix is on the way.
>
> Thank you,
> Kristoffer
>
> > Best Regards,
> > Hideo Yamauchi.
>
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com
[Pacemaker] [Question:crmsh] About a setting method of sequential=true in crmsh.
Hi All,

We tried to set the sequential attribute of a colocation resource_set to true in crmsh.

We tried the following methods, but could not set it to true.

---
[pengine]# crm --version
2.0 (Build 7cd5688c164d2949009accc7f172ce559cadbc4b)
---

- Pattern 1 -
colocation rsc_colocation-grpPg-clnPing INFINITY: msPostgresql:Master vip-master vip-rep sequential=true

- Pattern 2 -
colocation rsc_colocation-grpPg-clnPing INFINITY: msPostgresql:Master vip-master vip-rep sequential=false

- Pattern 3 -
colocation rsc_colocation-grpPg-clnPing INFINITY: msPostgresql:Master sequential=true vip-master vip-rep sequential=true

- Pattern 4 -
colocation rsc_colocation-grpPg-clnPing INFINITY: msPostgresql:Master sequential=false vip-master vip-rep sequential=false

- Pattern 5 -
colocation rsc_colocation-grpPg-clnPing INFINITY: [ msPostgresql:Master ] [ vip-master vip-rep ]

- Pattern 6 -
colocation rsc_colocation-grpPg-clnPing INFINITY: ( msPostgresql:Master ) ( vip-master vip-rep )

- Pattern 7 -
colocation rsc_colocation-grpPg-clnPing INFINITY: [ msPostgresql:Master sequential=true ] [ vip-master vip-rep sequential=true ]

- Pattern 8 -
colocation rsc_colocation-grpPg-clnPing INFINITY: ( msPostgresql:Master sequential=true ) ( vip-master vip-rep sequential=true )
---

How can I set the sequential attribute to true in crmsh?

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] A resource starts with a standby node. (Latest attrd does not serve as the crmd-transition-delay parameter)
Hi Andrew,

Sorry for the late reply. I registered this problem in Bugzilla; the report file is attached there, too.

* http://bugs.clusterlabs.org/show_bug.cgi?id=5194

Best Regards,
Hideo Yamauchi.

--- On Tue, 2014/1/14, Andrew Beekhof wrote:

>
> On 14 Jan 2014, at 4:33 pm, renayama19661...@ybb.ne.jp wrote:
>
> > Hi Andrew,
>
> > > > > Are you using the new attrd code or the legacy stuff?
> >>>
> >>> I use new attrd.
> >>
> >> And the values are not being sent to the cib at the same time?
> >
> > As far as I looked. . .
> > When the transmission of the attribute of attrd of the node was late, a
> > leader of attrd seemed to send an attribute to cib without waiting for it.
>
> And you have a delay configured? And this value was set prior to that delay
> expiring?
>
> > > > > Only the new code makes (or at least should do) crmd-transition-delay
> > > > > redundant.
> >>>
> >>> It did not seem to work so that new attrd dispensed with
> >>> crmd-transition-delay to me.
> >>> I report the details again.
> >>> # Probably it will be Bugzilla. . .
> >>
> >> Sounds good
> >
> > All right!
> >
> > Many Thanks!
> > Hideo Yamauchi.
> >
> > --- On Tue, 2014/1/14, Andrew Beekhof wrote:
> >
> >>
> >> On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote:
> >>
> >>> Hi Andrew,
> >>>
> >>> Thank you for comments.
> >>>
> >>> > Are you using the new attrd code or the legacy stuff?
> >>>
> >>> I use new attrd.
> >>
> >> And the values are not being sent to the cib at the same time?
> >>
> >>> > > If you're not using corosync 2.x or see:
> >>> > > crm_notice("Starting mainloop...");
> >>> > > then its the old code. The new code could also be used with CMAN but
> >>> > > isn't configured to build for in that situation.
> >>> > > Only the new code makes (or at least should do) crmd-transition-delay
> >>> > > redundant.
> >>>
> >>> It did not seem to work so that new attrd dispensed with
> >>> crmd-transition-delay to me.
> >>> I report the details again.
> >>> # Probably it will be Bugzilla. . .
> >>
> >> Sounds good
> >>
> >>> Best Regards,
> >>> Hideo Yamauchi.
> >>>
> >>> --- On Tue, 2014/1/14, Andrew Beekhof wrote:
> >>>
> > On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:
> >
> > > Hi All,
> > >
> > > I contributed the next Bugzilla entry before, about a problem caused by
> > > differences in the timing of attribute updates by attrd.
> > > * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
> > >
> > > We can evade this problem now by using the crmd-transition-delay
> > > parameter.
> > >
> > > I recently checked whether the renewed attrd avoids this problem.
> > > * In the latest attrd, one instance becomes a leader and seems to update
> > >   the attributes.
> > >
> > > However, the latest attrd does not seem to substitute for
> > > crmd-transition-delay.
> > > * I will contribute a detailed log later.
> > >
> > > We are dissatisfied with continuing to use crmd-transition-delay.
> > > Is there a plan for attrd to handle this problem in the future?
> >
> > Are you using the new attrd code or the legacy stuff?
> >
> > If you're not using corosync 2.x or see:
> >     crm_notice("Starting mainloop...");
> > then its the old code. The new code could also be used with CMAN but
> > isn't configured to build for in that situation.
> >
> > Only the new code makes (or at least should do) crmd-transition-delay
> > redundant.
Re: [Pacemaker] [Question] About replacing in resource_set of the order limitation.
Hi All,

My earlier test included a mistake. It seems the configuration can be replaced by two constraints.

> However, I think that symmetrical="false" is applied to all order constraints
> in this.
> (snip)
> ...
> (snip)

If my understanding includes a mistake, please point it out.

Best Regards,
Hideo Yamauchi.

--- On Fri, 2014/1/17, renayama19661...@ybb.ne.jp wrote:

> Hi All,
>
> We are testing the resource_set feature.
>
> There were a group resource and a clone resource.
>
> (snip)
> Stack: corosync
> Current DC: srv01 (3232238180) - partition WITHOUT quorum
> Version: 1.1.10-f2d0cbc
> 1 Nodes configured
> 7 Resources configured
>
>
> Online: [ srv01 ]
>
> Resource Group: grpPg
>     A (ocf::heartbeat:Dummy): Started srv01
>     B (ocf::heartbeat:Dummy): Started srv01
>     C (ocf::heartbeat:Dummy): Started srv01
>     D (ocf::heartbeat:Dummy): Started srv01
>     E (ocf::heartbeat:Dummy): Started srv01
>     F (ocf::heartbeat:Dummy): Started srv01
> Clone Set: clnPing [prmPing]
>     Started: [ srv01 ]
>
> Node Attributes:
> * Node srv01:
>     + default_ping_set : 100
>
> Migration summary:
> * Node srv01:
>
> (snip)
>
> These resources have the following constraints.
>
> (snip)
> rsc="grpPg" with-rsc="clnPing">
>
> then="grpPg" symmetrical="false">
>
> (snip)
>
> We tried rearranging the group in a resource_set.
> I think the "colocation" constraint can be rearranged as follows.
>
> (snip)
> ...
> (snip)
>
> How should I rearrange the "order" constraint in a resource_set?
>
> I thought it was necessary to express the following two things, but could not
> find a good way to express them.
>
> * "symmetrical=true" is necessary between the resources that form the
>   group (A to F).
> * "symmetrical=false" is necessary between the grouped resources (A to F)
>   and the clone resource.
>
> I wrote it as follows.
> However, I think that symmetrical="false" is applied to all order constraints
> in this.
> (snip)
> ...
> (snip)
>
> Best Regards,
> Hideo Yamauchi.
[Pacemaker] [Question] About replacing in resource_set of the order limitation.
Hi All,

We are testing the resource_set feature.

There were a group resource and a clone resource.

(snip)
Stack: corosync
Current DC: srv01 (3232238180) - partition WITHOUT quorum
Version: 1.1.10-f2d0cbc
1 Nodes configured
7 Resources configured


Online: [ srv01 ]

Resource Group: grpPg
    A (ocf::heartbeat:Dummy): Started srv01
    B (ocf::heartbeat:Dummy): Started srv01
    C (ocf::heartbeat:Dummy): Started srv01
    D (ocf::heartbeat:Dummy): Started srv01
    E (ocf::heartbeat:Dummy): Started srv01
    F (ocf::heartbeat:Dummy): Started srv01
Clone Set: clnPing [prmPing]
    Started: [ srv01 ]

Node Attributes:
* Node srv01:
    + default_ping_set : 100

Migration summary:
* Node srv01:

(snip)

These resources have the following constraints.

(snip)
(snip)

We tried rearranging the group in a resource_set. I think the "colocation" constraint can be rearranged as follows.

(snip)
...
(snip)

How should I rearrange the "order" constraint in a resource_set?

I thought it was necessary to express the following two things, but could not find a good way to express them.

* "symmetrical=true" is necessary between the resources that form the group (A to F).
* "symmetrical=false" is necessary between the grouped resources (A to F) and the clone resource.

I wrote it as follows. However, I think that symmetrical="false" is applied to all order constraints in this.

(snip)
...
(snip)

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Enhancement] Change of the "globally-unique" attribute of the resource.
Hi Andrew,

Sorry, this problem was with Pacemaker 1.0. On Pacemaker 1.1.11, the resource stops correctly.

When the "globally-unique" attribute is changed in Pacemaker 1.1, Pacemaker seems to restart the resource:

(snip)
Jan 15 18:29:40 rh64-2744 pengine[3369]: warning: process_rsc_state: Detected active orphan prmClusterMon running on rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: clone_print: Clone Set: clnClusterMon [prmClusterMon] (unique)
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print: prmClusterMon:0#011(ocf::pacemaker:ClusterMon):#011Stopped
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print: prmClusterMon:1#011(ocf::pacemaker:ClusterMon):#011Stopped
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_print: prmClusterMon#011(ocf::pacemaker:ClusterMon):#011 ORPHANED Started rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: DeleteRsc: Removing prmClusterMon from rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: native_color: Stopping orphan resource prmClusterMon
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: RecurringOp: Start recurring monitor (10s) for prmClusterMon:0 on rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: info: RecurringOp: Start recurring monitor (10s) for prmClusterMon:1 on rh64-2744
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: LogActions: Start prmClusterMon:0#011(rh64-2744)
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: LogActions: Start prmClusterMon:1#011(rh64-2744)
Jan 15 18:29:40 rh64-2744 pengine[3369]: notice: LogActions: Stop prmClusterMon#011(rh64-2744)
(snip)

Best Regards,
Hideo Yamauchi.

--- On Wed, 2014/1/15, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
>
> Thank you for comment.
>
> > > But, the resource does not stop because PID file was changed as for the
> > > changed resource of the "globally-unique" attribute.
> >
> > I'd have expected the stop action to be performed with the old attributes.
> > crm_report tarball?
>
> Okay.
>
> I register this topic with Bugzilla.
> I attach the log to Bugzilla.
>
> Best Regards,
> Hideo Yamauchi.
>
> --- On Wed, 2014/1/15, Andrew Beekhof wrote:
>
> >
> > On 14 Jan 2014, at 7:26 pm, renayama19661...@ybb.ne.jp wrote:
> >
> > > Hi All,
> > >
> > > When a user changes the "globally-unique" attribute of the resource, a
> > > problem occurs.
> > >
> > > When it manages the resource with PID file, this occurs, but this is
> > > because PID file name changes by "globally-unique" attribute.
> > >
> > > (snip)
> > > if [ ${OCF_RESKEY_CRM_meta_globally_unique} = "false" ]; then
> > >     : ${OCF_RESKEY_pidfile:="$HA_VARRUN/ping-${OCF_RESKEY_name}"}
> > > else
> > >     : ${OCF_RESKEY_pidfile:="$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}"}
> > > fi
> > > (snip)
> >
> > This is correct. The pid file cannot include the instance number when
> > globally-unique is false and must do so when it is true.
> >
> > > The problem can reappear in the following procedure.
> > >
> > > * Step1: Started a resource.
> > > (snip)
> > > primitive prmPingd ocf:pacemaker:pingd \
> > >     params name="default_ping_set" host_list="192.168.0.1" multiplier="200" \
> > >     op start interval="0s" timeout="60s" on-fail="restart" \
> > >     op monitor interval="10s" timeout="60s" on-fail="restart" \
> > >     op stop interval="0s" timeout="60s" on-fail="ignore"
> > > clone clnPingd prmPingd
> > > (snip)
> > >
> > > * Step2: Change "globally-unique" attribute.
> > >
> > > [root]# crm configure edit
> > > (snip)
> > > clone clnPingd prmPingd \
> > >     meta clone-max="2" clone-node-max="2" globally-unique="true"
> > > (snip)
> > >
> > > * Step3: Stop Pacemaker
> > >
> > > But, the resource does not stop because PID file was changed as for the
> > > changed resource of the "globally-unique" attribute.
> >
> > I'd have expected the stop action to be performed with the old attributes.
> > crm_report tarball?
> >
> > > I think that this is a known problem.
> >
> > It wasn't until now.
> >
> > > I wish this problem is solved in the future
> > >
> > > Best Regards,
> > > Hideo Yamauchi.
Re: [Pacemaker] [Enhancement] Change of the "globally-unique" attribute of the resource.
Hi Andrew,

Thank you for the comment.

> > But, the resource does not stop because PID file was changed as for the
> > changed resource of the "globally-unique" attribute.
>
> I'd have expected the stop action to be performed with the old attributes.
> crm_report tarball?

Okay. I will register this topic in Bugzilla and attach the log there.

Best Regards,
Hideo Yamauchi.

--- On Wed, 2014/1/15, Andrew Beekhof wrote:

>
> On 14 Jan 2014, at 7:26 pm, renayama19661...@ybb.ne.jp wrote:
>
> > Hi All,
> >
> > When a user changes the "globally-unique" attribute of the resource, a
> > problem occurs.
> >
> > When it manages the resource with PID file, this occurs, but this is
> > because PID file name changes by "globally-unique" attribute.
> >
> > (snip)
> > if [ ${OCF_RESKEY_CRM_meta_globally_unique} = "false" ]; then
> >     : ${OCF_RESKEY_pidfile:="$HA_VARRUN/ping-${OCF_RESKEY_name}"}
> > else
> >     : ${OCF_RESKEY_pidfile:="$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}"}
> > fi
> > (snip)
>
> This is correct. The pid file cannot include the instance number when
> globally-unique is false and must do so when it is true.
>
> > The problem can reappear in the following procedure.
> >
> > * Step1: Started a resource.
> > (snip)
> > primitive prmPingd ocf:pacemaker:pingd \
> >     params name="default_ping_set" host_list="192.168.0.1" multiplier="200" \
> >     op start interval="0s" timeout="60s" on-fail="restart" \
> >     op monitor interval="10s" timeout="60s" on-fail="restart" \
> >     op stop interval="0s" timeout="60s" on-fail="ignore"
> > clone clnPingd prmPingd
> > (snip)
> >
> > * Step2: Change "globally-unique" attribute.
> >
> > [root]# crm configure edit
> > (snip)
> > clone clnPingd prmPingd \
> >     meta clone-max="2" clone-node-max="2" globally-unique="true"
> > (snip)
> >
> > * Step3: Stop Pacemaker
> >
> > But, the resource does not stop because PID file was changed as for the
> > changed resource of the "globally-unique" attribute.
>
> I'd have expected the stop action to be performed with the old attributes.
> crm_report tarball?
>
> > I think that this is a known problem.
>
> It wasn't until now.
>
> > I wish this problem is solved in the future
> >
> > Best Regards,
> > Hideo Yamauchi.
[Pacemaker] [Enhancement] Change of the "globally-unique" attribute of the resource.
Hi All,

When a user changes the "globally-unique" attribute of a resource, a problem occurs. It happens when the resource is managed through a PID file, because the PID file name depends on the "globally-unique" attribute:

(snip)
if [ ${OCF_RESKEY_CRM_meta_globally_unique} = "false" ]; then
    : ${OCF_RESKEY_pidfile:="$HA_VARRUN/ping-${OCF_RESKEY_name}"}
else
    : ${OCF_RESKEY_pidfile:="$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}"}
fi
(snip)

The problem can be reproduced with the following procedure.

* Step 1: Start a resource.

(snip)
primitive prmPingd ocf:pacemaker:pingd \
    params name="default_ping_set" host_list="192.168.0.1" multiplier="200" \
    op start interval="0s" timeout="60s" on-fail="restart" \
    op monitor interval="10s" timeout="60s" on-fail="restart" \
    op stop interval="0s" timeout="60s" on-fail="ignore"
clone clnPingd prmPingd
(snip)

* Step 2: Change the "globally-unique" attribute.

[root]# crm configure edit
(snip)
clone clnPingd prmPingd \
    meta clone-max="2" clone-node-max="2" globally-unique="true"
(snip)

* Step 3: Stop Pacemaker.

However, the resource does not stop, because changing the "globally-unique" attribute changed the PID file name the resource uses.

I think this is a known problem.
I hope it will be solved in the future.

Best Regards,
Hideo Yamauchi.
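The pidfile selection above can be sketched as a standalone shell snippet. All variable values here (run directory, resource name, instance suffix) are illustrative assumptions, not taken from a real cluster:

```shell
# Minimal sketch of the pidfile logic from ocf:pacemaker:pingd quoted above.
# Values below are hypothetical examples.
HA_VARRUN=/var/run
OCF_RESKEY_name=default_ping_set      # used when globally-unique=false
OCF_RESOURCE_INSTANCE=prmPingd:0      # instance-numbered name used when globally-unique=true

pidfile_for() {
    # $1 = value of OCF_RESKEY_CRM_meta_globally_unique ("true" or "false")
    if [ "$1" = "false" ]; then
        echo "$HA_VARRUN/ping-${OCF_RESKEY_name}"
    else
        echo "$HA_VARRUN/ping-${OCF_RESOURCE_INSTANCE}"
    fi
}

before=$(pidfile_for false)
after=$(pidfile_for true)
echo "$before"   # /var/run/ping-default_ping_set
echo "$after"    # /var/run/ping-prmPingd:0
```

This shows why the stop in Step 3 fails: the daemon wrote its PID under the old name, but after the attribute change the stop action looks for the file under the new name and does not find the process.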
Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)
Hi Andrew, > >> Are you using the new attrd code or the legacy stuff? > > > > I use new attrd. > > And the values are not being sent to the cib at the same time? As far as I looked. . . When the transmission of the attribute of attrd of the node was late, a leader of attrd seemed to send an attribute to cib without waiting for it. > >> Only the new code makes (or at least should do) crmd-transition-delay > >> redundant. > > > > It did not seem to work so that new attrd dispensed with > > crmd-transition-delay to me. > > I report the details again. > > # Probably it will be Bugzilla. . . > > Sounds good All right! Many Thanks! Hideo Yamauch. --- On Tue, 2014/1/14, Andrew Beekhof wrote: > > On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comments. > > > >> Are you using the new attrd code or the legacy stuff? > > > > I use new attrd. > > And the values are not being sent to the cib at the same time? > > > > >> > >> If you're not using corosync 2.x or see: > >> > >> crm_notice("Starting mainloop..."); > >> > >> then its the old code. The new code could also be used with CMAN but > >> isn't configured to build for in that situation. > >> > >> Only the new code makes (or at least should do) crmd-transition-delay > >> redundant. > > > > It did not seem to work so that new attrd dispensed with > > crmd-transition-delay to me. > > I report the details again. > > # Probably it will be Bugzilla. . . > > Sounds good > > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Tue, 2014/1/14, Andrew Beekhof wrote: > > > >> > >> On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote: > >> > >>> Hi All, > >>> > >>> I contributed next bugzilla by a problem to occur for the difference of > >>> the timing of the attribute update by attrd before. > >>> * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528 > >>> > >>> We can evade this problem now by using crmd-transition-delay parameter. 
> >>> > >>> I confirmed whether I could evade this problem by renewed attrd recently. > >>> * In latest attrd, one became a leader and seemed to come to update an > >>> attribute. > >>> > >>> However, latest attrd does not seem to substitute for > >>> crmd-transition-delay. > >>> * I contribute detailed log later. > >>> > >>> We are dissatisfied with continuing using crmd-transition-delay. > >>> Is there the plan when attrd handles this problem well in the future? > >> > >> Are you using the new attrd code or the legacy stuff? > >> > >> If you're not using corosync 2.x or see: > >> > >> crm_notice("Starting mainloop..."); > >> > >> then its the old code. The new code could also be used with CMAN but > >> isn't configured to build for in that situation. > >> > >> Only the new code makes (or at least should do) crmd-transition-delay > >> redundant. > >> > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)
Hi Andrew, Thank you for comments. > Are you using the new attrd code or the legacy stuff? I use new attrd. > > If you're not using corosync 2.x or see: > > crm_notice("Starting mainloop..."); > > then its the old code. The new code could also be used with CMAN but isn't > configured to build for in that situation. > > Only the new code makes (or at least should do) crmd-transition-delay > redundant. It did not seem to work so that new attrd dispensed with crmd-transition-delay to me. I report the details again. # Probably it will be Bugzilla. . . Best Regards, Hideo Yamauchi. --- On Tue, 2014/1/14, Andrew Beekhof wrote: > > On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote: > > > Hi All, > > > > I contributed next bugzilla by a problem to occur for the difference of the > > timing of the attribute update by attrd before. > > * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528 > > > > We can evade this problem now by using crmd-transition-delay parameter. > > > > I confirmed whether I could evade this problem by renewed attrd recently. > > * In latest attrd, one became a leader and seemed to come to update an > > attribute. > > > > However, latest attrd does not seem to substitute for crmd-transition-delay. > > * I contribute detailed log later. > > > > We are dissatisfied with continuing using crmd-transition-delay. > > Is there the plan when attrd handles this problem well in the future? > > Are you using the new attrd code or the legacy stuff? > > If you're not using corosync 2.x or see: > > crm_notice("Starting mainloop..."); > > then its the old code. The new code could also be used with CMAN but isn't > configured to build for in that situation. > > Only the new code makes (or at least should do) crmd-transition-delay > redundant. 
[Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)
Hi All,

I previously filed the following Bugzilla entry for a problem caused by differences in the timing of attribute updates by attrd.
* https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528

For now, we can avoid this problem by using the crmd-transition-delay parameter.

I recently checked whether the renewed attrd lets us avoid the problem.
* In the latest attrd, one instance becomes a leader and appears to perform the attribute updates.

However, the latest attrd does not seem to make crmd-transition-delay unnecessary.
* I will contribute a detailed log later.

We are not happy about having to keep using crmd-transition-delay.
Is there a plan for attrd to handle this problem properly in the future?

Best Regards,
Hideo Yamauchi.
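For context, crmd-transition-delay is a cluster-wide property, so the workaround described above is a one-line configuration change. A sketch of setting it via the crm shell follows; the "2s" value is only an illustrative assumption, not a recommendation:

```
crm configure property crmd-transition-delay="2s"
```

This makes the DC wait the given interval after a transition trigger before invoking the policy engine, giving slow attrd updates a chance to reach the cib first.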
Re: [Pacemaker] crmd Segmentation fault at pacemaker 1.0.12
Hi Andrew, Hi Takatsuka-san,

What procedure triggered this problem?
If the problem can easily happen in a user's environment, we think it is necessary to notify users running Pacemaker 1.0.13.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2013/11/14, TAKATSUKA Haruka wrote:

> Hello Andrew,
> Thank you for the quick modification.
>
> (Unfortunately the confirmation test is not possible because I don't understand
> a reproduction method for this crash ...)
>
> regards,
> ---
> Haruka Takatsuka
>
>
> On Thu, 14 Nov 2013 09:32:24 +1100
> Andrew Beekhof wrote:
>
> >
> > On 13 Nov 2013, at 7:36 pm, TAKATSUKA Haruka wrote:
> >
> > > Hello, pacemaker hackers
> > >
> > > I report a crash of crmd in pacemaker 1.0.12.
> > >
> > > We are going to upgrade pacemaker 1.0.12 to 1.0.13.
> > > But I was not able to find a fix for this problem in the ChangeLog.
> > > tengine.c:do_te_invoke() does not seem to handle the transition_graph==NULL
> > > case, even in the 1.0.x head code.
> >
> > This should help:
> >
> > https://github.com/ClusterLabs/pacemaker-1.0/commit/20f169d9cccb6c889946c64ab09ab4fb7f572f7c
> >
> > >
> > > regards,
> > > Haruka Takatsuka.
> > > -
> > >
> > > [log]
> > > Nov 07 00:00:08 srv1 crmd: [21843]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:259 : transition_graph != NULL
> > > Nov 07 00:00:08 srv1 heartbeat: [21823]: WARN: Managed /usr/lib64/heartbeat/crmd process 21843 killed by signal 11 [SIGSEGV - Segmentation violation].
> > > Nov 07 00:00:08 srv1 heartbeat: [21823]: ERROR: Managed /usr/lib64/heartbeat/crmd process 21843 dumped core
> > > Nov 07 00:00:08 srv1 heartbeat: [21823]: EMERG: Rebooting system. Reason: /usr/lib64/heartbeat/crmd
> > >
> > > [gdb]
> > > $ gdb -c core.21843 -s crmd.debug crmd
> > > --(snip)--
> > > Program terminated with signal 11, Segmentation fault.
> > > #0  0x004199c4 in do_te_invoke (action=140737488355328, cause=C_FSA_INTERNAL, cur_state=S_POLICY_ENGINE,
> > >     current_input=I_FINALIZED, msg_data=0x1b28e20) at tengine.c:186
> > > 186         if(transition_graph->complete == FALSE) {
> > > --(snip)--
Re: [Pacemaker] [Problem]Two error information is displayed.
Hi Andrew, I confirmed that a problem was solved in a revision. Thanks! Hideo Yamauchi. --- On Wed, 2013/9/4, Andrew Beekhof wrote: > Thanks (also to Andreas for sending me an example too)! > > Fixed: > https://github.com/beekhof/pacemaker/commit/a32474b > > On 04/09/2013, at 11:02 AM, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > Though the trouble is only once, two error information is displayed in > crm_mon. > >>> > >>> Have you got the full cib for when crm_mon is showing this? > >> > >> No. > >> I reproduce a problem once again and acquire cib. > > > > I send the result that I acquired by cibadmin -Q command. > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Tue, 2013/9/3, renayama19661...@ybb.ne.jp > > wrote: > > > >> Hi Andrew, > >> > Hi All, > > Though the trouble is only once, two error information is displayed in > crm_mon. > >>> > >>> Have you got the full cib for when crm_mon is showing this? > >> > >> No. > >> I reproduce a problem once again and acquire cib. > >> > >> Best Regards, > >> Hideo Yamauchi. > >> > >> ___ > >> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >> > >> Project Home: http://www.clusterlabs.org > >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >> Bugs: http://bugs.clusterlabs.org > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem]Two error information is displayed.
Hi Andrew, > > > Though the trouble is only once, two error information is displayed in > > > crm_mon. > > > > Have you got the full cib for when crm_mon is showing this? > > No. > I reproduce a problem once again and acquire cib. I send the result that I acquired by cibadmin -Q command. Best Regards, Hideo Yamauchi. --- On Tue, 2013/9/3, renayama19661...@ybb.ne.jp wrote: > Hi Andrew, > > > > Hi All, > > > > > > Though the trouble is only once, two error information is displayed in > > > crm_mon. > > > > Have you got the full cib for when crm_mon is showing this? > > No. > I reproduce a problem once again and acquire cib. > > Best Regards, > Hideo Yamauchi. > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem]Two error information is displayed.
Hi Andrew,

> > Hi All,
> >
> > Although the failure occurred only once, two error entries are displayed in
> > crm_mon.
>
> Have you got the full cib for when crm_mon is showing this?

No.
I will reproduce the problem once again and acquire the cib.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem]Two error information is displayed.
Hi Andres, Thank you for comment. > But to be seriously: I see this phaenomena, too. > (pacemaker 1.1.11-1.el6-4f672bc) If the version that you confirm is the same as next, probably it will be that the same problem happens. There is a similar cord. (https://github.com/ClusterLabs/pacemaker/blob/4f672bc85eefd33e2fb09b601bb8ec1510645468/lib/pengine/unpack.c) Best Regards, Hideo Yamauchi. --- On Thu, 2013/8/29, Andreas Mock wrote: > Hi Hideo san, > > the two line shall emphasis that you do not only have trouble > but real trouble... ;-) > > But to be seriously: I see this phaenomena, too. > (pacemaker 1.1.11-1.el6-4f672bc) > > Best regards > Andreas Mock > > -Ursprüngliche Nachricht- > Von: renayama19661...@ybb.ne.jp [mailto:renayama19661...@ybb.ne.jp] > Gesendet: Donnerstag, 29. August 2013 02:38 > An: PaceMaker-ML > Betreff: [Pacemaker] [Problem]Two error information is displayed. > > Hi All, > > Though the trouble is only once, two error information is displayed in > crm_mon. > > - > > [root@rh64-coro2 ~]# crm_mon -1 -Af > Last updated: Thu Aug 29 18:11:00 2013 > Last change: Thu Aug 29 18:10:45 2013 via cibadmin on rh64-coro2 > Stack: corosync > Current DC: NONE > 1 Nodes configured > 1 Resources configured > > > Online: [ rh64-coro2 ] > > > Node Attributes: > * Node rh64-coro2: > > Migration summary: > * Node rh64-coro2: > dummy: migration-threshold=1 fail-count=1 last-failure='Thu Aug 29 > 18:10:57 2013' > > Failed actions: > dummy_monitor_3000 on (null) 'not running' (7): call=11, > status=complete, last-rc-change='Thu Aug 29 18:10:57 2013', queued=0ms, > exec=0ms > dummy_monitor_3000 on rh64-coro2 'not running' (7): call=11, > status=complete, last-rc-change='Thu Aug 29 18:10:57 2013', queued=0ms, > exec=0ms > > - > > There seems to be the problem with an additional judgment of the error > information somehow or other. 
> > - > static void > unpack_rsc_op_failure(resource_t *rsc, node_t *node, int rc, xmlNode > *xml_op, enum action_fail_response *on_fail, pe_working_set_t * data_set) > { > int interval = 0; > bool is_probe = FALSE; > action_t *action = NULL; > (snip) > if (rc != PCMK_OCF_NOT_INSTALLED || is_set(data_set->flags, > pe_flag_symmetric_cluster)) { > if ((node->details->shutdown == FALSE) || (node->details->online == > TRUE)) { > add_node_copy(data_set->failed, xml_op); > } > } > > crm_xml_add(xml_op, XML_ATTR_UNAME, node->details->uname); > if ((node->details->shutdown == FALSE) || (node->details->online == > TRUE)) { > add_node_copy(data_set->failed, xml_op); > } > (snip) > - > > > Please revise the additional handling of error information. > > Best Regards, > Hideo Yamauchi. > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] [Problem]Two error information is displayed.
Hi All,

Although the failure occurred only once, two error entries are displayed in crm_mon.

-

[root@rh64-coro2 ~]# crm_mon -1 -Af
Last updated: Thu Aug 29 18:11:00 2013
Last change: Thu Aug 29 18:10:45 2013 via cibadmin on rh64-coro2
Stack: corosync
Current DC: NONE
1 Nodes configured
1 Resources configured

Online: [ rh64-coro2 ]

Node Attributes:
* Node rh64-coro2:

Migration summary:
* Node rh64-coro2:
   dummy: migration-threshold=1 fail-count=1 last-failure='Thu Aug 29 18:10:57 2013'

Failed actions:
    dummy_monitor_3000 on (null) 'not running' (7): call=11, status=complete, last-rc-change='Thu Aug 29 18:10:57 2013', queued=0ms, exec=0ms
    dummy_monitor_3000 on rh64-coro2 'not running' (7): call=11, status=complete, last-rc-change='Thu Aug 29 18:10:57 2013', queued=0ms, exec=0ms

-

There seems to be a problem in how the failure entry is added: the failed operation is appended to data_set->failed twice.

-

static void
unpack_rsc_op_failure(resource_t *rsc, node_t *node, int rc, xmlNode *xml_op,
                      enum action_fail_response *on_fail, pe_working_set_t *data_set)
{
    int interval = 0;
    bool is_probe = FALSE;
    action_t *action = NULL;
(snip)
    if (rc != PCMK_OCF_NOT_INSTALLED || is_set(data_set->flags, pe_flag_symmetric_cluster)) {
        if ((node->details->shutdown == FALSE) || (node->details->online == TRUE)) {
            add_node_copy(data_set->failed, xml_op);
        }
    }

    crm_xml_add(xml_op, XML_ATTR_UNAME, node->details->uname);
    if ((node->details->shutdown == FALSE) || (node->details->online == TRUE)) {
        add_node_copy(data_set->failed, xml_op);
    }
(snip)

-

Please revise how the error information is added.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] The state of a node cut with the node that rebooted by a cluster is not recognized.
Hi Andrew,

> Yep, sounds like a problem.
> I'll follow up on bugzilla

All right!

Many Thanks!
Hideo Yamauchi.

--- On Tue, 2013/6/4, Andrew Beekhof wrote:

>
> On 04/06/2013, at 3:00 PM, renayama19661...@ybb.ne.jp wrote:
>
> >
> > The correct behavior is for the rebooted node to recognize the other nodes as being in the UNCLEAN state, but it seems to recognize them incorrectly.
> >
> > This looks like a problem in Pacemaker itself.
> > * The problem appears to be in the crm_peer_cache hash table.
>
> Yep, sounds like a problem.
> I'll follow up on bugzilla
Re: [Pacemaker] [Problem] The state of a node cut with the node that rebooted by a cluster is not recognized.
Hi All, I registered this problem with Bugzilla. * http://bugs.clusterlabs.org/show_bug.cgi?id=5160 Best Regards, Hideo Yamauchi. --- On Tue, 2013/6/4, renayama19661...@ybb.ne.jp wrote: > Hi All, > > We confirmed a state of the recognition of the cluster in the next procedure. > We confirm it by the next combination.(RHEL6.4 guest) > * corosync-2.3.0 > * pacemaker-Pacemaker-1.1.10-rc3 > > - > > Step 1) Start all nodes and constitute a cluster. > > [root@rh64-coro1 ~]# crm_mon -1 -Af > Last updated: Tue Jun 4 22:30:25 2013 > Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1 > Stack: corosync > Current DC: rh64-coro3 (4231178432) - partition with quorum > Version: 1.1.9-db294e1 > 3 Nodes configured, unknown expected votes > 0 Resources configured. > > > Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ] > > > Node Attributes: > * Node rh64-coro1: > * Node rh64-coro2: > * Node rh64-coro3: > > Migration summary: > * Node rh64-coro1: > * Node rh64-coro3: > * Node rh64-coro2: > > > Step 2) Stop the first unit node. > > [root@rh64-coro2 ~]# crm_mon -1 -Af > Last updated: Tue Jun 4 22:30:55 2013 > Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1 > Stack: corosync > Current DC: rh64-coro3 (4231178432) - partition with quorum > Version: 1.1.9-db294e1 > 3 Nodes configured, unknown expected votes > 0 Resources configured. > > > Online: [ rh64-coro2 rh64-coro3 ] > OFFLINE: [ rh64-coro1 ] > > > Node Attributes: > * Node rh64-coro2: > * Node rh64-coro3: > > Migration summary: > * Node rh64-coro3: > * Node rh64-coro2: > > > Step 3) Restart the first unit node. > > [root@rh64-coro1 ~]# crm_mon -1 -Af > Last updated: Tue Jun 4 22:31:29 2013 > Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1 > Stack: corosync > Current DC: rh64-coro3 (4231178432) - partition with quorum > Version: 1.1.9-db294e1 > 3 Nodes configured, unknown expected votes > 0 Resources configured. 
> > > Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ] > > > Node Attributes: > * Node rh64-coro1: > * Node rh64-coro2: > * Node rh64-coro3: > > Migration summary: > * Node rh64-coro1: > * Node rh64-coro3: > * Node rh64-coro2: > > > Step 4) Interrupt the inter-connect of all nodes. > > [root@kvm-host ~]# brctl delif virbr2 vnet1;brctl delif virbr2 vnet4;brctl > delif virbr2 vnet7;brctl delif virbr3 vnet2;brctl delif virbr3 vnet5;brctl > delif virbr3 vnet8 > > - > > > Two nodes that do not reboot then recognize other nodes definitely. > > [root@rh64-coro2 ~]# crm_mon -1 -Af > Last updated: Tue Jun 4 22:32:06 2013 > Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1 > Stack: corosync > Current DC: rh64-coro2 (4214401216) - partition WITHOUT quorum > Version: 1.1.9-db294e1 > 3 Nodes configured, unknown expected votes > 0 Resources configured. > > > Node rh64-coro1 (4197624000): UNCLEAN (offline) > Node rh64-coro3 (4231178432): UNCLEAN (offline) > Online: [ rh64-coro2 ] > > > Node Attributes: > * Node rh64-coro2: > > Migration summary: > * Node rh64-coro2: > > [root@rh64-coro3 ~]# crm_mon -1 -Af > Last updated: Tue Jun 4 22:33:17 2013 > Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1 > Stack: corosync > Current DC: rh64-coro3 (4231178432) - partition WITHOUT quorum > Version: 1.1.9-db294e1 > 3 Nodes configured, unknown expected votes > 0 Resources configured. > > > Node rh64-coro1 (4197624000): UNCLEAN (offline) > Node rh64-coro2 (4214401216): UNCLEAN (offline) > Online: [ rh64-coro3 ] > > > Node Attributes: > * Node rh64-coro3: > > Migration summary: > * Node rh64-coro3: > > > However, the node that rebooted does not recognize the state of one node > definitely. 
> > [root@rh64-coro1 ~]# crm_mon -1 -Af > Last updated: Tue Jun 4 22:33:31 2013 > Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1 > Stack: corosync > Current DC: rh64-coro1 (4197624000) - partition WITHOUT quorum > Version: 1.1.9-db294e1 > 3 Nodes configured, unknown expected votes > 0 Resources configured. > > > Node rh64-coro3 (4231178432): UNCLEAN (offline)> OKay. > Online: [ rh64-coro1 rh64-coro2 ] --> rh64-coro2 > NG. > > > Node Attributes: > * Node rh64-coro1: > * Node rh64-coro2: > > Migration summary: > * Node rh64-coro1: > * Node rh64-coro2: > > > It is right movement that recognize other nodes in a UNCLEAN state in the > node that rebooted, but seems to recognize it by mistake. > > It is like the problem of Pacemaker somehow or other. > * There seems to be the problem with crm_peer_cache hush table. > > Best Regards, > Hideo Yamauchi. > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > _
[Pacemaker] [Problem] The state of a node cut with the node that rebooted by a cluster is not recognized.
Hi All,

We confirmed how the cluster recognizes node states with the following procedure.
We confirmed it with the following combination (RHEL6.4 guests):
* corosync-2.3.0
* pacemaker-Pacemaker-1.1.10-rc3

-

Step 1) Start all nodes and form a cluster.

[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:30:25 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]

Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro1:
* Node rh64-coro3:
* Node rh64-coro2:

Step 2) Stop the first node.

[root@rh64-coro2 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:30:55 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Online: [ rh64-coro2 rh64-coro3 ]
OFFLINE: [ rh64-coro1 ]

Node Attributes:
* Node rh64-coro2:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro3:
* Node rh64-coro2:

Step 3) Restart the first node.

[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:31:29 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]

Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro1:
* Node rh64-coro3:
* Node rh64-coro2:

Step 4) Interrupt the interconnect of all nodes.

[root@kvm-host ~]# brctl delif virbr2 vnet1; brctl delif virbr2 vnet4; brctl delif virbr2 vnet7; brctl delif virbr3 vnet2; brctl delif virbr3 vnet5; brctl delif virbr3 vnet8

-

The two nodes that were not rebooted recognize the other nodes correctly:

[root@rh64-coro2 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:32:06 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro2 (4214401216) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Node rh64-coro1 (4197624000): UNCLEAN (offline)
Node rh64-coro3 (4231178432): UNCLEAN (offline)
Online: [ rh64-coro2 ]

Node Attributes:
* Node rh64-coro2:

Migration summary:
* Node rh64-coro2:

[root@rh64-coro3 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:33:17 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Node rh64-coro1 (4197624000): UNCLEAN (offline)
Node rh64-coro2 (4214401216): UNCLEAN (offline)
Online: [ rh64-coro3 ]

Node Attributes:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro3:

However, the node that was rebooted does not recognize the state of one node correctly:

[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:33:31 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro1 (4197624000) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Node rh64-coro3 (4231178432): UNCLEAN (offline)   --> Okay.
Online: [ rh64-coro1 rh64-coro2 ]                 --> rh64-coro2 is NG.

Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:

Migration summary:
* Node rh64-coro1:
* Node rh64-coro2:

The correct behavior is for the rebooted node to recognize the other nodes as being in the UNCLEAN state, but it seems to recognize rh64-coro2 incorrectly.
This looks like a problem in Pacemaker itself.
* The problem appears to be in the crm_peer_cache hash table.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, I registered a demand with Bugzilla. * http://bugs.clusterlabs.org/show_bug.cgi?id=5158 Many Thanks! Hideo Yamauchi. --- On Fri, 2013/5/24, renayama19661...@ybb.ne.jp wrote: > Hi Andrew, > > > > To Andrew : > > > If you make a patch removing a block of the file handling of pengine, I > > > confirm the movement. > > > If a problem is evaded without using tmpfs, many users welcome it. > > > > > > > You mean this patch? https://github.com/beekhof/pacemaker/commit/c7e10c6 > > Or another one? > > It is another one that I need. > A necessary thing is the structure which I can evade without our using tmpfs. > But it is our demand to the last because it is an environmental problem of > vSphere5. > > Best Regards, > Hideo Yamauchi. > > --- On Fri, 2013/5/24, Andrew Beekhof wrote: > > > > > On 24/05/2013, at 2:58 PM, renayama19661...@ybb.ne.jp wrote: > > > > > Hi Andrew, > > > Hi Vladislav, > > > > > >> We test movement when we located pe file in tmpfs repeatedly. > > >> It seems to move well for the moment. > > > > > > I only adopted tmpfs, and the I/O block of pengine was improved. > > > I confirm the synchronization with the fixed file, but think that there > > > is not the problem from now on. > > > It becomes one means to solve this problem to adopt tmpfs. > > > > > >> Pacemaker really isn't designed to function without disk access. > > >> You might be able to get away with it if you turn off saving PE files to > > >> disk though. > > > > > > To Andrew : > > > If you make a patch removing a block of the file handling of pengine, I > > > confirm the movement. > > > If a problem is evaded without using tmpfs, many users welcome it. > > > > > > > You mean this patch? https://github.com/beekhof/pacemaker/commit/c7e10c6 > > Or another one? > > > > > Best Regards, > > > Hideo Yamauchi. 
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, > > To Andrew : > > If you make a patch removing a block of the file handling of pengine, I > > confirm the movement. > > If a problem is evaded without using tmpfs, many users welcome it. > > You mean this patch? https://github.com/beekhof/pacemaker/commit/c7e10c6 > Or another one? It is a different one that I need: a mechanism that avoids the blocking without relying on tmpfs. In the end this is only a request, though, because the root cause is an environmental problem of vSphere5. Best Regards, Hideo Yamauchi.
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, Hi Vladislav, > We test movement when we located pe file in tmpfs repeatedly. > It seems to move well for the moment. Simply by adopting tmpfs, the I/O blocking of pengine was resolved. I still have to verify the synchronization back to stable storage, but I do not expect any further problems there. Adopting tmpfs is one way to solve this problem. > Pacemaker really isn't designed to function without disk access. > You might be able to get away with it if you turn off saving PE files to disk > though. To Andrew: If you make a patch that removes the blocking in pengine's file handling, I will test it. If the problem can be avoided without using tmpfs, many users will welcome it. Best Regards, Hideo Yamauchi.
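The workaround converged on in this exchange (keep pengine's output on tmpfs) can be sketched as a mount command. This is a hedged sketch only: the `/var/lib/pacemaker` path follows the suggestion made later in the thread, while the size and mode values are assumptions to adapt per installation.

```shell
# Sketch: commands for placing Pacemaker's state directory on tmpfs so
# pengine's write()/fsync() calls cannot block on a failed shared disk.
# Printed rather than executed here, since mounting requires root.
state_dir=/var/lib/pacemaker
mount_cmd="mount -t tmpfs -o size=64m,mode=0750 tmpfs $state_dir"
echo "$mount_cmd"
# Equivalent persistent entry for /etc/fstab:
echo "tmpfs  $state_dir  tmpfs  size=64m,mode=0750  0 0"
```

As Vladislav notes elsewhere in the thread, RAM is finite, so anything written there still has to be copied or moved to stable storage afterwards.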
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, Hi Vladislav, We have been testing repeatedly with the pe files placed on tmpfs. It seems to work well so far. I will verify the behaviour a little more, and then we plan to try the synchronization method that Mr. Vladislav uses. Best Regards, Hideo Yamauchi. --- On Wed, 2013/5/22, Andrew Beekhof wrote: > > On 17/05/2013, at 4:17 PM, Vladislav Bogdanov wrote: > > > P.S. Andrew, is this patch ok to apply? > > https://github.com/beekhof/pacemaker/commit/c7e10c6 :)
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Vladislav, > For just this, patch is unneeded. It only plays when you have that > pengine files symlinked from stable storage to tmpfs, Without patch, > pengine would try to rewrite file where symlink points it - directly on > a stable storage. With that patch, pengine will remove symlink (and just > symlink) and will open new file on tmpfs for writing. Thus, it will not > block if stable storage is inaccessible (for my case because of > connectivity problems, for yours - because of backing storage outage). > > If you decide to go with tmpfs *and* use the same synchronization method > as I do, then you'd need to bake the similar patch for 1.0, just add > unlink() before pengine writes its data (I suspect that code to differ > between 1.0 and 1.1.10, even in 1.1.6 it was different to current master). Thank you for the detailed explanation. For now, I will test with tmpfs only. Many Thanks! Hideo Yamauchi. --- On Fri, 2013/5/17, Vladislav Bogdanov wrote: > Hi Hideo-san, > > 17.05.2013 10:29, renayama19661...@ybb.ne.jp wrote: > > Hi Vladislav, > > > > Thank you for advice. > > > > I try the patch which you showed. > > > > We use Pacemaker1.0, but apply a patch there because there is a similar > > code. > > > > If there is a question by setting, I ask you a question by an email. > > * At first I only use tmpfs, and I intend to test it. > > For just this, patch is unneeded. It only plays when you have that > pengine files symlinked from stable storage to tmpfs, Without patch, > pengine would try to rewrite file where symlink points it - directly on > a stable storage. With that patch, pengine will remove symlink (and just > symlink) and will open new file on tmpfs for writing. Thus, it will not > block if stable storage is inaccessible (for my case because of > connectivity problems, for yours - because of backing storage outage).
> > If you decide to go with tmpfs *and* use the same synchronization method > as I do, then you'd need to bake the similar patch for 1.0, just add > unlink() before pengine writes its data (I suspect that code to differ > between 1.0 and 1.1.10, even in 1.1.6 it was different to current master). > > > > >> P.S. Andrew, is this patch ok to apply? > > > > To Andrew... > > Does the patch in conjunction with the write_xml processing in your > >repository have to apply it before the confirmation of the patch of > >Vladislav? > > > > Many Thanks! > > Hideo Yamauchi. > > > > > > > > > > --- On Fri, 2013/5/17, Vladislav Bogdanov wrote: > > > >> Hi Hideo-san, > >> > >> You may try the following patch (with trick below) > >> > >> From 2c4418d11c491658e33c149f63e6a2f2316ef310 Mon Sep 17 00:00:00 2001 > >> From: Vladislav Bogdanov > >> Date: Fri, 17 May 2013 05:58:34 + > >> Subject: [PATCH] Feature: PE: Unlink pengine output files before writing. > >> This should help guys who store them to tmpfs and then copy to a stable > >>storage > >> on (inotify) events with symlink creation in the original place to > >>survive when > >> stable storage is not accessible. > >> > >> --- > >> pengine/pengine.c | 1 + > >> 1 files changed, 1 insertions(+), 0 deletions(-) > >> > >> diff --git a/pengine/pengine.c b/pengine/pengine.c > >> index c7e1c68..99a81c6 100644 > >> --- a/pengine/pengine.c > >> +++ b/pengine/pengine.c > >> @@ -184,6 +184,7 @@ process_pe_message(xmlNode * msg, xmlNode * xml_data, > >> crm_client_t * sender) > >> } > >> > >> if (is_repoke == FALSE && series_wrap != 0) { > >> + unlink(filename); > >> write_xml_file(xml_data, filename, HAVE_BZLIB_H); > >> write_last_sequence(PE_STATE_DIR, series[series_id].name, seq > >>+ 1, series_wrap); > >> } else { > >> -- > >> 1.7.1 > >> > >> You just need to ensure that /var/lib/pacemaker is on tmpfs. 
Then you may > >> watch on directories there > >> with inotify or so and take actions to move (copy) files to a stable > >> storage (RAM is not of infinite size). > >> In my case that is CIFS. And I use lsyncd to synchronize that directories. > >> If you are interested, I can > >> provide you with relevant lsyncd configuration. Frankly speaking, three is > >> no big need to create symlinks > >> in tmpfs to stable storage, as pacemaker does not use existing pengine > >> files (except sequences). That sequence > >> files and cib.xml are the only exceptions which you may want to exist in > >> two places (and you may want to copy > >> them from stable storage to tmpfs before pacemaker start), and you can > >> just move everything else away from > >> tmpfs once it is written. In this case you do not need this patch. > >> > >> Best, > >> Vladislav > >> > >> P.S. Andrew, is this patch ok to apply? > >> > >> 17.05.2013 03:27, renayama19661...@ybb.ne.jp wrote: > >>> Hi Andrew, > >>> Hi Vladislav, > >>> > >>> I try whether this correction is effective for this problem. > >>> * > >>>https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e
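Vladislav's copy-versus-move scheme can be sketched without lsyncd. The directory and file names below are illustrative stand-ins (scratch directories instead of the real tmpfs and CIFS paths), not the actual pengine output names.

```shell
# Stand-in for the lsyncd-based sync described above: cib.xml and the
# sequence file are *copied* (they should exist in both places), while
# bulky pe-input files are *moved* off RAM so tmpfs is not exhausted.
ram=$(mktemp -d)      # stands in for the tmpfs state directory
stable=$(mktemp -d)   # stands in for CIFS / local stable storage

echo '<cib/>'  > "$ram/cib.xml"
echo 1         > "$ram/pe-input.last"
echo pe-data   > "$ram/pe-input-0.bz2"

cp "$ram/cib.xml" "$ram/pe-input.last" "$stable/"   # keep in both places
mv "$ram/pe-input-0.bz2" "$stable/"                 # move bulky data away
```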
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Vladislav, Thank you for advice. I try the patch which you showed. We use Pacemaker1.0, but apply a patch there because there is a similar code. If there is a question by setting, I ask you a question by an email. * At first I only use tmpfs, and I intend to test it. > P.S. Andrew, is this patch ok to apply? To Andrew... Does the patch in conjunction with the write_xml processing in your repository have to apply it before the confirmation of the patch of Vladislav? Many Thanks! Hideo Yamauchi. --- On Fri, 2013/5/17, Vladislav Bogdanov wrote: > Hi Hideo-san, > > You may try the following patch (with trick below) > > From 2c4418d11c491658e33c149f63e6a2f2316ef310 Mon Sep 17 00:00:00 2001 > From: Vladislav Bogdanov > Date: Fri, 17 May 2013 05:58:34 + > Subject: [PATCH] Feature: PE: Unlink pengine output files before writing. > This should help guys who store them to tmpfs and then copy to a stable > storage > on (inotify) events with symlink creation in the original place to survive > when > stable storage is not accessible. > > --- > pengine/pengine.c | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/pengine/pengine.c b/pengine/pengine.c > index c7e1c68..99a81c6 100644 > --- a/pengine/pengine.c > +++ b/pengine/pengine.c > @@ -184,6 +184,7 @@ process_pe_message(xmlNode * msg, xmlNode * xml_data, > crm_client_t * sender) > } > > if (is_repoke == FALSE && series_wrap != 0) { > + unlink(filename); > write_xml_file(xml_data, filename, HAVE_BZLIB_H); > write_last_sequence(PE_STATE_DIR, series[series_id].name, seq + > 1, series_wrap); > } else { > -- > 1.7.1 > > You just need to ensure that /var/lib/pacemaker is on tmpfs. Then you may > watch on directories there > with inotify or so and take actions to move (copy) files to a stable storage > (RAM is not of infinite size). > In my case that is CIFS. And I use lsyncd to synchronize that directories. If > you are interested, I can > provide you with relevant lsyncd configuration. 
Frankly speaking, there is no > big need to create symlinks > in tmpfs to stable storage, as pacemaker does not use existing pengine files > (except sequences). That sequence > files and cib.xml are the only exceptions which you may want to exist in two > places (and you may want to copy > them from stable storage to tmpfs before pacemaker start), and you can just > move everything else away from > tmpfs once it is written. In this case you do not need this patch. > > Best, > Vladislav > > P.S. Andrew, is this patch ok to apply?
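The effect of the one-line unlink() patch can be reproduced from the shell: writing through a symlink rewrites the target on stable storage (and blocks if that storage is gone), while unlinking first creates a fresh local file and leaves the target alone. The file names below are illustrative.

```shell
# Demonstration of why the patch calls unlink() before writing.
workdir=$(mktemp -d)
cd "$workdir"

echo old > target.txt        # stands in for a file on stable storage
ln -s target.txt link.txt    # symlink in the tmpfs directory

echo new > link.txt          # without the patch: the write follows the
cat target.txt               # symlink and rewrites the target -> "new"

echo old > target.txt        # reset the target
rm link.txt                  # with the patch: unlink removes only the
echo new > link.txt          # symlink, so a fresh regular file appears
cat target.txt               # target is untouched -> "old"
```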
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, Hi Vladislav, I try whether this correction is effective for this problem. * https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e65dadf1c626e050a388c59 Best Regards, Hideo Yamauchi. --- On Thu, 2013/5/16, Andrew Beekhof wrote: > > On 16/05/2013, at 3:49 PM, Vladislav Bogdanov wrote: > > > 16.05.2013 02:46, Andrew Beekhof wrote: > >> > >> On 15/05/2013, at 6:44 PM, Vladislav Bogdanov wrote: > >> > >>> 15.05.2013 11:18, Andrew Beekhof wrote: > > On 15/05/2013, at 5:31 PM, Vladislav Bogdanov > wrote: > > > 15.05.2013 10:25, Andrew Beekhof wrote: > >> > >> On 15/05/2013, at 3:50 PM, Vladislav Bogdanov > >> wrote: > >> > >>> 15.05.2013 08:23, Andrew Beekhof wrote: > > On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comments. > > > >>> The guest located it to the shared disk. > >> > >> What is on the shared disk? The whole OS or app-specific data > >> (i.e. nothing pacemaker needs directly)? > > > > Shared disk has all the OS and the all data. > > Oh. I can imagine that being problematic. > Pacemaker really isn't designed to function without disk access. > > You might be able to get away with it if you turn off saving PE > files to disk though. > >>> > >>> I store CIB and PE files to tmpfs, and sync them to remote storage > >>> (CIFS) with lsyncd level 1 config (I may share it on request). It > >>> copies > >>> critical data like cib.xml, and moves everything else, symlinking it > >>> to > >>> original place. The same technique may apply here, but with local fs > >>> instead of cifs. 
> >>> > >>> Btw, the following patch is needed for that, otherwise pacemaker > >>> overwrites remote files instead of creating new ones on tmpfs: > >>> > >>> --- a/lib/common/xml.c 2011-02-11 11:42:37.0 +0100 > >>> +++ b/lib/common/xml.c 2011-02-24 15:07:48.541870829 +0100 > >>> @@ -529,6 +529,8 @@ write_file(const char *string, const char > >>> *filename) > >>> return -1; > >>> } > >>> > >>> + unlink(filename); > >> > >> Seems like it should be safe to include for normal operation. > > > > Exactly. > > Small flaw in that logic... write_file() is not used anywhere. > >>> > >>> Heh, thanks for spotting this. > >>> > >>> I recall write_file() was used for pengine, but some other function for > >>> CIB. You probably optimized that but forgot to remove unused function, > >>> that's why I was sure patch is still valid. And I did tests (CIFS > >>> storage outage simulation) only after initial patch, but not last years, > >>> that's why I didn't notice the regression - storage uses pacemaker too ;) > >>> . > >>> > >>> This should go to write_xml_file() (And probably to other places just > >>> before fopen(..., "w"), f.e. series). > >> > >> I've consolidated the code, however adding the unlink() would break things > >> for anyone intentionally symlinking cib.xml from somewhere else (like a > >> git repo). > >> So I'm not so sure I should make the unlink() change :( > > > > Agree. > > I originally made it specific to pengine files. > > What do you prefer, simple wrapper in xml.c (f.e. > > unlink_and_write_xml_file()) or just add unlink() call to pengine before > > it calls write_xml_file()? 
> > The last one :)
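Andrew's objection (that an unconditional unlink() in the shared write path would break setups that deliberately symlink cib.xml from somewhere else, such as a git checkout) is easy to see from the shell; `repo/` below is a hypothetical checkout directory.

```shell
# Why unlink() went into pengine only, not the generic XML write path:
# if cib.xml is intentionally a symlink into e.g. a git checkout,
# unlinking before every write silently detaches it from the checkout.
workdir=$(mktemp -d)
cd "$workdir"
mkdir repo
echo v1 > repo/cib.xml
ln -s repo/cib.xml cib.xml   # the admin's deliberate symlink

rm cib.xml                   # what an unconditional unlink would do
echo v2 > cib.xml            # a new plain file in the daemon's directory
cat repo/cib.xml             # checkout copy is now stale -> "v1"
```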
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, > > Thank you for comments. > > > >>> The guest located it to the shared disk. > >> > >> What is on the shared disk? The whole OS or app-specific data (i.e. > >> nothing pacemaker needs directly)? > > > > Shared disk has all the OS and the all data. > > Oh. I can imagine that being problematic. > Pacemaker really isn't designed to function without disk access. I think so, too. That is why I made the following suggestion earlier: > >>> For example... > >>> 1. crmd watches a request to pengine with a timer... > >>> 2. pengine writes in it with a timer and watches processing > >>> ..etc... But, there may be a better method. > You might be able to get away with it if you turn off saving PE files to disk > though. > > The placement of this shared disk is similar in KVM where the problem does > > not occur. > That it works in KVM in this situation is kind of surprising. > Or perhaps I misunderstand. I will check the details of the behaviour on KVM once again. However, the behaviour on KVM is clearly different from the behaviour on vSphere5.1. Best Regards, Hideo Yamauchi.
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, Thank you for comments. > > The guest located it to the shared disk. > > What is on the shared disk? The whole OS or app-specific data (i.e. nothing > pacemaker needs directly)? The shared disk holds the whole OS and all of the data. The placement of this shared disk is the same as in KVM, where the problem does not occur. * We understand that the behaviour differs depending on the hypervisor. * However, it seems necessary to avoid this problem in order to use Pacemaker in a vSphere5.1 environment. Best Regards, Hideo Yamauchi.
[Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi All,

We built a simple cluster in a vSphere 5.1 environment, composed of two
ESXi servers and a shared disk. The guest VMs are placed on the shared
disk.

Step 1) Build the cluster. (The DC node is the active node.)

Last updated: Mon May 13 14:16:09 2013
Stack: Heartbeat
Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
Version: 1.0.13-30bb726
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ pgsr01 pgsr02 ]

 Resource Group: test-group
     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
 Clone Set: clnPingd
     Started: [ pgsr01 pgsr02 ]

Node Attributes:
* Node pgsr01:
    + default_ping_set : 100
* Node pgsr02:
    + default_ping_set : 100

Migration summary:
* Node pgsr01:
* Node pgsr02:

Step 2) Attach strace to the pengine process on the DC node.

[root@pgsr01 ~]# ps -ef | grep heartbeat
root      2072     1  0 13:56 ?        00:00:00 heartbeat: master control process
root      2075  2072  0 13:56 ?        00:00:00 heartbeat: FIFO reader
root      2076  2072  0 13:56 ?        00:00:00 heartbeat: write: bcast eth1
root      2077  2072  0 13:56 ?        00:00:00 heartbeat: read: bcast eth1
root      2078  2072  0 13:56 ?        00:00:00 heartbeat: write: bcast eth2
root      2079  2072  0 13:56 ?        00:00:00 heartbeat: read: bcast eth2
496       2082  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/ccm
496       2083  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/cib
root      2084  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/lrmd -r
root      2085  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/stonithd
496       2086  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/attrd
496       2087  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/crmd
496       2089  2087  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/pengine
root      2182     1  0 14:15 ?        00:00:00 /usr/lib64/heartbeat/pingd -D -p /var/run//pingd-default_ping_set -a default_ping_set -d 5s -m 100 -i 1 -h 192.168.101.254
root      2287  1973  0 14:16 pts/0    00:00:00 grep heartbea

[root@pgsr01 ~]# strace -p 2089
Process 2089 attached - interrupt to quit
restart_syscall(<...
resuming interrupted call ...>) = 0
times({tms_utime=5, tms_stime=6, tms_cutime=0, tms_cstime=0}) = 429527557
recvfrom(5, 0xa93ff7, 953, 64, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=5, events=0}], 1, 0)          = 0 (Timeout)
recvfrom(5, 0xa93ff7, 953, 64, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=5, events=0}], 1, 0)          = 0 (Timeout)
(snip)

Step 3) Disconnect the shared disk on which the active node is placed.

Step 4) Cut off the pingd communication of the standby node. The pingd
score is updated in the CIB as expected, but the pengine is blocked on
I/O.

~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1

(snip)
brk(0xd05000)                           = 0xd05000
brk(0xeed000)                           = 0xeed000
brk(0xf2d000)                           = 0xf2d000
fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86a255a000
write(6, "BZh51AY&SY\327\373\370\203\0\t(_\200UPX\3\377\377%cT \277\377\377"..., 2243) = 2243
brk(0xb1d000)                           = 0xb1d000
fsync(6    --> BLOCKED
(snip)

Last updated: Mon May 13 14:19:15 2013
Stack: Heartbeat
Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
Version: 1.0.13-30bb726
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ pgsr01 pgsr02 ]

 Resource Group: test-group
     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
 Clone Set: clnPingd
     Started: [ pgsr01 pgsr02 ]

Node Attributes:
* Node pgsr01:
    + default_ping_set : 100
* Node pgsr02:
    + default_ping_set : 0 : Connectivity is lost

Migration summary:
* Node pgsr01:
* Node pgsr02:

Step 5) Reconnect the pingd communication of the standby node. The pingd
score is updated as expected, but the pengine is still blocked on I/O.
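Two observations make this kind of hang easy to confirm (a sketch, not
from the original post; `/tmp/magic-demo` is an illustrative file, not a
real pe-input file): the "BZh" prefix in the traced write() is the bzip2
magic number, suggesting the pengine is writing one of its
bzip2-compressed policy-engine files and then fsync()ing it; and a
process blocked in fsync() on an unreachable disk sits in uninterruptible
sleep, which procps `ps` reports as state "D".

```shell
# 1) "BZh" is the bzip2 magic number, as seen at the start of the
#    traced write() above. Demonstrate with a throwaway file:
printf 'BZh91AY&SY' > /tmp/magic-demo
head -c 3 /tmp/magic-demo; echo      # prints: BZh

# 2) A process stuck in fsync() on a dead disk is in uninterruptible
#    sleep; list D-state processes without attaching strace
#    (assumes a Linux procps-style ps):
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```

On a node in the state described above, the second command would show the
pengine process with a "D" in the STAT column for as long as the fsync()
does not return.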
~ # esxcfg-vswitch -M vmnic1 -p "ap-db" vSwitch1
~ # esxcfg-vswitch -M vmnic2 -p "ap-db" vSwitch1

Last updated: Mon May 13 14:19:40 2013
Stack: Heartbeat
Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
Version: 1.0.13-30bb726
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ pgsr01 pgsr02 ]

 Resource Group: test-group
     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
 Clone Set: clnPingd Sta