2014-03-18 8:03 GMT+09:00 David Vossel <dvos...@redhat.com>:
>
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Monday, March 17, 2014 4:51:11 AM
>> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>>
>> 2014-03-17 16:37 GMT+09:00 Kazunori INOUE <kazunori.ino...@gmail.com>:
>> > 2014-03-15 4:08 GMT+09:00 David Vossel <dvos...@redhat.com>:
>> >>
>> >>
>> >> ----- Original Message -----
>> >>> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
>> >>> To: "pm" <pacemaker@oss.clusterlabs.org>
>> >>> Sent: Friday, March 14, 2014 5:52:38 AM
>> >>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>> >>>
>> >>> Hi,
>> >>>
>> >>> When the node name was specified in UPPER case while running
>> >>> crm_resource, crmd was aborted.
>> >>> (The real node name is in LOWER case.)
>> >>
>> >> https://github.com/ClusterLabs/pacemaker/pull/462
>> >>
>> >> does that fix it?
>> >>
>> >
>> > Since the behavior of glib is somehow strange, the result is NO.
>> > I tested this branch:
>> > https://github.com/davidvossel/pacemaker/tree/lrm-segfault
>> > * Red Hat Enterprise Linux Server release 6.4 (Santiago)
>> > * glib2-2.22.5-7.el6.x86_64
>> >
>> > strcase_equal() is not called from g_hash_table_lookup().
>> >
>> > [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
>> > ...snip...
>> > (gdb) b lrm.c:1232
>> > Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
>> > (gdb) b strcase_equal
>> > Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
>> > (gdb) c
>> > Continuing.
>> >
>> > Breakpoint 1, do_lrm_invoke (action=288230376151711744,
>> >     cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
>> >     msg_data=0x7fff8d679540) at lrm.c:1232
>> > 1232        lrm_state = lrm_state_find(target_node);
>> > (gdb) s
>> > lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
>> > 267     {
>> > (gdb) n
>> > 268         if (!node_name) {
>> > (gdb) n
>> > 271         return g_hash_table_lookup(lrm_state_table, node_name);
>> > (gdb) p g_hash_table_size(lrm_state_table)
>> > $1 = 1
>> > (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
>> > $2 = 0x1c791a0 "x3650h"
>> > (gdb) p node_name
>> > $3 = 0x1d4c650 "X3650H"
>> > (gdb) n
>> > 272     }
>> > (gdb) n
>> > do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
>> >     cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
>> >     at lrm.c:1234
>> > 1234        if (lrm_state == NULL && is_remote_node) {
>> > (gdb) n
>> > 1240        CRM_ASSERT(lrm_state != NULL);
>> > (gdb) n
>> >
>> > Program received signal SIGABRT, Aborted.
>> > 0x0000003787e328a5 in raise () from /lib64/libc.so.6
>> > (gdb)
>> >
>> >
>> > I wonder why... so I will continue investigating.
>> >
>> >
>>
>> I read the code of g_hash_table_lookup().
>> The key is compared by the hash value generated by crm_str_hash before
>> strcase_equal() is ever called.
>
> good catch. I've updated the patch in this pull request. Can you give it a go?
>
> https://github.com/ClusterLabs/pacemaker/pull/462
>

With only this change, fail-count is still not cleared.
$ crm_resource -C -r p1 -N X3650H
Cleaning up p1 on X3650H
Waiting for 1 replies from the CRMd. OK
$ grep fail-count /var/log/ha-log
Mar 18 13:53:36 x3650g attrd[3610]: debug: attrd_client_message: Broadcasting fail-count-p1[X3650H] = (null)
$
$ crm_mon -rf1
Last updated: Tue Mar 18 13:54:51 2014
Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h
Stack: corosync
Current DC: x3650h (3232261384) - partition with quorum
Version: 1.1.10-83553fa
2 Nodes configured
1 Resources configured

Online: [ x3650g x3650h ]

Full list of resources:

p1      (ocf::pacemaker:Dummy): Stopped

Migration summary:
* Node x3650h:
   p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18 13:53:19 2014'
* Node x3650g:
$

So this change also seems to be necessary.

$ git diff --patch-with-stat attrd/commands.c
 attrd/commands.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/attrd/commands.c b/attrd/commands.c
index 985f90c..9f46d92 100644
--- a/attrd/commands.c
+++ b/attrd/commands.c
@@ -145,7 +145,7 @@ create_attribute(xmlNode *xml)
     a->id = crm_element_value_copy(xml, F_ATTRD_ATTRIBUTE);
     a->set = crm_element_value_copy(xml, F_ATTRD_SET);
     a->uuid = crm_element_value_copy(xml, F_ATTRD_KEY);
-    a->values = g_hash_table_new_full(crm_str_hash, g_str_equal, NULL, free_attribute_value);
+    a->values = g_hash_table_new_full(crm_strcase_hash, crm_strcase_equal, NULL, free_attribute_value);

 #if ENABLE_ACL
     crm_trace("Performing all %s operations as user '%s'", a->id, a->user);
$

The result is as follows.
$ grep fail-count /var/log/ha-log
Mar 18 13:57:31 x3650g attrd[6688]: debug: attrd_client_message: Broadcasting fail-count-p1[X3650H] = (null) (writer)
Mar 18 13:57:31 x3650g attrd[6688]: info: attrd_peer_update: Setting fail-count-p1[X3650H]: 1 -> (null) from x3650g
Mar 18 13:57:31 x3650g attrd[6688]: debug: write_attribute: Update: x3650h[fail-count-p1]=(null) (3232261384 3232261384 3232261384 x3650h)
Mar 18 13:57:31 x3650g attrd[6688]: notice: write_attribute: Sent update 14 with 1 changes for fail-count-p1, id=<n/a>, set=(null)
Mar 18 13:57:31 x3650h attrd[20902]: info: attrd_peer_update: Setting fail-count-p1[X3650H]: 1 -> (null) from x3650g
Mar 18 13:57:31 x3650g cib[6685]: info: cib_perform_op: -- /cib/status/node_state[@id='3232261384']/transient_attributes[@id='3232261384']/instance_attributes[@id='status-3232261384']/nvpair[@id='status-3232261384-fail-count-p1']
Mar 18 13:57:31 x3650g attrd[6688]: info: attrd_cib_callback: Update 14 for fail-count-p1: OK (0)
Mar 18 13:57:31 x3650g attrd[6688]: notice: attrd_cib_callback: Update 14 for fail-count-p1[x3650h]=(null): OK (0)
Mar 18 13:57:31 x3650h cib[20899]: info: cib_perform_op: -- /cib/status/node_state[@id='3232261384']/transient_attributes[@id='3232261384']/instance_attributes[@id='status-3232261384']/nvpair[@id='status-3232261384-fail-count-p1']
Mar 18 13:57:31 x3650h crmd[20904]: info: abort_transition_graph: Transition aborted by deletion of nvpair[@id='status-3232261384-fail-count-p1']: Transient attribute change (cib=0.3.18, source=te_update_diff:388, path=/cib/status/node_state[@id='3232261384']/transient_attributes[@id='3232261384']/instance_attributes[@id='status-3232261384']/nvpair[@id='status-3232261384-fail-count-p1'], 1)

>
>>
>> *** This is a quick-fix solution. ***
>>
>>  crmd/lrm_state.c   |    4 ++--
>>  include/crm/crm.h  |    2 ++
>>  lib/common/utils.c |   11 +++++++++++
>>  3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/crmd/lrm_state.c b/crmd/lrm_state.c
>> index d20d74a..ae036fd 100644
>> --- a/crmd/lrm_state.c
>> +++ b/crmd/lrm_state.c
>> @@ -234,13 +234,13 @@ lrm_state_init_local(void)
>>      }
>>
>>      lrm_state_table =
>> -        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, internal_lrm_state_destroy);
>> +        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, internal_lrm_state_destroy);
>>      if (!lrm_state_table) {
>>          return FALSE;
>>      }
>>
>>      proxy_table =
>> -        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, remote_proxy_free);
>> +        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, remote_proxy_free);
>>      if (!proxy_table) {
>>          g_hash_table_destroy(lrm_state_table);
>>          return FALSE;
>> diff --git a/include/crm/crm.h b/include/crm/crm.h
>> index b763cc0..46fe5df 100644
>> --- a/include/crm/crm.h
>> +++ b/include/crm/crm.h
>> @@ -195,7 +195,9 @@ typedef GList *GListPtr;
>>  # include <crm/error.h>
>>
>>  # define crm_str_hash g_str_hash_traditional
>> +# define crm_str_hash2 g_str_hash_traditional2
>>
>>  guint g_str_hash_traditional(gconstpointer v);
>> +guint g_str_hash_traditional2(gconstpointer v);
>>
>>  #endif
>> diff --git a/lib/common/utils.c b/lib/common/utils.c
>> index 29d7965..50fa6c0 100644
>> --- a/lib/common/utils.c
>> +++ b/lib/common/utils.c
>> @@ -2368,6 +2368,17 @@ g_str_hash_traditional(gconstpointer v)
>>
>>      return h;
>>  }
>> +guint
>> +g_str_hash_traditional2(gconstpointer v)
>> +{
>> +    const signed char *p;
>> +    guint32 h = 0;
>> +
>> +    for (p = v; *p != '\0'; p++)
>> +        h = (h << 5) - h + g_ascii_tolower(*p);
>> +
>> +    return h;
>> +}
>>
>>  void *
>>  find_library_function(void **handle, const char *lib, const char *fn, gboolean fatal)
>>
>>
>>
>> >>> # crm_resource -C -r p1 -N X3650H
>> >>> Cleaning up p1 on X3650H
>> >>> Waiting for 1 replies from the CRMd
>> >>> No messages received in 60 seconds.. aborting
>> >>>
>> >>> Mar 14 18:33:10 x3650h crmd[10718]: error: crm_abort: do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state != NULL
>> >>> ...snip...
>> >>> Mar 14 18:33:10 x3650h pacemakerd[10708]: error: child_waitpid: Managed process 10718 (crmd) dumped core
>> >>>
>> >>>
>> >>> * The state before performing crm_resource.
>> >>> ----
>> >>> Stack: corosync
>> >>> Current DC: x3650g (3232261383) - partition with quorum
>> >>> Version: 1.1.10-38c5972
>> >>> 2 Nodes configured
>> >>> 3 Resources configured
>> >>>
>> >>>
>> >>> Online: [ x3650g x3650h ]
>> >>>
>> >>> Full list of resources:
>> >>>
>> >>> f-g     (stonith:external/ibmrsa-telnet):       Started x3650h
>> >>> f-h     (stonith:external/ibmrsa-telnet):       Started x3650g
>> >>> p1      (ocf::pacemaker:Dummy): Stopped
>> >>>
>> >>> Migration summary:
>> >>> * Node x3650g:
>> >>> * Node x3650h:
>> >>>    p1: migration-threshold=1 fail-count=1 last-failure='Fri Mar 14 18:32:48 2014'
>> >>>
>> >>> Failed actions:
>> >>>     p1_monitor_10000 on x3650h 'not running' (7): call=16, status=complete, last-rc-change='Fri Mar 14 18:32:48 2014', queued=0ms, exec=0ms
>> >>> ----
>> >>>
>> >>> Just for reference, a similar phenomenon did not occur with crm_standby.
>> >>> $ crm_standby -U X3650H -v on
>> >>>
>> >>>
>> >>> Best Regards,
>> >>> Kazunori INOUE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org