Re: [Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
On Fri, Aug 6, 2010 at 4:36 AM, nozawat noza...@gmail.com wrote: Hi Andrew, Is this data all from the CIB process? What was the load from the other processes like? (Ie. using top) The cib process was approximately 100% CPU rates of use, but most of the other processes did not use a CPU when they confirmed it in top. The number of the cores of the CPU assigns 2. Is there the method to make these processing early? You mean faster? Uh...For example, I thought in a parameter like batch-limit.. Because I feel that the handling of XML is not so early... My brain might be a little bit fuzzy from the vacation, but I'm still not 100% sure what you mean here... Could you attach the CIB your cluster was using so I can have a go? I'm sorry.cib.xml was not able to send a e-mail by the limit of the attached file. I compress a file and put it in Bugzilla. - cib.xml :http://developerbugs.linux-foundation.org/attachment.cgi?id=1684 - systemtap:http://developerbugs.linux-foundation.org/attachment.cgi?id=1685 Great. I'll follow up there once I finish making my way through my email backlog. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
2010/8/5 nozawat noza...@gmail.com: Hi Andrew, I took oprofile of 120 seconds before Time-out occurs. However, sorted_xml function seems not to be called it so much as far as I watch log of Oprofile. I suspect you're misinterpreting the results. The number of calls isn't nearly as important as how much time is spent in that function. Could you attach the CIB your cluster was using so I can have a go? I took Oprofile by a sample program. But, the contents of Oprofile did not agree with the contents of the sample program. Therefore I used systemtap because I was not able to do internal bottleneck analysis from information of this Oprofile. Ah, I wondered where you go the data from. Could you share how you used systemtap for this? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
Hi Andrew, Is this data all from the CIB process? What was the load from the other processes like? (Ie. using top) The cib process was approximately 100% CPU rates of use, but most of the other processes did not use a CPU when they confirmed it in top. The number of the cores of the CPU assigns 2. Is there the method to make these processing early? You mean faster? Uh...For example, I thought in a parameter like batch-limit.. Because I feel that the handling of XML is not so early... Could you attach the CIB your cluster was using so I can have a go? I'm sorry.cib.xml was not able to send a e-mail by the limit of the attached file. I compress a file and put it in Bugzilla. - cib.xml :http://developerbugs.linux-foundation.org/attachment.cgi?id=1684 - systemtap:http://developerbugs.linux-foundation.org/attachment.cgi?id=1685 Ah, I wondered where you go the data from. Could you share how you used systemtap for this? step 1) I divide a crm file at first to build a cluster. - The attached file is cib_14+2_functest01.crm and cib_14+2_functest02.crm. - Because I become the error by construction with one file. step 2) After construction completion, I let a DC node break down. step 3) I started systemtap from the 1,266th line of the log file in 120 seconds of the 1,267th line and acquired information. - The attached file is ha.log.This file is ha.log of the new DC node. note ) A systemtap use procedure 1)stap -x PID functime.stp 2)stop is Ctl+C -- Regards, Tomo 2010/8/5 Andrew Beekhof and...@beekhof.net 2010/8/5 nozawat noza...@gmail.com: Hi Andrew, I took oprofile of 120 seconds before Time-out occurs. However, sorted_xml function seems not to be called it so much as far as I watch log of Oprofile. I suspect you're misinterpreting the results. The number of calls isn't nearly as important as how much time is spent in that function. Could you attach the CIB your cluster was using so I can have a go? I took Oprofile by a sample program. But, the contents of Oprofile did not agree with the contents of the sample program. Therefore I used systemtap because I was not able to do internal bottleneck analysis from information of this Oprofile. Ah, I wondered where you go the data from. Could you share how you used systemtap for this? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker -- Regards, Tomo ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
2010/8/3 nozawat noza...@gmail.com: Hi Andrew, I changed cluster option to batch-limit=3,I re-tried it. However, similar time-out occurs. I measured processing just before the time-out(120s) in systemtap. The following only the function long time. - probe start! - cib_process_request [call-count:179][117,540,173,155 nsec] cib_process_command [call:179] [116,471,047,275 nsec] cib_process_command call function --- cib_config_changed [call:179] [101,169,909,572 nsec] cib_config_changed call function --- calculate_xml_digest [call:179] [ 68,820,560,745 nsec] create_xml_node [call:3012263] [ 19,855,469,976 nsec]※ But, but, create_xml_node() barely does _anything_. It's hard to imagine how this could take 19s: xmlNode* create_xml_node(xmlNode *parent, const char *name) { xmlDoc *doc = NULL; xmlNode *node = NULL; if (name == NULL || name[0] == 0) { return NULL; } if(parent == NULL) { doc = xmlNewDoc((const xmlChar*)1.0); node = xmlNewDocRawNode(doc, NULL, (const xmlChar*)name, NULL); xmlDocSetRootElement(doc, node); } else { doc = getDocPtr(parent); node = xmlNewDocRawNode(doc, NULL, (const xmlChar*)name, NULL); xmlAddChild(parent, node); } return node; } Is this data all from the CIB process? What was the load from the other processes like? (Ie. using top) xpath_search [call:179] [ 145,030,232 nsec] diff_xml_object [call:179] [ 32,677,359,476 nsec]※ calculate_xml_digest call function --- sorted_xml [call:1505799] [ 52,512,465,838 nsec]※ Yikes, these are also really high. Could you send me the CIB so that we can profile these operations on it? copy_xml [call:179] [ 3,692,232,073 nsec] dump_xml [call:536] [ 6,177,606,232 nsec] - Is there the method to make these processing early? You mean faster? Maybe oprofile can suggest something. 2010/6/14 renayama19661...@ybb.ne.jp Hi Andrew, Thank you for comment. More likely of the underlying messaging infrastructure, but I'll take a look. Perhaps the default cib operation timeouts are too low for larger clusters. The log attached it to next Bugzilla. #65533;* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443 Ok, I'll follow up there. If it is necessary for us to work for the solution of the problem, please order it. Best Regards, Hideo Yamauchi. --- Andrew Beekhof and...@beekhof.net wrote: On Mon, Jun 14, 2010 at 4:46 AM, renayama19661...@ybb.ne.jp wrote: We tested 16 node constitution (15+1). We carried out the next procedure. Step1) Start 16 nodes. Step2) Send cib after a DC node was decided. An error occurs by the update of the attribute of pingd after Probe processing was over. Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying... Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 337 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 340 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 343 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 346 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 349 for default_ping_set=1600 failed: Remote node did not respond In the middle of this error, I carried out a cibadmin(-Q optin) command, but time-out occurred. In addition, cib of the DC node seemed to move by the top command very busily. In addition, a communication error with cib occurs in the DC node, and crmd reboots. Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message received in the required interval (120s) Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update: Sent update -41: default_ping_set=1600 (snip) Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped (2) Jun 14 10:59:07 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:858 info: pcmk_ipc_exit: Client crmd (conn=0x106a2bf0, async-conn=0x106a2bf0) left Jun 14 10:59:08 hb0101
Re: [Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
Hi Andrew, I changed cluster option to batch-limit=3,I re-tried it. However, similar time-out occurs. I measured processing just before the time-out(120s) in systemtap. The following only the function long time. - probe start! - cib_process_request [call-count:179][117,540,173,155 nsec] cib_process_command [call:179] [116,471,047,275 nsec] cib_process_command call function --- cib_config_changed [call:179] [101,169,909,572 nsec] cib_config_changed call function --- calculate_xml_digest [call:179] [ 68,820,560,745 nsec] create_xml_node [call:3012263] [ 19,855,469,976 nsec]※ xpath_search [call:179] [145,030,232 nsec] diff_xml_object [call:179] [ 32,677,359,476 nsec]※ calculate_xml_digest call function --- sorted_xml [call:1505799] [ 52,512,465,838 nsec]※ copy_xml [call:179] [ 3,692,232,073 nsec] dump_xml [call:536] [ 6,177,606,232 nsec] - Is there the method to make these processing early? 2010/6/14 renayama19661...@ybb.ne.jp Hi Andrew, Thank you for comment. More likely of the underlying messaging infrastructure, but I'll take a look. Perhaps the default cib operation timeouts are too low for larger clusters. The log attached it to next Bugzilla. #65533;* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443 Ok, I'll follow up there. If it is necessary for us to work for the solution of the problem, please order it. Best Regards, Hideo Yamauchi. --- Andrew Beekhof and...@beekhof.net wrote: On Mon, Jun 14, 2010 at 4:46 AM, renayama19661...@ybb.ne.jp wrote: We tested 16 node constitution (15+1). We carried out the next procedure. Step1) Start 16 nodes. Step2) Send cib after a DC node was decided. An error occurs by the update of the attribute of pingd after Probe processing was over. Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying... Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 337 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 340 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 343 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 346 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 349 for default_ping_set=1600 failed: Remote node did not respond In the middle of this error, I carried out a cibadmin(-Q optin) command, but time-out occurred. In addition, cib of the DC node seemed to move by the top command very busily. In addition, a communication error with cib occurs in the DC node, and crmd reboots. Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message received in the required interval (120s) Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update: Sent update -41: default_ping_set=1600 (snip) Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped (2) Jun 14 10:59:07 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:858 info: pcmk_ipc_exit: Client crmd (conn=0x106a2bf0, async-conn=0x106a2bf0) left Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:481 ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=2280, rc=2) Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:498 notice: pcmk_wait_dispatch: Respawning failed child process: crmd Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] utils.c:131 info: spawn_child: Forked child 2680 for process crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: Invoked: /usr/lib64/heartbeat/crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: main: CRM Hg Version: 9f04fa88cfd3da553e977cc79983d1c494c8b502 Jun 14 10:59:08 hb0101 crmd: [2680]: info: crmd_init: Starting crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: G_main_add_SignalHandler: Added signal handler for signal 17 There seems to be a problem in cib of the DC node somehow or other. We hope
Re: [Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
Hi Andrew, I made the patch that extended an internal time-out as one of the optional clusters. There is a possibility that the necessity for extending this argument value comes out when there are a lot of numbers of nodes. Default is 120s according to the current. Regards, Tomo 2010年8月3日12:12 nozawat noza...@gmail.com: Hi Andrew, I changed cluster option to batch-limit=3,I re-tried it. However, similar time-out occurs. I measured processing just before the time-out(120s) in systemtap. The following only the function long time. - probe start! - cib_process_request [call-count:179][117,540,173,155 nsec] cib_process_command [call:179] [116,471,047,275 nsec] cib_process_command call function --- cib_config_changed [call:179] [101,169,909,572 nsec] cib_config_changed call function --- calculate_xml_digest [call:179] [ 68,820,560,745 nsec] create_xml_node [call:3012263] [ 19,855,469,976 nsec]※ xpath_search [call:179] [145,030,232 nsec] diff_xml_object [call:179] [ 32,677,359,476 nsec]※ calculate_xml_digest call function --- sorted_xml [call:1505799] [ 52,512,465,838 nsec]※ copy_xml [call:179] [ 3,692,232,073 nsec] dump_xml [call:536] [ 6,177,606,232 nsec] - Is there the method to make these processing early? 2010/6/14 renayama19661...@ybb.ne.jp Hi Andrew, Thank you for comment. More likely of the underlying messaging infrastructure, but I'll take a look. Perhaps the default cib operation timeouts are too low for larger clusters. The log attached it to next Bugzilla. #65533;* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443 Ok, I'll follow up there. If it is necessary for us to work for the solution of the problem, please order it. Best Regards, Hideo Yamauchi. --- Andrew Beekhof and...@beekhof.net wrote: On Mon, Jun 14, 2010 at 4:46 AM, renayama19661...@ybb.ne.jp wrote: We tested 16 node constitution (15+1). We carried out the next procedure. Step1) Start 16 nodes. Step2) Send cib after a DC node was decided. An error occurs by the update of the attribute of pingd after Probe processing was over. Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying... Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 337 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 340 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 343 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 346 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 349 for default_ping_set=1600 failed: Remote node did not respond In the middle of this error, I carried out a cibadmin(-Q optin) command, but time-out occurred. In addition, cib of the DC node seemed to move by the top command very busily. In addition, a communication error with cib occurs in the DC node, and crmd reboots. Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message received in the required interval (120s) Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update: Sent update -41: default_ping_set=1600 (snip) Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped (2) Jun 14 10:59:07 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:858 info: pcmk_ipc_exit: Client crmd (conn=0x106a2bf0, async-conn=0x106a2bf0) left Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:481 ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=2280, rc=2) Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:498 notice: pcmk_wait_dispatch: Respawning failed child process: crmd Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] utils.c:131 info: spawn_child: Forked child 2680 for process crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: Invoked: /usr/lib64/heartbeat/crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: main: CRM Hg Version: 9f04fa88cfd3da553e977cc79983d1c494c8b502 Jun 14 10:59:08 hb0101 crmd: [2680]: info: crmd_init:
Re: [Pacemaker] [Problem]Cib cannot update an attribute by 16 node constitution.
Hi Andrew, Thank you for comment. More likely of the underlying messaging infrastructure, but I'll take a look. Perhaps the default cib operation timeouts are too low for larger clusters. The log attached it to next Bugzilla. #65533;* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443 Ok, I'll follow up there. If it is necessary for us to work for the solution of the problem, please order it. Best Regards, Hideo Yamauchi. --- Andrew Beekhof and...@beekhof.net wrote: On Mon, Jun 14, 2010 at 4:46 AM, renayama19661...@ybb.ne.jp wrote: We tested 16 node constitution (15+1). We carried out the next procedure. Step1) Start 16 nodes. Step2) Send cib after a DC node was decided. An error occurs by the update of the attribute of pingd after Probe processing was over. Jun 14 10:58:03 hb0102 pingd: [2465]: info: ping_read: Retrying... Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 337 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 340 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 343 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 346 for default_ping_set=1600 failed: Remote node did not respond Jun 14 10:58:13 hb0102 attrd: [2155]: WARN: attrd_cib_callback: Update 349 for default_ping_set=1600 failed: Remote node did not respond In the middle of this error, I carried out a cibadmin(-Q optin) command, but time-out occurred. In addition, cib of the DC node seemed to move by the top command very busily. In addition, a communication error with cib occurs in the DC node, and crmd reboots. Jun 14 10:58:09 hb0101 attrd: [2278]: WARN: xmlfromIPC: No message received in the required interval (120s) Jun 14 10:58:09 hb0101 attrd: [2278]: info: attrd_perform_update: Sent update -41: default_ping_set=1600 (snip) Jun 14 10:59:07 hb0101 crmd: [2280]: info: do_exit: [crmd] stopped (2) Jun 14 10:59:07 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:858 info: pcmk_ipc_exit: Client crmd (conn=0x106a2bf0, async-conn=0x106a2bf0) left Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:481 ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=2280, rc=2) Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] plugin.c:498 notice: pcmk_wait_dispatch: Respawning failed child process: crmd Jun 14 10:59:08 hb0101 corosync[2269]: #65533; [pcmk #65533;] utils.c:131 info: spawn_child: Forked child 2680 for process crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: Invoked: /usr/lib64/heartbeat/crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: main: CRM Hg Version: 9f04fa88cfd3da553e977cc79983d1c494c8b502 Jun 14 10:59:08 hb0101 crmd: [2680]: info: crmd_init: Starting crmd Jun 14 10:59:08 hb0101 crmd: [2680]: info: G_main_add_SignalHandler: Added signal handler for signal 17 There seems to be a problem in cib of the DC node somehow or other. We hope that an attribute change is completed in 16 nodes definitely. #65533;* Is this phenomenon a limit of the current cib process? More likely of the underlying messaging infrastructure, but I'll take a look. Perhaps the default cib operation timeouts are too low for larger clusters. The log attached it to next Bugzilla. #65533;* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2443 Ok, I'll follow up there. Best Regards, Hideo Yamauchi. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: