[ClusterLabs] MySQL resource causes error 0_monitor_20000.
Hi all, I made master-master replication on Pacemaker, but it causes an error on the 0_monitor_20000 operation. If one of them boots Heartbeat and the other doesn't, the error doesn't occur. What should I check? Thanks,

Host: centillion.db01 and centillion.db02
OS: CentOS 6.3
Heartbeat: 3.0.5
Pacemaker: 1.0.13
MySQL: 5.6.16

Error messages:
```
# crm_mon
Last updated: Fri Aug 14 17:28:58 2015
Stack: Heartbeat
Current DC: centillion.db02 (0302e3d0-df06-4847-b0f9-9ebddfb6aec7) - partition with quorum
Version: 1.0.13-a83fae5
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ centillion.db01 centillion.db02 ]

vip_192.168.10.200 (ocf::heartbeat:IPaddr2): Started centillion.db02
Master/Slave Set: mysql-clone
    mysql:0 (ocf::heartbeat:mysql): Master centillion.db01 FAILED
    Masters: [ centillion.db02 ]

Failed actions:
    mysql:0_monitor_20000 (node=centillion.db01, call=166, rc=8, status=complete): master
    mysql:0_monitor_30000 (node=centillion.db01, call=167, rc=8, status=complete): master
```
--
Kiwamu Okabe at METASEPI DESIGN
[ClusterLabs] Antw: Re: Single quotes in values for 'crm resource rsc param set'
Hi! Somewhat stupid question: Why don't you put monsters like

subagent="/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm'"

in a shell command file and execute that?

Regards,
Ulrich

>>> Vladislav Bogdanov bub...@hoster-ok.com wrote on 17.08.2015 at 11:22 in message 55d1a7d9.20...@hoster-ok.com:
(snip)
Re: [ClusterLabs] Single quotes in values for 'crm resource rsc param set'
17.08.2015 10:39, Kristoffer Grönlund wrote:
> Vladislav Bogdanov bub...@hoster-ok.com writes:
>> Hi Kristoffer, all.
>> Could you please look at why I get an error when trying to update a valid
>> resource value (which already has single quotes inside) with a slightly
>> different one by running the command in the subject? It looks like
>> is_value_sane() doesn't accept single quotes just because crmsh quotes all
>> arguments to crm_resource with them. I need to pass a command line with
>> semicolons in one of the parameters, which is run with eval in the
>> resource agent. Backslashed double-quoting does not work in this case,
>> but single quotes work fine. Could that be somehow fixed?
> Well, first of all, passing the command line through bash complicates
> things, so if that's what is causing you trouble you could try writing
> your command line to a file and passing it to crmsh using crm -f <file>.
> Another option is using crm -f - and piping the command line into crmsh.

Do you mean one with double-quotes? Otherwise is_value_sane() will fail anyway. Using ... \"string;string\" notation in the file strips the quotes from the actual command run. Well, maybe the function I use is not smart enough, but it works with a single-quoted value.

What I think could be done for single-quote support is to assume that a value which contains them was actually passed in double-quotes, so double-quotes should be used when running crm_resource. We may also keep in mind that the CIB uses double-quotes for values internally.

> If that doesn't help, it would help /me/ in figuring out just what the
> problem is if you could give me an example of what the current value is
> and what it is you are trying to set it to.

:) Well, this is the (obfuscated a bit due to customer's policies) working resource definition (word wrap off):

primitive staging-0-fs ocf:vendor:Filesystem \
    params device=/dev/vg_staging_shared/staging_0 \
        directory=/cluster/storage/staging-0 fstype=gfs2 options="" \
        manage_directory=true \
        subagent="/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm'" \
        subagent_timeout=10 \
    op start interval=0 timeout=90 \
    op stop interval=0 timeout=100 \
    op monitor interval=10 timeout=45 depth=0 \
    op monitor interval=240 timeout=240 depth=10 \
    op monitor interval=360 timeout=240 depth=20

Here is the command which fails:

# crm resource param staging-0-fs set subagent "/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'"
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.12 (1b9beb7)]
DEBUG: found pacemaker version: 1.1.12
ERROR: /sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm': bad name
ERROR: Bad usage: Expected valid name, got '/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'', command: 'param staging-0-fs set subagent /sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm''

Replacing the single quotes with backslashed double ones (\"5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\") makes that string unquoted in the CIB, so the semicolons are recognized as command separators by the shell run from the RA. Using double escaping (\\"5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\\") when passing the value in double quotes breaks the shell which runs the command. Using single quotes with one or two backslashes before the inner double quote produces an unparseable CIB with "dqout;" in it.

Here is the function which runs that subagent command (I believe it should support several semicolon-separated commands as well, but I did not test that yet):

run_subagent() {
    local subagent_timeout=$1
    local subagent_command=$2
    local WRAPPER
    # substitute the %a/%r/%n placeholders with action, resource and node name
    subagent_command=${subagent_command//%a/${__OCF_ACTION}}
    subagent_command=${subagent_command//%r/${OCF_RESOURCE_INSTANCE%:*}}
    subagent_command=${subagent_command//%n/$( crm_node -n )}
    case ${subagent_timeout} in
        0|""|*[!0-9]*)
            WRAPPER="bash -c \"${subagent_command}\""
            ;;
        *)
            WRAPPER="timeout -s KILL ${subagent_timeout} bash -c \"${subagent_command}\""
            ;;
    esac
    ocf_run eval ${WRAPPER}
}

It is called with:

run_subagent ${OCF_RESKEY_subagent_timeout} ${OCF_RESKEY_subagent}

Best regards,
Vladislav
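A minimal sketch of Kristoffer's crm -f suggestion, assuming the value read from a file reaches crm_resource verbatim; whether is_value_sane() accepts it this way is exactly what was being debated, so treat this as an experiment rather than a known-good recipe (the file name is hypothetical):

```
cat > /tmp/set-subagent.cli <<'EOF'
resource param staging-0-fs set subagent "/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'"
EOF
crm -f /tmp/set-subagent.cli
```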
Re: [ClusterLabs] Single quotes in values for 'crm resource rsc param set'
14.08.2015 19:51, Jan Pokorný wrote:
> On 14/08/15 18:22 +0300, Vladislav Bogdanov wrote:
>> I need to pass a command line with semicolons in one of the parameters,
>> which is run with eval in the resource agent. Backslashed double-quoting
>> does not work in this case, but single quotes work fine.
> Hmm, another data point for the recent "shell can be troublesome" theme:
> http://clusterlabs.org/pipermail/users/2015-August/000996.html

Yes, see my last message for more shell madness ;)
Re: [ClusterLabs] Antw: Re: Single quotes in values for 'crm resource rsc param set'
17.08.2015 12:44, Ulrich Windl wrote:
> Hi! Somewhat stupid question: Why don't you put monsters like
> subagent="/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm'"
> in a shell command file and execute that?

Hmm, probably a good point. I will think about it, thanks.

(snip)
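A hedged sketch of the wrapper-script idea from this exchange; all paths are illustrative. The CIB value then shrinks to a plain path plus the %a placeholder, with no quoting trouble:

```
cat > /usr/local/sbin/staging-0-throttle <<'EOF'
#!/bin/bash
# $1 receives the OCF action via the %a placeholder
exec /sbin/fs-io-throttle "$1" staging-0 /cluster/storage/staging-0 zone 0 \
    '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'
EOF
chmod +x /usr/local/sbin/staging-0-throttle
# the resource parameter then becomes:
#   subagent="/usr/local/sbin/staging-0-throttle %a"
```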
Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.
Hi Andrew,

>> I used the built-in SNMP. I started it as a daemon with the -d option.
> Is it running on both nodes or just snmp1?

On both nodes.

[root@snmp1 ~]# ps -ef | grep crm_mon
root 4923 1 0 09:42 ? 00:00:00 crm_mon -d -S 192.168.40.2 -W -p /tmp/ClusterMon-upstart.pid
[root@snmp2 ~]# ps -ef | grep crm_mon
root 4860 1 0 09:42 ? 00:00:00 crm_mon -d -S 192.168.40.2 -W -p /tmp/ClusterMon-upstart.pid

> Because there is no logic in crm_mon that would have remapped the monitor
> to start, my working theory is that it is a duplicate of an old event.
> Can you tell which node the trap is being sent from?

The trap is transmitted by the snmp1 node. The trap is not sent from the snmp2 node that rebooted.

Aug 18 09:44:37 SNMP-MANAGER snmptrapd[1334]: 2015-08-18 09:44:37 snmp1 [UDP: [192.168.40.100]:59668->[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1439858677) 166 days, 15:36:26.77#011SNMPv2-MIB::snmpTrapOID.0 = OID: PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: prmDummy#011PACEMAKER-MIB::pacemakerNotificationNode = STRING: snmp1#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: start#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: OK#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 0
Aug 18 09:44:37 SNMP-MANAGER snmptrapd[1334]: 2015-08-18 09:44:37 snmp1 [UDP: [192.168.40.100]:59668->[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1439858677) 166 days, 15:36:26.77#011SNMPv2-MIB::snmpTrapOID.0 = OID: PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: prmDummy#011PACEMAKER-MIB::pacemakerNotificationNode = STRING: snmp1#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: monitor#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: OK#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 0

Best Regards,
Hideo Yamauchi.

----- Original Message -----
From: renayama19661...@ybb.ne.jp
To: Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
Date: 2015/8/17, Mon 10:05
Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

Hi Andrew,
Thank you for the comments. I will confirm it tomorrow; I am on vacation today.
Best Regards,
Hideo Yamauchi.

----- Original Message -----
From: Andrew Beekhof and...@beekhof.net
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
Date: 2015/8/17, Mon 09:30
Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

> On 4 Aug 2015, at 7:36 pm, renayama19661...@ybb.ne.jp wrote:
> Hi Andrew,
> Thank you for the comments.
> However, a trap of crm_mon is sent to an SNMP manager.

Are you using the built-in SNMP logic or using -E to give crm_mon a script which is then producing the trap? (I'm trying to figure out who could be turning the monitor action into a start.)

> I used the built-in SNMP. I started it as a daemon with the -d option.

Is it running on both nodes or just snmp1? Because there is no logic in crm_mon that would have remapped the monitor to start, my working theory is that it is a duplicate of an old event. Can you tell which node the trap is being sent from?

----- Original Message -----
From: Andrew Beekhof and...@beekhof.net
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
Date: 2015/8/4, Tue 14:15
Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

> On 27 Jul 2015, at 4:18 pm, renayama19661...@ybb.ne.jp wrote:
> Hi All,
> The transmission of the SNMP trap of crm_mon seems to have a problem. I
> identified the problem on the latest Pacemaker and Pacemaker 1.1.13.
>
> Step 1) I construct a cluster and load a simple CLI file.
>
> [root@snmp1 ~]# crm_mon -1
> Last updated: Mon Jul 27 14:40:37 2015
> Last change: Mon Jul 27 14:40:29 2015 by root via cibadmin on snmp1
> Stack: corosync
> Current DC: snmp1 (version 1.1.13-3d781d3) - partition with quorum
> 2 nodes and 1 resource configured
>
> Online: [ snmp1 snmp2 ]
>
> prmDummy (ocf::heartbeat:Dummy): Started snmp1
>
> Step 2) I stop the standby node once.
>
> [root@snmp2 ~]# stop pacemaker
> pacemaker stop/waiting
>
> Step 3) I start the standby node again.
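A hedged sketch of the -E alternative Andrew mentions: an external agent that logs every event crm_mon delivers, which makes duplicated or replayed events easy to spot. The script path is illustrative; the CRM_notify_* variables are the ones crm_mon exports to external agents:

```
cat > /usr/local/bin/log-cluster-event.sh <<'EOF'
#!/bin/sh
# append one line per cluster event, with a local receive timestamp
echo "$(date '+%F %T') node=${CRM_notify_node} rsc=${CRM_notify_rsc} \
op=${CRM_notify_task} rc=${CRM_notify_rc} desc=${CRM_notify_desc}" \
    >> /var/log/crm_mon-events.log
EOF
chmod +x /usr/local/bin/log-cluster-event.sh
crm_mon -d -p /tmp/ClusterMon-external.pid -E /usr/local/bin/log-cluster-event.sh
```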
Re: [ClusterLabs] upgrade from 1.1.9 to 1.1.12 fails to start
On 18 Aug 2015, at 7:13 am, Streeter, Michelle N michelle.n.stree...@boeing.com wrote:
> I was recommended to upgrade from 1.1.9 to 1.1.12. I had to uninstall the
> 1.1.9 version to install the 1.1.12 version.

Did you upgrade anything else? cman? corosync? heartbeat? What distro? Logs? Stack trace? Where did the packages come from?

> I am not allowed to connect to a repo, so I have to download the rpms and
> install them individually. After I installed pacemaker-lib, cli,
> cluster-lib, and pacemaker itself, the cluster failed to start when I
> rebooted. When I tried to start it manually, I got:
>
> Starting Pacemaker Cluster Manager/etc/init.d/pacemaker: line 94: 8219 Segmentation fault (core dumped) $prog > /dev/null 2>&1
>
> I deleted the Cluster.conf file and the cib.xml and all the backup
> versions and tried again, and got the same error. I googled this error and
> really got nothing. Any ideas?

Not based on what you've told us.

> Michelle Streeter ASC2 MCS - SDE/ACL/SDL/EDL
> OKC Software Engineer, The Boeing Company
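A hedged sketch of how the stack trace Andrew asks for could be captured; the core-pattern path and the offline availability of a matching debuginfo rpm are assumptions for a RHEL 6-style system:

```
ulimit -c unlimited                       # allow core dumps in this shell
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
pacemakerd                                # run in the foreground to reproduce the segfault
# then, with the matching pacemaker-debuginfo rpm installed:
gdb /usr/sbin/pacemakerd /tmp/core.pacemakerd.<pid>
# at the (gdb) prompt, "bt full" gives the backtrace to post to the list
```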
Re: [ClusterLabs] Antw: Ordering constraint restart second resource group
On 17 Aug 2015, at 1:30 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
> 17.08.2015 02:26, Andrew Beekhof wrote:
>> On 13 Aug 2015, at 7:33 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
>>> On Thu, Aug 13, 2015 at 11:25 AM, Ulrich Windl
>>> ulrich.wi...@rz.uni-regensburg.de wrote:
>>>> And what exactly is your problem?
>>> Real life example. A database resource depends on storage resource(s).
>>> There are multiple filesystems/volumes with database files. The database
>>> admin needs to increase the available space. You add new storage,
>>> configure it in the cluster ... pooh, your database is restarted.
>> "configure it in cluster" hmmm
>> If you're expanding an existing mount point, then I'd expect you don't
>> need to update the cluster.
>> If you're creating a new mount point, wouldn't you need to take the db
>> down in order to point to the new location?
> No. The databases I worked with can use multiple storage locations at the
> same time, and those storage locations can be added (and removed) online.

Nice. In that case, you could try adding it as a resource but waiting until it is active before creating the ordering constraint.
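A minimal sketch of that suggestion in crmsh syntax; all resource and device names are illustrative. Because the order constraint is only added once both resources are already active, it is satisfied immediately and the running database should not be rescheduled:

```
# 1. add the new storage with no relation to the database yet
crm configure primitive new-db-fs ocf:heartbeat:Filesystem \
    params device=/dev/vg_db/new_lv directory=/db/new fstype=xfs
# 2. wait for it to start
crm resource status new-db-fs
# 3. only then tie it into the ordering
crm configure order db-after-new-fs inf: new-db-fs database
```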
Re: [ClusterLabs] [Question:pacemaker_remote] By the operation that remote node cannot carry out a cluster, the resource does not move. (STONITH is not carried out, too)
Should be fixed now. Thanks for the report!

> On 12 Aug 2015, at 1:20 pm, renayama19661...@ybb.ne.jp wrote:
> Hi All,
> We confirmed movement of pacemaker_remote. (version: pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977)
> It is the following cluster constitution.
> * bl460g8n3 (KVM host)
> * bl460g8n4 (KVM host)
> * pgsr01 (Guest on the bl460g8n3 host)
> * pgsr02 (Guest on the bl460g8n4 host)
>
> Step 1) I compose a cluster of a simple resource.
>
> [root@bl460g8n3 ~]# crm_mon -1 -Af
> Last updated: Wed Aug 12 11:52:27 2015
> Last change: Wed Aug 12 11:51:47 2015 by root via crm_resource on bl460g8n4
> Stack: corosync
> Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
> 4 nodes and 10 resources configured
>
> Online: [ bl460g8n3 bl460g8n4 ]
> GuestOnline: [ pgsr01@bl460g8n3 pgsr02@bl460g8n4 ]
>
> prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
> prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
> Resource Group: grpStonith1
>     prmStonith1-2 (stonith:external/ipmi): Started bl460g8n4
> Resource Group: grpStonith2
>     prmStonith2-2 (stonith:external/ipmi): Started bl460g8n3
> Resource Group: master-group
>     vip-master (ocf::heartbeat:Dummy): Started pgsr02
>     vip-rep (ocf::heartbeat:Dummy): Started pgsr02
> Master/Slave Set: msPostgresql [pgsql]
>     Masters: [ pgsr02 ]
>     Slaves: [ pgsr01 ]
>
> Node Attributes:
> * Node bl460g8n3:
> * Node bl460g8n4:
> * Node pgsr01@bl460g8n3:
>     + master-pgsql : 5
> * Node pgsr02@bl460g8n4:
>     + master-pgsql : 10
>
> Migration Summary:
> * Node bl460g8n4:
> * Node bl460g8n3:
> * Node pgsr02@bl460g8n4:
> * Node pgsr01@bl460g8n3:
>
> Step 2) I cause a failure of pacemaker_remote on pgsr02.
>
> [root@pgsr02 ~]# ps -ef | grep remote
> root 1171 1 0 11:52 ? 00:00:00 /usr/sbin/pacemaker_remoted
> root 1428 1377 0 11:53 pts/0 00:00:00 grep --color=auto remote
> [root@pgsr02 ~]# kill -9 1171
>
> Step 3) After the failure, the master-group resource does not start on pgsr01.
>
> [root@bl460g8n3 ~]# crm_mon -1 -Af
> Last updated: Wed Aug 12 11:54:04 2015
> Last change: Wed Aug 12 11:51:47 2015 by root via crm_resource on bl460g8n4
> Stack: corosync
> Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
> 4 nodes and 10 resources configured
>
> Online: [ bl460g8n3 bl460g8n4 ]
> GuestOnline: [ pgsr01@bl460g8n3 ]
>
> prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
> prmDB2 (ocf::heartbeat:VirtualDomain): FAILED bl460g8n4
> Resource Group: grpStonith1
>     prmStonith1-2 (stonith:external/ipmi): Started bl460g8n4
> Resource Group: grpStonith2
>     prmStonith2-2 (stonith:external/ipmi): Started bl460g8n3
> Master/Slave Set: msPostgresql [pgsql]
>     Masters: [ pgsr01 ]
>
> Node Attributes:
> * Node bl460g8n3:
> * Node bl460g8n4:
> * Node pgsr01@bl460g8n3:
>     + master-pgsql : 10
>
> Migration Summary:
> * Node bl460g8n4:
>     pgsr02: migration-threshold=1 fail-count=1 last-failure='Wed Aug 12 11:53:39 2015'
> * Node bl460g8n3:
> * Node pgsr01@bl460g8n3:
>
> Failed Actions:
> * pgsr02_monitor_3 on bl460g8n4 'unknown error' (1): call=2, status=Error, exitreason='none', last-rc-change='Wed Aug 12 11:53:39 2015', queued=0ms, exec=0ms
>
> It seems to be caused by the fact that STONITH is not carried out somehow
> or other. The demote operation that the cluster cannot handle seems to
> obstruct the start on pgsr01.
>
> --------
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: Graph 10 with 20 actions: batch-limit=20 jobs, network-delay=0ms
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 4]: Pending rsc op prmDB2_stop_0 on bl460g8n4 (priority: 0, waiting: 70)
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 36]: Completed pseudo op master-group_stop_0 on N/A (priority: 0, waiting: none)
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 34]: Completed pseudo op master-group_start_0 on N/A (priority: 0, waiting: none)
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 82]: Completed rsc op pgsql_post_notify_demote_0 on pgsr01 (priority: 100, waiting: none)
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 81]: Completed rsc op pgsql_pre_notify_demote_0 on pgsr01 (priority: 0, waiting: none)
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 78]: Completed rsc op pgsql_post_notify_stop_0 on pgsr01 (priority: 100, waiting: none)
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 77]: Completed rsc op pgsql_pre_notify_stop_0 on pgsr01 (priority: 0, waiting: none)
> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action 67]: Completed pseudo op msPostgresql_confirmed-post_notify_demoted_0 on N/A (priority: 100, waiting: none)
> Aug 12
Re: [ClusterLabs] [Question:pacemaker_remote] By the operation that remote node cannot carry out a cluster, the resource does not move. (STONITH is not carried out, too)
Hi Andrew,

The fix seems to still have a problem. It is awaiting a demote, and the master-group resource cannot move.

[root@bl460g8n3 ~]# crm_mon -1 -Af
Last updated: Tue Aug 18 11:13:39 2015
Last change: Tue Aug 18 11:11:01 2015 by root via crm_resource on bl460g8n4
Stack: corosync
Current DC: bl460g8n3 (version 1.1.13-7d0cac0) - partition with quorum
4 nodes and 10 resources configured

Online: [ bl460g8n3 bl460g8n4 ]
GuestOnline: [ pgsr02@bl460g8n4 ]

prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
Resource Group: grpStonith1
    prmStonith1-2 (stonith:external/ipmi): Started bl460g8n4
Resource Group: grpStonith2
    prmStonith2-2 (stonith:external/ipmi): Started bl460g8n3
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ pgsr02 ]

Node Attributes:
* Node bl460g8n3:
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:
    + master-pgsql : 10

Migration Summary:
* Node bl460g8n3:
    pgsr01: migration-threshold=1 fail-count=1 last-failure='Tue Aug 18 11:12:03 2015'
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:

Failed Actions:
* pgsr01_monitor_3 on bl460g8n3 'unknown error' (1): call=2, status=Error, exitreason='none', last-rc-change='Tue Aug 18 11:12:03 2015', queued=0ms, exec=0ms

(snip)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Container prmDB1 and the resources within it have failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing prmDB1 away from bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsr01 has failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing pgsr01 away from bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: prmDB1: Rolling back scores from pgsr01
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource prmDB1 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsr01 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsql:0: Rolling back scores from vip-master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsql:0 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Promoting pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: msPostgresql: Promoted 1 instances of a possible 1 to master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-master_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (10s) for vip-master on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-rep_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (10s) for vip-rep on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (9s) for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Start recurring monitor (9s) for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Impliying node pgsr01 is down when container prmDB1 is stopped ((nil))
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmDB1 (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmDB2 (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmStonith1-2 (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave prmStonith2-2 (Started bl460g8n3)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop vip-master (Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop vip-rep (Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Demote pgsql:0 (Master - Stopped pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave pgsr01 (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave pgsr02 (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' because of pgsql:0: blocked failed
Aug
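A hedged sketch of how the evidence for this follow-up could be packaged for the developers with crm_report; the time window and destination path are illustrative:

```
crm_report -f "2015-08-18 11:10:00" -t "2015-08-18 11:20:00" \
    -n "bl460g8n3 bl460g8n4" /tmp/pgsr01-demote-blocked
```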
Re: [ClusterLabs] [Question:pacemaker_remote] About limitation of the placement of the resource to remote node.
On 13 Aug 2015, at 10:23 am, renayama19661...@ybb.ne.jp wrote:
> Hi All,
> We confirmed movement of pacemaker_remote. (version: pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977)
> It is the following cluster constitution.
> * sl7-01 (KVM host)
> * snmp1 (Guest on the sl7-01 host)
> * snmp2 (Guest on the sl7-01 host)
>
> We prepared the following CLI file to confirm resource placement on remote nodes.
> --------
> property no-quorum-policy=ignore \
>     stonith-enabled=false \
>     startup-fencing=false
> rsc_defaults resource-stickiness=INFINITY \
>     migration-threshold=1
> primitive remote-vm2 ocf:pacemaker:remote \
>     params server=snmp1 \
>     op monitor interval=3 timeout=15
> primitive remote-vm3 ocf:pacemaker:remote \
>     params server=snmp2 \
>     op monitor interval=3 timeout=15
> primitive dummy-remote-A Dummy \
>     op start interval=0s timeout=60s \
>     op monitor interval=30s timeout=60s \
>     op stop interval=0s timeout=60s
> primitive dummy-remote-B Dummy \
>     op start interval=0s timeout=60s \
>     op monitor interval=30s timeout=60s \
>     op stop interval=0s timeout=60s
> location loc1 dummy-remote-A \
>     rule 200: #uname eq remote-vm3 \
>     rule 100: #uname eq remote-vm2 \
>     rule -inf: #uname eq sl7-01
> location loc2 dummy-remote-B \
>     rule 200: #uname eq remote-vm3 \
>     rule 100: #uname eq remote-vm2 \
>     rule -inf: #uname eq sl7-01
> --------
>
> Case 1) The resources are placed as follows when I load the CLI file we
> prepared. However, the placement of the dummy-remote resources does not
> meet the condition: dummy-remote-A starts on remote-vm2.
>
> [root@sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Aug 13 08:49:09 2015
> Last change: Thu Aug 13 08:41:14 2015 by root via cibadmin on sl7-01
> Stack: corosync
> Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
> 3 nodes and 4 resources configured
>
> Online: [ sl7-01 ]
> RemoteOnline: [ remote-vm2 remote-vm3 ]
>
> dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm2
> dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
> remote-vm2 (ocf::pacemaker:remote): Started sl7-01
> remote-vm3 (ocf::pacemaker:remote): Started sl7-01

It is possible that there was a time when only remote-vm2 was available (so we put dummy-remote-A there) and then, before we could start dummy-remote-B there too, remote-vm3 showed up; but due to resource-stickiness="INFINITY" we didn't move dummy-remote-A.

> (snip)
> Case 2) When we change the CLI file and load it,

You lost me here :-) Can you rephrase please?

> the resources are placed as follows. The resources are placed correctly:
> dummy-remote-A starts on remote-vm3, and dummy-remote-B starts on remote-vm3.
> (snip)
> location loc1 dummy-remote-A \
>     rule 200: #uname eq remote-vm3 \
>     rule 100: #uname eq remote-vm2 \
>     rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
>     rule -inf: #uname eq sl7-01
> location loc2 dummy-remote-B \
>     rule 200: #uname eq remote-vm3 \
>     rule 100: #uname eq remote-vm2 \
>     rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
>     rule -inf: #uname eq sl7-01
> (snip)
>
> [root@sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Aug 13 08:55:28 2015
> Last change: Thu Aug 13 08:55:22 2015 by root via cibadmin on sl7-01
> Stack: corosync
> Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
> 3 nodes and 4 resources configured
>
> Online: [ sl7-01 ]
> RemoteOnline: [ remote-vm2 remote-vm3 ]
>
> dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm3
> dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
> remote-vm2 (ocf::pacemaker:remote): Started sl7-01
> remote-vm3 (ocf::pacemaker:remote): Started sl7-01
> (snip)
>
> As for the placement being wrong with the first CLI file, it looks as if
> placement constraints for a remote node are not evaluated until the remote
> resource has started. The placement becomes right with the revised CLI
> file, but describing this limitation is very troublesome when we compose a
> cluster of more nodes. Should remote nodes not delay the evaluation of
> placement constraints until they have started?

Potentially. I'd need a crm_report to confirm though.

> Is there a method to easily describe the placement of resources on remote
> nodes?
> * As one means, we know that the placement of the resources goes well by
>   dividing the first CLI file into two: after the cluster loads the CLI
>   which starts the remote nodes, we load the CLI where the cluster starts
>   the resources.
> * However, we do not want to divide the CLI file into two if possible.
>
> Best Regards,
> Hideo Yamauchi.
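A brief sketch of the two-step load the poster describes; the file names are hypothetical. The second file is applied only once both remote nodes are online, so their location rules are evaluated against live nodes:

```
crm -f remote-nodes.cli            # defines remote-vm2 and remote-vm3 only
crm_mon -1 | grep RemoteOnline     # wait until both remote nodes appear here
crm -f dummy-resources.cli         # then load dummy-remote-A/B and loc1/loc2
```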
Re: [ClusterLabs] MySQL resource causes error 0_monitor_20000.
Sent from my iPhone

On 18 Aug 2015, at 7:19, Kiwamu Okabe kiw...@gmail.com wrote:
> Hi all, I made master-master replication on Pacemaker. But it causes error
> 0_monitor_20000.

It's not an error, it is just the operation name.

> If one of them boots Heartbeat and another doesn't, the error doesn't
> occur. What should I check?

Probably you have to allow more than one master (the default is just one); see the description of the master-max resource option.

(snip)
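A hedged crmsh sketch of that suggestion; the ms resource name "mysql-clone" is taken from the crm_mon output above, and the exact commands should be checked against the old 1.0.x crm shell on this system:

```
# allow two concurrent masters on the existing master/slave set
crm resource meta mysql-clone set master-max 2
# or, when (re)defining the set:
crm configure ms mysql-clone mysql \
    meta master-max=2 master-node-max=1 clone-max=2 notify=true
```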
Re: [ClusterLabs] Single quotes in values for 'crm resource rsc param set'
Vladislav Bogdanov bub...@hoster-ok.com writes:
> Hi Kristoffer, all.
> Could you please look at why I get an error when trying to update a valid
> resource value (which already has single quotes inside) with a slightly
> different one by running the command in the subject? It looks like
> is_value_sane() doesn't accept single quotes just because crmsh quotes all
> arguments to crm_resource with them. I need to pass a command line with
> semicolons in one of the parameters, which is run with eval in the
> resource agent. Backslashed double-quoting does not work in this case, but
> single quotes work fine. Could that be somehow fixed?

Well, first of all, passing the command line through bash complicates things, so if that's what is causing you trouble you could try writing your command line to a file and passing it to crmsh using crm -f <file>. Another option is using crm -f - and piping the command line into crmsh.

If that doesn't help, it would help /me/ in figuring out just what the problem is if you could give me an example of what the current value is and what it is you are trying to set it to. :)

Thanks!
Kristoffer

> Best,
> Vladislav

--
// Kristoffer Grönlund
// kgronl...@suse.com
[ClusterLabs] upgrade from 1.1.9 to 1.1.12 fails to start
I was recommended to upgrade from 1.1.9 to 1.1.12. I had to uninstall the 1.1.9 version to install the 1.1.12 version. I am not allowed to connect to a repo, so I have to download the rpms and install them individually. After I installed pacemaker-lib, cli, cluster-lib, and pacemaker itself, the cluster failed to start when I rebooted. When I tried to start it manually, I got:

Starting Pacemaker Cluster Manager/etc/init.d/pacemaker: line 94: 8219 Segmentation fault (core dumped) $prog > /dev/null 2>&1

I deleted the Cluster.conf file and the cib.xml and all the backup versions and tried again and got the same error. I googled this error and really got nothing. Any ideas?

Michelle Streeter ASC2 MCS - SDE/ACL/SDL/EDL
OKC Software Engineer, The Boeing Company
Re: [ClusterLabs] upgrade from 1.1.9 to 1.1.12 fails to start
On 17/08/15 05:13 PM, Streeter, Michelle N wrote:
> I was recommended to upgrade from 1.1.9 to 1.1.12. I had to uninstall the
> 1.1.9 version to install the 1.1.12 version. I am not allowed to connect
> to a repo, so I have to download the rpms and install them individually.
> After I installed pacemaker-lib, cli, cluster-lib, and pacemaker itself,
> the cluster failed to start when I rebooted. When I tried to start it
> manually, I got:
>
> Starting Pacemaker Cluster Manager/etc/init.d/pacemaker: line 94: 8219 Segmentation fault (core dumped) $prog > /dev/null 2>&1
>
> I deleted the Cluster.conf file and the cib.xml and all the backup
> versions and tried again and got the same error. I googled this error and
> really got nothing. Any ideas?

As a test, can you create a fresh, new cluster?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
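A hedged sketch of what that test could look like, assuming the CMAN + Pacemaker stack implied by cluster.conf; this discards local cluster state, so it should only be run on a scratch setup, and the CIB path varies between pacemaker versions:

```
service pacemaker stop && service cman stop
mv /var/lib/pacemaker/cib /var/lib/pacemaker/cib.bak     # CIB location in 1.1.12
mv /etc/cluster/cluster.conf /etc/cluster/cluster.conf.bak
# recreate a minimal cluster.conf (e.g. with ccs), then restart the stack
service cman start && service pacemaker start
crm_mon -1
```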
Re: [ClusterLabs] Antw: Re: Memory leak in crm_mon ?
On 17 Aug 2015, at 4:35 pm, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
> Andrew Beekhof and...@beekhof.net wrote on 17.08.2015 at 00:08 in message ff78be4f-173c-4a74-a989-92ea6c540...@beekhof.net:
>> On 16 Aug 2015, at 9:41 pm, Attila Megyeri amegy...@minerva-soft.com wrote:
>>> Hi Andrew, I managed to isolate / reproduce the issue. You might want to
>>> take a look, as it might be present in 1.1.12 as well. I monitor my
>>> cluster from putty, mainly this way:
>>> - I have a putty (Windows client) session that connects via SSH to the
>>>   box and authenticates using a public key as a non-root user.
>>> - It immediately sends a "sudo crm_mon -Af" command, so with a single
>>>   click I have a nice view of what the cluster is doing.
>> Perhaps add -1 to the option list. The root cause seems to be that
>> closing the putty window doesn't actually kill the process running
>> inside it.
> Sorry, the root cause seems to be that crm_mon happily writes to a closed
> filehandle (I guess). If crm_mon would handle that error by exiting the
> loop, there would be no need for putty to kill any process.

No, if you want a process to die you need to kill it.

>>> Whenever I close this putty window (terminate the app), the crm_mon
>>> process goes to 100% CPU usage, starts to leak, in a few hours consumes
>>> all memory, and then destroys the whole cluster. This does not happen if
>>> I leave crm_mon with Ctrl-C. I can reproduce this 100% with crm_mon
>>> 1.1.10, with the mainstream Ubuntu trusty packages. This might be
>>> related to how sudo executes crm_mon and what it signals to crm_mon when
>>> it gets terminated. Now I know what I need to pay attention to in order
>>> to avoid this problem, but you might want to check whether this issue is
>>> still present.
>>> Thanks, Attila
>>>
>>> -----Original Message-----
>>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>>> Sent: Friday, August 14, 2015 12:40 AM
>>> To: Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
>>> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
>>>
>>> -----Original Message-----
>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>> Sent: Tuesday, August 11, 2015 2:49 AM
>>> To: Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
>>> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
>>>
>>>> On 10 Aug 2015, at 5:33 pm, Attila Megyeri amegy...@minerva-soft.com wrote:
>>>> Hi! We are building a new cluster on top of pacemaker/corosync, and
>>>> several times during the past days we noticed that "crm_mon -Af" used
>>>> up all the memory plus swap and caused high CPU usage. Killing the
>>>> process solves the issue. We are using the binary package versions
>>>> available in the latest Ubuntu trusty, namely:
>>>> crmsh 1.2.5+hg1034-1ubuntu4
>>>> pacemaker 1.1.10+git20130802-1ubuntu2.3
>>>> pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.3
>>>> corosync 2.3.3-1ubuntu1
>>>> Kernel is 3.13.0-46-generic
>>>> Looking back at some "atop" data, the CPU went to 100% many times
>>>> during the last couple of days, at various times, more often exactly
>>>> around midnight (strange):
>>>> 08.05 14:00
>>>> 08.06 21:41
>>>> 08.07 00:00
>>>> 08.07 00:00
>>>> 08.08 00:00
>>>> 08.09 06:27
>>>> I checked the corosync log and syslog but did not find any correlation
>>>> between the entries in the logs around the specific times. For most of
>>>> the time, the node running crm_mon was the DC as well, not running any
>>>> resources (e.g. a node used only for quorum).
>>>> We have another running system, where everything works perfectly,
>>>> whereas it is almost the same:
>>>> crmsh 1.2.5+hg1034-1ubuntu4
>>>> pacemaker 1.1.10+git20130802-1ubuntu2.1
>>>> pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
>>>> corosync 2.3.3-1ubuntu1
>>>> Kernel is 3.13.0-8-generic
>>>> Is this perhaps a known issue?
>>> Possibly, that version is over 2 years old.
>>>> Any hints?
>>> Getting something a little more recent would be the best place to start.
>>>
>>> Thanks Andrew, I tried to upgrade to 1.1.12 using the packages available
>>> at https://launchpad.net/~syseleven-platform . In the first attempt I
>>> upgraded a single node, to see how it works out, but I ended up with
>>> errors like "Could not establish cib_rw connection: Connection refused
>>> (111)". I have disabled the firewall; no change. The node appears to be
>>> running but does not see any of the other nodes. On the other nodes I
>>> see this node as an UNCLEAN one. (I assume corosync is fine, but
>>> pacemaker is not.) I use udpu for the transport. Am I doing something
>>> wrong? I tried to look for some howtos on upgrade,
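Two hedged workarounds for the hung-crm_mon problem discussed above, pending a fixed package (the user and host names are placeholders):

```
# one-shot output: nothing is left running when the window closes
ssh user@node 'sudo crm_mon -1 -Af'
# or force a tty allocation, so the remote command group gets SIGHUP
# when the putty window is closed
ssh -t user@node 'sudo crm_mon -Af'
```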
[ClusterLabs] Antw: nfsServer Filesystem Failover average 76s
>>> Streeter, Michelle N michelle.n.stree...@boeing.com wrote on 14.08.2015 at 19:17 in message 9a18847a77a9a14da7e0fd240efcafc2502...@xch-phx-501.sw.nos.boeing.com:
> I am getting an average failover for nfs of 76s. I have set all the start
> and stop settings to 10s, but no change. The Web page is instant, but not
> nfs.

Did you try options -o and -t for crm_mon? I get some timing values then, e.g.:

+ (70) start: last-rc-change='Thu Jul 9 16:55:35 2015' last-run='Thu Jul 9 16:55:35 2015' exec-time=5572ms queue-time=0ms rc=0 (ok)
+ (129) monitor: interval=30ms last-rc-change='Fri Jul 10 12:55:29 2015' exec-time=16ms queue-time=0ms rc=0 (ok)

The other thing is to watch syslog for the timing of events.

> I am running a two node cluster on rhel6 with pacemaker 1.1.9. Surely
> these times are not right? Any suggestions?
>
> Resources:
>  Group: nfsgroup
>   Resource: nfsshare (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/sdb1 directory=/data fstype=ext4
>    Operations: start interval=0s (nfsshare-start-interval-0s)
>                stop interval=0s (nfsshare-stop-interval-0s)
>                monitor interval=10s (nfsshare-monitor-interval-10s)
>   Resource: nfsServer (class=ocf provider=heartbeat type=nfsserver)
>    Attributes: nfs_shared_infodir=/data/nfsinfo nfs_no_notify=true
>    Operations: start interval=0s timeout=10s (nfsServer-start-timeout-10s)
>                stop interval=0s timeout=10s (nfsServer-stop-timeout-10s)
>                monitor interval=10 timeout=20s (nfsServer-monitor-interval-10)
>   Resource: NAS (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=192.168.56.110 cidr_netmask=24
>    Operations: start interval=0s timeout=20s (NAS-start-timeout-20s)
>                stop interval=0s timeout=20s (NAS-stop-timeout-20s)
>                monitor interval=10s timeout=20s (NAS-monitor-interval-10s)
>
> Michelle Streeter ASC2 MCS - SDE/ACL/SDL/EDL
> OKC Software Engineer, The Boeing Company
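A hedged example of the suggestion above: a one-shot crm_mon with operation history (-o) and timing details (-t), followed by a look at syslog around the failover (the log path is the RHEL 6 default):

```
crm_mon -1 -o -t
grep -E 'nfsshare|nfsServer|NAS|pengine|crmd' /var/log/messages | tail -n 60
```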