[ClusterLabs] MySQL resource causes error 0_monitor_20000.

2015-08-17 Thread Kiwamu Okabe
Hi all,

I made master-master replication on Pacemaker.
But it causes the error 0_monitor_20000.
If one of them boots Heartbeat and the other doesn't, the error doesn't occur.

What should I check?

Thanks,

Host: centillion.db01 and centillion.db02
OS: CentOS 6.3
Heartbeat: 3.0.5
Pacemaker: 1.0.13
MySQL: 5.6.16

Error messages:

```
# crm_mon

Last updated: Fri Aug 14 17:28:58 2015
Stack: Heartbeat
Current DC: centillion.db02 (0302e3d0-df06-4847-b0f9-9ebddfb6aec7) -
partition with quorum
Version: 1.0.13-a83fae5
2 Nodes configured, unknown expected votes
2 Resources configured.


Online: [ centillion.db01 centillion.db02 ]

vip_192.168.10.200  (ocf::heartbeat:IPaddr2):   Started centillion.db02
 Master/Slave Set: mysql-clone
 mysql:0(ocf::heartbeat:mysql): Master centillion.db01 FAILED
 Masters: [ centillion.db02 ]

Failed actions:
mysql:0_monitor_2 (node=centillion.db01, call=166, rc=8,
status=complete): master
mysql:0_monitor_3 (node=centillion.db01, call=167, rc=8,
status=complete): master
```
-- 
Kiwamu Okabe at METASEPI DESIGN

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Single quotes in values for 'crm resource rsc param set'

2015-08-17 Thread Ulrich Windl
Hi!

Somewhat stupid question: Why don't you put monsters like
subagent=/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0
'5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm'
into a shell command file and execute that?
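
A minimal sketch of that approach, assuming a hypothetical wrapper script path (/usr/local/bin/staging-0-throttle), so the semicolon-heavy argument never has to pass through crmsh:

```
#!/bin/sh
# /usr/local/bin/staging-0-throttle (hypothetical path)
# The RA still substitutes %a, so the action arrives here as $1.
exec /sbin/fs-io-throttle "$1" staging-0 /cluster/storage/staging-0 zone 0 \
    '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm'
```

The subagent parameter would then shrink to something like subagent="/usr/local/bin/staging-0-throttle %a".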

Regards,
Ulrich

 Vladislav Bogdanov bub...@hoster-ok.com wrote on 17.08.2015 at 11:22 in
message 55d1a7d9.20...@hoster-ok.com:
 17.08.2015 10:39, Kristoffer Grönlund wrote:
 Vladislav Bogdanov bub...@hoster-ok.com writes:
 
 Hi Kristoffer, all.

 Could you please look why I get error when trying to update valid
 resource value (which already has single quotes inside) with the
 slightly different one by running the command in the subject?

 It looks like is_value_sane() doesn't accept single quotes just because
 crmsh quotes all arguments to crm_resource with them. I need to pass a
 command-line with semicolons in one of parameters which is run with eval
 in the resource agent. Backslashed double-quoting does not work in this
 case, but single-quotes work fine.

 Could that be some-how fixed?
 
 Well, first of all passing the command line through bash complicates
 things, so if that's what is causing you trouble you could try writing
 your command line to a file and passing it to crmsh using crm -f file.
 Another option is using crm -f - and piping the command line into
 crmsh.
 
 
 Do you mean one with double-quotes?
 Otherwise is_value_sane() will fail anyways.
 
 Using ... \string;string\ notation in the file strips quotes from the 
 actual command run.
 Well, may be function I use is not smart enough, but that works with 
 single-qouted value.
 
 What I think could be done for single-quotes support is to assume that value

 which contains
 them was actually passed in the double-quotes, so double-quotes should be 
 used when
 running crm_resource. We may also have in mind that CIB uses double-quotes 
 for values internally.
 
  
 If that doesn't help, it would help /me/ in figuring out just what the
 problem is if you could give me an example of what the current value is
 and what it is you are trying to set it to. :)
 
 Well, this is the (obfuscated a bit due to customer's policies) working 
 resource definition
 (word wrap off):
 
 primitive staging-0-fs ocf:vendor:Filesystem \
 params device=/dev/vg_staging_shared/staging_0 
 directory=/cluster/storage/staging-0 fstype=gfs2 options= 
 manage_directory=true subagent=/sbin/fs-io-throttle %a staging-0 
 /cluster/storage/staging-0 zone 0 
 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm' subagent_timeout=10

 \
 op start interval=0 timeout=90 \
 op stop interval=0 timeout=100 \
 op monitor interval=10 timeout=45 depth=0 \
 op monitor interval=240 timeout=240 depth=10 \
 op monitor interval=360 timeout=240 depth=20
 
 Here is the command which fails:
 
 # crm resource param staging-0-fs set subagent /sbin/fs-io-throttle %a 
 staging-0 /cluster/storage/staging-0 zone 0 
 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'
 DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.12 (1b9beb7)]
 DEBUG: found pacemaker version: 1.1.12
 ERROR: /sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 
 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm': bad name
 ERROR: Bad usage: Expected valid name, got '/sbin/fs-io-throttle %a 
 staging-0 /cluster/storage/staging-0 zone 0 
 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'', command: 'param 
 staging-0-fs set subagent /sbin/fs-io-throttle %a staging-0 
 /cluster/storage/staging-0 zone 0 
 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm''
 
 Replacing single-quotes with back-slashed double ones 
 (\5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\)
 makes that string unquoted in the CIB, so semicolons are recognized as 
 command separators by the shell
 run from the RA.
 Using double-escaping 
 (\\5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\\) when passing 
 value in the
 double quotes breaks the shell which runs command.
 
 Using single quotes with one or two back-slashes before double-quote inside

 for a value produces
 unparseable CIB with dqout; in it.
 
 
 Here is the function which runs that subagent command (I believe it should 
 support several
 semicolon-separated commands as well, but did not test that yet):
 
 run_subagent() {
 local subagent_timeout=$1
 local subagent_command=$2
 local WRAPPER
 
 subagent_command=${subagent_command//%a/${__OCF_ACTION}}
 subagent_command=${subagent_command//%r/${OCF_RESOURCE_INSTANCE%:*}}
 subagent_command=${subagent_command//%n/$( crm_node -n )}
 
 case ${subagent_timeout} in
 0|""|*[!0-9]*)
 WRAPPER="bash -c \"${subagent_command}\""
 ;;
 *)
 WRAPPER="timeout -s KILL ${subagent_timeout} bash -c \"${subagent_command}\""
 ;;
 esac
 
 ocf_run eval ${WRAPPER}
 }
 
 It is called with:
 
 run_subagent ${OCF_RESKEY_subagent_timeout} 

Re: [ClusterLabs] Single quotes in values for 'crm resource rsc param set'

2015-08-17 Thread Vladislav Bogdanov
17.08.2015 10:39, Kristoffer Grönlund wrote:
 Vladislav Bogdanov bub...@hoster-ok.com writes:
 
 Hi Kristoffer, all.

 Could you please look why I get error when trying to update valid
 resource value (which already has single quotes inside) with the
 slightly different one by running the command in the subject?

 It looks like is_value_sane() doesn't accept single quotes just because
 crmsh quotes all arguments to crm_resource with them. I need to pass a
 command-line with semicolons in one of parameters which is run with eval
 in the resource agent. Backslashed double-quoting does not work in this
 case, but single-quotes work fine.

 Could that be some-how fixed?
 
 Well, first of all passing the command line through bash complicates
 things, so if that's what is causing you trouble you could try writing
 your command line to a file and passing it to crmsh using crm -f file.
 Another option is using crm -f - and piping the command line into
 crmsh.
 

Do you mean one with double-quotes?
Otherwise is_value_sane() will fail anyways.

Using ... \"string;string\" notation in the file strips quotes from the
actual command run.
Well, maybe the function I use is not smart enough, but it works with a
single-quoted value.

What I think could be done for single-quote support is to assume that a value
which contains them was actually passed in double quotes, so double quotes should
be used when running crm_resource. We should also keep in mind that the CIB uses
double quotes for values internally.
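
As an illustration of that idea (and a possible interim workaround), the outer quoting could be done in the shell and the value handed to crm_resource directly, bypassing crmsh's is_value_sane() check; a minimal sketch, assuming the resource and parameter from the definition below:

```
# Sketch only: the outer double quotes keep the embedded single quotes and semicolons as plain data
value="/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'"
crm_resource --resource staging-0-fs --set-parameter subagent --parameter-value "${value}"
```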

 
 If that doesn't help, it would help /me/ in figuring out just what the
 problem is if you could give me an example of what the current value is
 and what it is you are trying to set it to. :)

Well, this is the (obfuscated a bit due to customer's policies) working 
resource definition
(word wrap off):

primitive staging-0-fs ocf:vendor:Filesystem \
params device=/dev/vg_staging_shared/staging_0 directory=/cluster/storage/staging-0 fstype=gfs2 options="" manage_directory=true subagent="/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm'" subagent_timeout=10 \
op start interval=0 timeout=90 \
op stop interval=0 timeout=100 \
op monitor interval=10 timeout=45 depth=0 \
op monitor interval=240 timeout=240 depth=10 \
op monitor interval=360 timeout=240 depth=20

Here is the command which fails:

# crm resource param staging-0-fs set subagent "/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'"
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.12 (1b9beb7)]
DEBUG: found pacemaker version: 1.1.12
ERROR: /sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm': bad name
ERROR: Bad usage: Expected valid name, got '/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'', command: 'param staging-0-fs set subagent /sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm''

Replacing the single quotes with back-slashed double ones
(\"5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\")
makes that string unquoted in the CIB, so the semicolons are recognized as command
separators by the shell run from the RA.
Using double-escaping
(\\"5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\\") when passing the
value in double quotes breaks the shell which runs the command.

Using single quotes with one or two back-slashes before a double quote inside
a value produces an unparseable CIB with &dqout; in it.


Here is the function which runs that subagent command (I believe it should support
several semicolon-separated commands as well, but I did not test that yet):

run_subagent() {
    local subagent_timeout=$1
    local subagent_command=$2
    local WRAPPER

    # substitute the %a (action), %r (resource) and %n (node name) placeholders
    subagent_command=${subagent_command//%a/${__OCF_ACTION}}
    subagent_command=${subagent_command//%r/${OCF_RESOURCE_INSTANCE%:*}}
    subagent_command=${subagent_command//%n/$( crm_node -n )}

    case ${subagent_timeout} in
    0|""|*[!0-9]*)
        # no usable timeout: run the command under bash as-is
        WRAPPER="bash -c \"${subagent_command}\""
        ;;
    *)
        # kill the command if it exceeds the configured timeout
        WRAPPER="timeout -s KILL ${subagent_timeout} bash -c \"${subagent_command}\""
        ;;
    esac

    ocf_run eval ${WRAPPER}
}

It is called with:

run_subagent ${OCF_RESKEY_subagent_timeout} ${OCF_RESKEY_subagent}


Best regards,
Vladislav


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Single quotes in values for 'crm resource rsc param set'

2015-08-17 Thread Vladislav Bogdanov

14.08.2015 19:51, Jan Pokorný wrote:

On 14/08/15 18:22 +0300, Vladislav Bogdanov wrote:

I need to pass a command-line with semicolons in one of parameters
which is run with eval in the resource agent. Backslashed
double-quoting does not work in this case, but single-quotes work
fine.


Hmm, another data point to the recent "shell can be troublesome":
http://clusterlabs.org/pipermail/users/2015-August/000996.html


Yes, see my last message for more shell madness ;)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Single quotes in values for 'crm resource rsc param set'

2015-08-17 Thread Vladislav Bogdanov

17.08.2015 12:44, Ulrich Windl wrote:

Hi!

Somewhat stupid question: Why don't you put monsters like
subagent=/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0
'5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm'
into a shell command file and execute that?


Hmm, probably a good point. I will think about it, thanks.



Regards,
Ulrich


Vladislav Bogdanov bub...@hoster-ok.com wrote on 17.08.2015 at 11:22
in message 55d1a7d9.20...@hoster-ok.com:

17.08.2015 10:39, Kristoffer Grönlund wrote:

Vladislav Bogdanov bub...@hoster-ok.com writes:


Hi Kristoffer, all.

Could you please look why I get error when trying to update valid
resource value (which already has single quotes inside) with the
slightly different one by running the command in the subject?

It looks like is_value_sane() doesn't accept single quotes just because
crmsh quotes all arguments to crm_resource with them. I need to pass a
command-line with semicolons in one of parameters which is run with eval
in the resource agent. Backslashed double-quoting does not work in this
case, but single-quotes work fine.

Could that be some-how fixed?


Well, first of all passing the command line through bash complicates
things, so if that's what is causing you trouble you could try writing
your command line to a file and passing it to crmsh using crm -f file.
Another option is using crm -f - and piping the command line into
crmsh.



Do you mean one with double-quotes?
Otherwise is_value_sane() will fail anyways.

Using ... \string;string\ notation in the file strips quotes from the
actual command run.
Well, may be function I use is not smart enough, but that works with
single-qouted value.

What I think could be done for single-quotes support is to assume that value



which contains
them was actually passed in the double-quotes, so double-quotes should be
used when
running crm_resource. We may also have in mind that CIB uses double-quotes
for values internally.



If that doesn't help, it would help /me/ in figuring out just what the
problem is if you could give me an example of what the current value is
and what it is you are trying to set it to. :)


Well, this is the (obfuscated a bit due to customer's policies) working
resource definition
(word wrap off):

primitive staging-0-fs ocf:vendor:Filesystem \
 params device=/dev/vg_staging_shared/staging_0
directory=/cluster/storage/staging-0 fstype=gfs2 options=
manage_directory=true subagent=/sbin/fs-io-throttle %a staging-0
/cluster/storage/staging-0 zone 0
'5000M:300;2500M:100;1500M:50;1000M:35;500M:10;300M:mm' subagent_timeout=10



\
 op start interval=0 timeout=90 \
 op stop interval=0 timeout=100 \
 op monitor interval=10 timeout=45 depth=0 \
 op monitor interval=240 timeout=240 depth=10 \
 op monitor interval=360 timeout=240 depth=20

Here is the command which fails:

# crm resource param staging-0-fs set subagent /sbin/fs-io-throttle %a
staging-0 /cluster/storage/staging-0 zone 0
'5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.12 (1b9beb7)]
DEBUG: found pacemaker version: 1.1.12
ERROR: /sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0
'5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm': bad name
ERROR: Bad usage: Expected valid name, got '/sbin/fs-io-throttle %a
staging-0 /cluster/storage/staging-0 zone 0
'5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'', command: 'param
staging-0-fs set subagent /sbin/fs-io-throttle %a staging-0
/cluster/storage/staging-0 zone 0
'5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm''

Replacing single-quotes with back-slashed double ones
(\5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\)
makes that string unquoted in the CIB, so semicolons are recognized as
command separators by the shell
run from the RA.
Using double-escaping
(\\5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm\\) when passing
value in the
double quotes breaks the shell which runs command.

Using single quotes with one or two back-slashes before double-quote inside



for a value produces
unparseable CIB with dqout; in it.


Here is the function which runs that subagent command (I believe it should
support several
semicolon-separated commands as well, but did not test that yet):

run_subagent() {
 local subagent_timeout=$1
 local subagent_command=$2
 local WRAPPER

 subagent_command=${subagent_command//%a/${__OCF_ACTION}}
 subagent_command=${subagent_command//%r/${OCF_RESOURCE_INSTANCE%:*}}
 subagent_command=${subagent_command//%n/$( crm_node -n )}

 case ${subagent_timeout} in
 0|""|*[!0-9]*)
 WRAPPER="bash -c \"${subagent_command}\""
 ;;
 *)
 WRAPPER="timeout -s KILL ${subagent_timeout} bash -c \"${subagent_command}\""
 ;;
 esac

 ocf_run eval ${WRAPPER}
}

It is called with:

run_subagent ${OCF_RESKEY_subagent_timeout} 

Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

2015-08-17 Thread renayama19661014
Hi Andrew,


   I used the built-in SNMP.

   I started as a daemon with -d option.
 
 Is it running on both nodes or just snmp1?


On both nodes.

[root@snmp1 ~]# ps -ef |grep crm_mon
root      4923     1  0 09:42 ?        00:00:00 crm_mon -d -S 192.168.40.2 -W 
-p /tmp/ClusterMon-upstart.pid
[root@snmp2 ~]# ps -ef |grep crm_mon
root      4860     1  0 09:42 ?        00:00:00 crm_mon -d -S 192.168.40.2 -W 
-p /tmp/ClusterMon-upstart.pid


 Because there is no logic in crm_mon that would have remapped the monitor 
 to 
 start, so my working theory is that its a duplicate of an old event.
 Can you tell which node the trap is being sent from?


The trap is transmitted by snmp1 node.

The trap is not sent from the snmp2 node that rebooted.


Aug 18 09:44:37 SNMP-MANAGER snmptrapd[1334]: 2015-08-18 09:44:37 snmp1 [UDP: 
[192.168.40.100]:59668-[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance 
= Timeticks: (1439858677) 166 days, 15:36:26.77#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource
 = STRING: prmDummy#011PACEMAKER-MIB::pacemakerNotificationNode = STRING: 
snmp1#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
start#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
OK#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 0
Aug 18 09:44:37 SNMP-MANAGER snmptrapd[1334]: 2015-08-18 09:44:37 snmp1 [UDP: 
[192.168.40.100]:59668-[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance 
= Timeticks: (1439858677) 166 days, 15:36:26.77#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource
 = STRING: prmDummy#011PACEMAKER-MIB::pacemakerNotificationNode = STRING: 
snmp1#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
monitor#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
OK#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 0


Best Regards,
Hideo Yamauchi.




- Original Message -
 From: renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp
 To: Cluster Labs - All topics related to open-source clustering welcomed 
 users@clusterlabs.org
 Cc: 
 Date: 2015/8/17, Mon 10:05
 Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already 
 started is transmitted.
 
 Hi Andrew,
 
 Thank you for comments.
 
 
 I will confirm it tomorrow.
 I am a vacation today.
 
 Best Regards,
 Hideo Yamauchi.
 
 
 - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
 open-source clustering welcomed users@clusterlabs.org
  Cc: 
  Date: 2015/8/17, Mon 09:30
  Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already 
 started is transmitted.
 
 
   On 4 Aug 2015, at 7:36 pm, renayama19661...@ybb.ne.jp wrote:
 
   Hi Andrew,
 
   Thank you for comments.
 
   However, a trap of crm_mon is sent to an SNMP manager.
    
   Are you using the built-in SNMP logic or using -E to give crm_mon 
 a 
  script which 
   is then producing the trap?
   (I’m trying to figure out who could be turning the monitor action 
 into 
  a start)
 
 
   I used the built-in SNMP.
   I started as a daemon with -d option.
 
  Is it running on both nodes or just snmp1?
  Because there is no logic in crm_mon that would have remapped the monitor 
 to 
  start, so my working theory is that its a duplicate of an old event.
  Can you tell which node the trap is being sent from?
 
 
 
   Best Regards,
   Hideo Yamauchi.
 
 
   - Original Message -
   From: Andrew Beekhof and...@beekhof.net
   To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related 
 to 
  open-source clustering welcomed users@clusterlabs.org
   Cc: 
   Date: 2015/8/4, Tue 14:15
   Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been 
  already started is transmitted.
 
 
   On 27 Jul 2015, at 4:18 pm, renayama19661...@ybb.ne.jp wrote:
 
   Hi All,
 
   The transmission of the SNMP trap of crm_mon seems to have a 
  problem.
   I identified a problem on latest Pacemaker and 
 Pacemaker1.1.13.
 
 
   Step 1) I constitute a cluster and send simple CLI file.
 
   [root@snmp1 ~]# crm_mon -1 
   Last updated: Mon Jul 27 14:40:37 2015          Last change: 
 Mon 
  Jul 27 
   14:40:29 2015 by root via cibadmin on snmp1
   Stack: corosync
   Current DC: snmp1 (version 1.1.13-3d781d3) - partition with 
 quorum
   2 nodes and 1 resource configured
 
   Online: [ snmp1 snmp2 ]
 
     prmDummy       (ocf::heartbeat:Dummy): Started snmp1
 
   Step 2) I stop a node of the standby once.
 
   [root@snmp2 ~]# stop pacemaker
   pacemaker stop/waiting
 
 
   Step 3) I start a node of the standby again.
   

Re: [ClusterLabs] upgrade from 1.1.9 to 1.1.12 fails to start

2015-08-17 Thread Andrew Beekhof

 On 18 Aug 2015, at 7:13 am, Streeter, Michelle N 
 michelle.n.stree...@boeing.com wrote:
 
 I was recommended to upgrade from 1.1.9 to 1.1.12.  
 I had to uninstall the 1.1.9 version to install the 1.1.12 version

Did you upgrade anything else? cman? corosync? heartbeat? What distro? Logs? 
Stack trace? Where did the packages come from?

 I am not allowed to connect to a repo and so I have to download the rpms and 
 install them individually.
 After I installed pacemaker-lib, cli, cluster-lib, and pacemaker itself, when 
 I rebooted, the cluster failed to start
 When I tried to manually start it, I got
 Starting Pacemaker Cluster Manager/etc/init.d/pacemaker: line 94:  8219 
 Segmentation fault  (core dumped) $prog > /dev/null 2>&1
 I deleted the Cluster.conf file and the cib.xml and all the back up versions 
 and tried again and got the same error.
 I googled this error and really got nothing.   Any ideas?

Not based on what you’ve told us.

  
 Michelle Streeter
 ASC2 MCS – SDE/ACL/SDL/EDL OKC Software Engineer
 The Boeing Company 
  
 ___
 Users mailing list: Users@clusterlabs.org
 http://clusterlabs.org/mailman/listinfo/users
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Ordering constraint restart second resource group

2015-08-17 Thread Andrew Beekhof

 On 17 Aug 2015, at 1:30 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
 
 17.08.2015 02:26, Andrew Beekhof wrote:
 
 On 13 Aug 2015, at 7:33 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
 
 On Thu, Aug 13, 2015 at 11:25 AM, Ulrich Windl
 ulrich.wi...@rz.uni-regensburg.de wrote:
 And what exactly is your problem?
 
 Real life example. Database resource depends on storage resource(s).
 There are multiple filesystems/volumes with database files. Database
 admin needs to increase available space. You add new storage,
 configure it in cluster ... pooh, your database is restarted.
 
 “configure it in cluster” hmmm
 
 if you’re expanding an existing mount point, then I’d expect you don’t need 
 to update the cluster.
 if you’re creating a new mount point, wouldn’t you need to take the db down 
 in order to point to the new location?
 
 
 No. Those database I worked with can use multiple storage locations at the 
 same time and those storage locations can be added (and removed) online.

Nice.  In that case, you could try adding it as a resource but waiting until it 
is active before creating the ordering constraint.
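
A rough crmsh sketch of that sequence, using hypothetical names (db-fs-2 for the new storage, database for the DB resource):

```
# 1. add the new storage as a resource; nothing orders the DB after it yet
crm configure primitive db-fs-2 ocf:heartbeat:Filesystem \
    params device=/dev/vg_db/lv2 directory=/db/data2 fstype=xfs

# 2. wait until it is started
crm resource status db-fs-2

# 3. only then add the ordering constraint, so no DB restart is scheduled
crm configure order db-after-fs-2 inf: db-fs-2 database
```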

 
 
 
 ___
 Users mailing list: Users@clusterlabs.org
 http://clusterlabs.org/mailman/listinfo/users
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Question:pacemaker_remote] By the operation that remote node cannot carry out a cluster, the resource does not move. (STONITH is not carried out, too)

2015-08-17 Thread Andrew Beekhof
Should be fixed now. Thanks for the report!

 On 12 Aug 2015, at 1:20 pm, renayama19661...@ybb.ne.jp wrote:
 
 Hi All,
 
 We confirmed movement of 
 pacemaker_remote.(version:pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977)
 
 It is the following cluster constitution.
  * bl460g8n3(KVM host)
  * bl460g8n4(KVM host)
  * pgsr01(Guest on the bl460g8n3 host)
  * pgsr02(Guest on the bl460g8n4 host)
 
 
 Step 1) I compose a cluster of a simple resource.
 
 [root@bl460g8n3 ~]# crm_mon -1 -Af
 Last updated: Wed Aug 12 11:52:27 2015  Last change: Wed Aug 12 
 11:51:47 2015 by root via crm_resource on bl460g8n4
 Stack: corosync
 Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
 4 nodes and 10 resources configured
 
 Online: [ bl460g8n3 bl460g8n4 ]
 GuestOnline: [ pgsr01@bl460g8n3 pgsr02@bl460g8n4 ]
 
  prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
  prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
  Resource Group: grpStonith1
  prmStonith1-2  (stonith:external/ipmi):Started bl460g8n4
  Resource Group: grpStonith2
  prmStonith2-2  (stonith:external/ipmi):Started bl460g8n3
  Resource Group: master-group
  vip-master (ocf::heartbeat:Dummy): Started pgsr02
  vip-rep(ocf::heartbeat:Dummy): Started pgsr02
  Master/Slave Set: msPostgresql [pgsql]
  Masters: [ pgsr02 ]
  Slaves: [ pgsr01 ]
 
 Node Attributes:
 * Node bl460g8n3:
 * Node bl460g8n4:
 * Node pgsr01@bl460g8n3:
 + master-pgsql  : 5 
 * Node pgsr02@bl460g8n4:
 + master-pgsql  : 10
 
 Migration Summary:
 * Node bl460g8n4:
 * Node bl460g8n3:
 * Node pgsr02@bl460g8n4:
 * Node pgsr01@bl460g8n3:
 
 
 Step 2) I cause trouble of pacemaker_remote in pgsr02.
 
 [root@pgsr02 ~]# ps -ef |grep remote
 root  1171 1  0 11:52 ?00:00:00 /usr/sbin/pacemaker_remoted
 root  1428  1377  0 11:53 pts/000:00:00 grep --color=auto remote
 [root@pgsr02 ~]# kill -9 1171
 
 
 Step 3) After trouble, the master-group resource does not start in pgsr01.
 
 [root@bl460g8n3 ~]# crm_mon -1 -Af
 Last updated: Wed Aug 12 11:54:04 2015  Last change: Wed Aug 12 
 11:51:47 2015 by root via crm_resource on bl460g8n4
 Stack: corosync
 Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
 4 nodes and 10 resources configured
 
 Online: [ bl460g8n3 bl460g8n4 ]
 GuestOnline: [ pgsr01@bl460g8n3 ]
 
  prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
  prmDB2 (ocf::heartbeat:VirtualDomain): FAILED bl460g8n4
  Resource Group: grpStonith1
  prmStonith1-2  (stonith:external/ipmi):Started bl460g8n4
  Resource Group: grpStonith2
  prmStonith2-2  (stonith:external/ipmi):Started bl460g8n3
  Master/Slave Set: msPostgresql [pgsql]
  Masters: [ pgsr01 ]
 
 Node Attributes:
 * Node bl460g8n3:
 * Node bl460g8n4:
 * Node pgsr01@bl460g8n3:
 + master-pgsql  : 10
 
 Migration Summary:
 * Node bl460g8n4:
pgsr02: migration-threshold=1 fail-count=1 last-failure='Wed Aug 12 
 11:53:39 2015'
 * Node bl460g8n3:
 * Node pgsr01@bl460g8n3:
 
 Failed Actions:
 * pgsr02_monitor_3 on bl460g8n4 'unknown error' (1): call=2, 
 status=Error, exitreason='none',
 last-rc-change='Wed Aug 12 11:53:39 2015', queued=0ms, exec=0ms
 
 
 It seems to be caused by the fact that STONITH is not carried out somehow or 
 other.
 The demote operation that a cluster cannot handle seems to obstruct start in 
 pgsr01.
 --
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: Graph 10 with 20 actions: 
 batch-limit=20 jobs, network-delay=0ms
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action4]: Pending rsc op 
 prmDB2_stop_0   on bl460g8n4 (priority: 0, waiting:  70)
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   36]: Completed pseudo 
 op master-group_stop_0on N/A (priority: 0, waiting: none)
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   34]: Completed pseudo 
 op master-group_start_0   on N/A (priority: 0, waiting: none)
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   82]: Completed rsc op 
 pgsql_post_notify_demote_0on pgsr01 (priority: 100, waiting: none)
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   81]: Completed rsc op 
 pgsql_pre_notify_demote_0 on pgsr01 (priority: 0, waiting: none)
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   78]: Completed rsc op 
 pgsql_post_notify_stop_0  on pgsr01 (priority: 100, waiting: none)
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   77]: Completed rsc op 
 pgsql_pre_notify_stop_0   on pgsr01 (priority: 0, waiting: none)
 Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   67]: Completed pseudo 
 op msPostgresql_confirmed-post_notify_demoted_0 on N/A (priority: 100, 
 waiting: none)
 Aug 12 

Re: [ClusterLabs] [Question:pacemaker_remote] By the operation that remote node cannot carry out a cluster, the resource does not move. (STONITH is not carried out, too)

2015-08-17 Thread renayama19661014
Hi Andrew,


The correction still seems to have a problem.

The cluster is waiting for the demote, and the master-group resource cannot move.
[root@bl460g8n3 ~]# crm_mon -1 -Af
Last updated: Tue Aug 18 11:13:39 2015          Last change: Tue Aug 18 
11:11:01 2015 by root via crm_resource on bl460g8n4
Stack: corosync
Current DC: bl460g8n3 (version 1.1.13-7d0cac0) - partition with quorum
4 nodes and 10 resources configured

Online: [ bl460g8n3 bl460g8n4 ]
GuestOnline: [ pgsr02@bl460g8n4 ]

 prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
 Resource Group: grpStonith1
     prmStonith1-2      (stonith:external/ipmi):        Started bl460g8n4
 Resource Group: grpStonith2
     prmStonith2-2      (stonith:external/ipmi):        Started bl460g8n3
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pgsr02 ]

Node Attributes:
* Node bl460g8n3:
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:
    + master-pgsql                      : 10        

Migration Summary:
* Node bl460g8n3:
   pgsr01: migration-threshold=1 fail-count=1 last-failure='Tue Aug 18 11:12:03 
2015'
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:

Failed Actions:
* pgsr01_monitor_3 on bl460g8n3 'unknown error' (1): call=2, status=Error, 
exitreason='none',
    last-rc-change='Tue Aug 18 11:12:03 2015', queued=0ms, exec=0ms

(snip)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Container prmDB1 and the 
resources within it have failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing prmDB1 away from 
bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsr01 has failed 1 times on 
bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing pgsr01 away from 
bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: prmDB1: Rolling back scores 
from pgsr01Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource prmDB1 
cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsr01 cannot run 
anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsql:0: Rolling back scores 
from vip-master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsql:0 cannot run 
anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Promoting pgsql:1 (Master 
pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: msPostgresql: Promoted 1 
instances of a possible 1 to master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-master_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (10s) 
for vip-master on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-rep_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (10s) 
for vip-rep on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (9s) 
for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (9s) 
for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Impliying node pgsr01 is down 
when container prmDB1 is stopped ((nil))
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmDB1  (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmDB2  (Started 
bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmStonith1-2   
(Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmStonith2-2   
(Started bl460g8n3)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop    vip-master    
(Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop    vip-rep       
(Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Demote  pgsql:0       (Master 
- Stopped pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsr01  (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsr02  (Started 
bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' 
because of pgsql:0: blocked failed
Aug 

Re: [ClusterLabs] [Question:pacemaker_remote] About limitation of the placement of the resource to remote node.

2015-08-17 Thread Andrew Beekhof

 On 13 Aug 2015, at 10:23 am, renayama19661...@ybb.ne.jp wrote:
 
 Hi All,
 
 We confirmed movement of 
 pacemaker_remote.(version:pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977)
 
 It is the following cluster constitution.
  * sl7-01(KVM host)
  * snmp1(Guest on the sl7-01 host)
  * snmp2(Guest on the sl7-01 host)
 
 We prepared for the next CLI file to confirm the resource placement to remote 
 node.
 
 --
 property no-quorum-policy=ignore \
   stonith-enabled=false \
   startup-fencing=false
 
 rsc_defaults resource-stickiness=INFINITY \
   migration-threshold=1
 
 primitive remote-vm2 ocf:pacemaker:remote \
   params server=snmp1 \
   op monitor interval=3 timeout=15
 
 primitive remote-vm3 ocf:pacemaker:remote \
   params server=snmp2 \
   op monitor interval=3 timeout=15
 
 primitive dummy-remote-A Dummy \
   op start interval=0s timeout=60s \
   op monitor interval=30s timeout=60s \
   op stop interval=0s timeout=60s
 
 primitive dummy-remote-B Dummy \
   op start interval=0s timeout=60s \
   op monitor interval=30s timeout=60s \
   op stop interval=0s timeout=60s
 
 location loc1 dummy-remote-A \
   rule 200: #uname eq remote-vm3 \
   rule 100: #uname eq remote-vm2 \
   rule -inf: #uname eq sl7-01
 location loc2 dummy-remote-B \
   rule 200: #uname eq remote-vm3 \
   rule 100: #uname eq remote-vm2 \
   rule -inf: #uname eq sl7-01
 --
 
 Case 1) The resource is placed as follows when I spend the CLI file which we 
 prepared for.
  However, the placement of the dummy-remote resource does not meet a 
 condition.
  dummy-remote-A starts in remote-vm2.
 
 [root@sl7-01 ~]# crm_mon -1 -Af
 Last updated: Thu Aug 13 08:49:09 2015  Last change: Thu Aug 13 
 08:41:14 2015 by root via cibadmin on sl7-01
 Stack: corosync
 Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
 3 nodes and 4 resources configured
 
 Online: [ sl7-01 ]
 RemoteOnline: [ remote-vm2 remote-vm3 ]
 
  dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm2
  dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
  remote-vm2 (ocf::pacemaker:remote):Started sl7-01
  remote-vm3 (ocf::pacemaker:remote):Started sl7-01

It is possible that there was a time when only remote-vm2 was available (so we 
put dummy-remote-A there) and then before we could start dummy-remote-B there 
too, remote-vm3 showed up but due to resource-stickiness=“INFINITY”, we didn’t 
move dummy-remote-A.

 
 (snip)
 
 Case 2) When we change CLI file of it and spend it,

You lost me here :-)
Can you rephrase please?

 the resource is placed as follows.
  The resource is placed definitely.
  dummy-remote-A starts in remote-vm3.
  dummy-remote-B starts in remote-vm3.
 
 
 (snip)
 location loc1 dummy-remote-A \
   rule 200: #uname eq remote-vm3 \
   rule 100: #uname eq remote-vm2 \
   rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
   rule -inf: #uname eq sl7-01
 location loc2 dummy-remote-B \
   rule 200: #uname eq remote-vm3 \
   rule 100: #uname eq remote-vm2 \
   rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
   rule -inf: #uname eq sl7-01
 (snip)
 
 
 [root@sl7-01 ~]# crm_mon -1 -Af
 Last updated: Thu Aug 13 08:55:28 2015  Last change: Thu Aug 13 
 08:55:22 2015 by root via cibadmin on sl7-01
 Stack: corosync
 Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
 3 nodes and 4 resources configured
 
 Online: [ sl7-01 ]
 RemoteOnline: [ remote-vm2 remote-vm3 ]
 
  dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm3
  dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
  remote-vm2 (ocf::pacemaker:remote):Started sl7-01
  remote-vm3 (ocf::pacemaker:remote):Started sl7-01
 
 (snip)
 
 As for the placement of the resource being wrong with the first CLI file, the 
 placement limitation of the remote node is like remote resource not being 
 evaluated until it is done start.
 
 The placement becomes right with the CLI file which I revised, but the 
 description of this limitation is very troublesome when I compose a cluster 
 of more nodes.
 
 Does remote node not need processing delaying placement limitation until it 
 is done start?

Potentially.  I’d need a crm_report to confirm though.

 
 Is there a method to easily describe the limitation of the resource to remote 
 node?
 
  * As one means, we know that the placement of the resource goes well by 
 dividing the first CLI file into two.
* After a cluster sent CLI which remote node starts, I send CLI where a 
 cluster starts a resource.
  * However, we do not want to divide CLI file into two if possible.
 
 Best Regards,
 Hideo Yamauchi.
 
 
 ___
 Users mailing list: Users@clusterlabs.org
 http://clusterlabs.org/mailman/listinfo/users
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 

Re: [ClusterLabs] MySQL resource causes error 0_monitor_20000.

2015-08-17 Thread Andrei Borzenkov


Sent from iPhone

 On 18 Aug 2015, at 7:19, Kiwamu Okabe kiw...@gmail.com wrote:
 
 Hi all,
 
 I made master-master replication on Pacemaker.
 But it causes the error 0_monitor_20000.

It's not an error; it is just the operation name.


 If one of them boots Heartbeat and another doesn't, the error doesn't occur.
 
 What should I check?
 

You probably have to allow more than one master (the default is just one); see the
description of the master-max resource option.
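
A minimal crmsh sketch of that change, assuming the master/slave set is named mysql-clone as in the status output and that the crm shell's resource meta command is available:

```
# allow the master/slave set to promote a master on each node
crm resource meta mysql-clone set master-max 2

# equivalent meta attributes on the ms definition itself:
#   ms mysql-clone mysql \
#       meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
```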

 Thank's,
 
 Host: centillion.db01 and centillion.db02
 OS: CentOS 6.3
 Heartbeat: 3.0.5
 Pacemaker: 1.0.13
 MySQL: 5.6.16
 
 Error messages:
 
 ```
 # crm_mon
 
 Last updated: Fri Aug 14 17:28:58 2015
 Stack: Heartbeat
 Current DC: centillion.db02 (0302e3d0-df06-4847-b0f9-9ebddfb6aec7) -
 partition with quorum
 Version: 1.0.13-a83fae5
 2 Nodes configured, unknown expected votes
 2 Resources configured.
 
 
 Online: [ centillion.db01 centillion.db02 ]
 
 vip_192.168.10.200  (ocf::heartbeat:IPaddr2):   Started 
 centillion.db02
 Master/Slave Set: mysql-clone
 mysql:0(ocf::heartbeat:mysql): Master centillion.db01 FAILED
 Masters: [ centillion.db02 ]
 
 Failed actions:
mysql:0_monitor_2 (node=centillion.db01, call=166, rc=8,
 status=complete): master
mysql:0_monitor_3 (node=centillion.db01, call=167, rc=8,
 status=complete): master
 ```
 -- 
 Kiwamu Okabe at METASEPI DESIGN
 
 ___
 Users mailing list: Users@clusterlabs.org
 http://clusterlabs.org/mailman/listinfo/users
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Single quotes in values for 'crm resource rsc param set'

2015-08-17 Thread Kristoffer Grönlund
Vladislav Bogdanov bub...@hoster-ok.com writes:

 Hi Kristoffer, all.

 Could you please look why I get error when trying to update valid 
 resource value (which already has single quotes inside) with the 
 slightly different one by running the command in the subject?

 It looks like is_value_sane() doesn't accept single quotes just because 
 crmsh quotes all arguments to crm_resource with them. I need to pass a 
 command-line with semicolons in one of parameters which is run with eval 
 in the resource agent. Backslashed double-quoting does not work in this 
 case, but single-quotes work fine.

 Could that be some-how fixed?

Well, first of all passing the command line through bash complicates
things, so if that's what is causing you trouble you could try writing
your command line to a file and passing it to crmsh using crm -f file.
Another option is using crm -f - and piping the command line into
crmsh.
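
For example, something along these lines (a sketch; whether the value then needs surrounding double quotes inside the file is exactly what the rest of the thread digs into):

```
# write the crmsh command to a file...
cat > set-subagent.cli <<'EOF'
resource param staging-0-fs set subagent "/sbin/fs-io-throttle %a staging-0 /cluster/storage/staging-0 zone 0 '5000M:300;2500M:100;1500M:50;1000M:35;500M:10;400M:mm'"
EOF
crm -f set-subagent.cli

# ...or pipe it in on stdin
crm -f - < set-subagent.cli
```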

If that doesn't help, it would help /me/ in figuring out just what the
problem is if you could give me an example of what the current value is
and what it is you are trying to set it to. :)

Thanks!
Kristoffer



 Best,
 Vladislav


-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] upgrade from 1.1.9 to 1.1.12 fails to start

2015-08-17 Thread Streeter, Michelle N
I was recommended to upgrade from 1.1.9 to 1.1.12.
I had to uninstall the 1.1.9 version to install the 1.1.12 version
I am not allowed to connect to a repo and so I have to download the rpms and 
install them individually.
After I installed pacemaker-lib, cli, cluster-lib, and pacemaker itself, when I 
rebooted, the cluster failed to start
When I tried to manually start it, I got
Starting Pacemaker Cluster Manager/etc/init.d/pacemaker: line 94:  8219 
Segmentation fault  (core dumped) $prog > /dev/null 2>&1
I deleted the Cluster.conf file and the cib.xml and all the back up versions 
and tried again and got the same error.
I googled this error and really got nothing.   Any ideas?

Michelle Streeter
ASC2 MCS - SDE/ACL/SDL/EDL OKC Software Engineer
The Boeing Company

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] upgrade from 1.1.9 to 1.1.12 fails to start

2015-08-17 Thread Digimer
On 17/08/15 05:13 PM, Streeter, Michelle N wrote:
 I was recommended to upgrade from 1.1.9 to 1.1.12. 
 
 I had to uninstall the 1.1.9 version to install the 1.1.12 version
 
 I am not allowed to connect to a repo and so I have to download the rpms
 and install them individually.
 
 After I installed pacemaker-lib, cli, cluster-lib, and pacemaker itself,
 when I rebooted, the cluster failed to start
 
 When I tried to manually start it, I got
 
 Starting Pacemaker Cluster Manager/etc/init.d/pacemaker: line 94:  8219
 Segmentation fault  (core dumped) $prog > /dev/null 2>&1
 
 I deleted the Cluster.conf file and the cib.xml and all the back up
 versions and tried again and got the same error.
 
 I googled this error and really got nothing.   Any ideas?

As a test, can you create a fresh, new cluster?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Memory leak in crm_mon ?

2015-08-17 Thread Andrew Beekhof

 On 17 Aug 2015, at 4:35 pm, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
 wrote:
 
 Andrew Beekhof and...@beekhof.net wrote on 17.08.2015 at 00:08 in
 message
 ff78be4f-173c-4a74-a989-92ea6c540...@beekhof.net:
 
 On 16 Aug 2015, at 9:41 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 Hi Andrew,
 
 I managed to isolate / reproduce the issue. You might want to take a look,
 
 as it might be present in 1.1.12 as well.
 
 I monitor my cluster from putty, mainly this way:
 - I have a putty (Windows client) session, that connects via SSH to the
 box, 
 authenticates using public key as a non-root user.
 - It immediately sends a sudo crm_mon -Af command, so with a single click
 
 I have a nice view of what the cluster is doing.
 
 Perhaps add -1 to the option list.
 The root cause seems to be that closing the putty window doesn’t actually
 
 kill the process running inside it.
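
If the goal is just a quick snapshot rather than a long-running session, a one-shot call leaves no process behind, e.g.:

```
# print the status once (fail counts and node attributes included) and exit
sudo crm_mon -1 -Af
```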
 
 Sorry, the root cause seems to be that crm_mon happily writes to a closed
 filehandle (I guess). If crm_mon handled that error by exiting the loop,
 there would be no need for putty to kill any process.

No, if you want a process to die you need to kill it.

 
 
 
 Whenever I close this putty window (terminate the app), crm_mon process
 gets 
 to 100% cpu usage, starts to leak, in a few hours consumes all memory and 
 then destroys the whole cluster.
 This does not happen if I leave crm_mon with Ctrl-C.
 
 I can reproduce this 100% with crm_mon 1.1.10, with the mainstream ubuntu 
 trusty packages.
 This might be related on how sudo executes crm_mon, and what it signalls to
 
 crm_mon when it gets terminated.
 
 Now I know what I need to pay attention to in order to avoid this problem,
 
 but you might want to check whether this issue is still present.
 
 
 Thanks,
 Attila 
 
 
 
 
 
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
 Sent: Friday, August 14, 2015 12:40 AM
 To: Cluster Labs - All topics related to open-source clustering welcomed 
 users@clusterlabs.org
 Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
 
 
 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net] 
 Sent: Tuesday, August 11, 2015 2:49 AM
 To: Cluster Labs - All topics related to open-source clustering welcomed 
 users@clusterlabs.org
 Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
 
 
 On 10 Aug 2015, at 5:33 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 Hi!
 
 We are building a new cluster on top of pacemaker/corosync and several
 times 
 during the past days we noticed that „crm_mon -Af” used up all the 
 memory+swap and caused high CPU usage. Killing the process solves the
 issue.
 
 We are using the binary package versions available in the latest ubuntu 
 trusty, namely:
 
 crmsh 
 1.2.5+hg1034-1ubuntu4 
 
 pacemaker
 1.1.10+git20130802-1ubuntu2.3  
 pacemaker-cli-utils1.1.10+git20130802-1ubuntu2.3 
 
 corosync 2.3.3-1ubuntu1   
 
 Kernel is 3.13.0-46-generic
 
 Looking back some „atop” data, the CPU went to 100% many times during
 the 
 last couple of days, at various times, more often around midnight exactly 
 (strange).
 
 08.05 14:00
 08.06 21:41
 08.07 00:00
 08.07 00:00
 08.08 00:00
 08.09 06:27
 
 Checked the corosync log and syslog, but did not find any correlation 
 between the entries in the logs around the specific times.
 For most of the time, the node running the crm_mon was the DC as well –
 not 
 running any resources (e.g. a pairless node for quorum).
 
 
 We have another running system, where everything works perfecly, whereas
 it 
 is almost the same:
 
 crmsh 
 1.2.5+hg1034-1ubuntu4 
 
 pacemaker
 1.1.10+git20130802-1ubuntu2.1 
 pacemaker-cli-utils1.1.10+git20130802-1ubuntu2.1 
 corosync 2.3.3-1ubuntu1  
 
 Kernel is 3.13.0-8-generic
 
 
 Is this perhaps a known issue?
 
 Possibly, that version is over 2 years old.
 
 Any hints?
 
 Getting something a little more recent would be the best place to start
 
 Thanks Andew,
 
 I tried to upgrade to 1.1.12 using the packages available at 
 https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a
 
 single node, to see how it works out but I ended up with errors like
 
 Could not establish cib_rw connection: Connection refused (111)
 
 I have disabled the firewall, no changes. The node appears to be running
 but 
 does not see any of the other nodes. On the other nodes I see this node as
 an 
 UNCLEAN one. (I assume corosync is fine, but pacemaker not)
 I use udpu for the transport.
 
 Am I doing something wrong? I tried to look for some howtos on upgrade, 

[ClusterLabs] Antw: nfsServer Filesystem Failover average 76s

2015-08-17 Thread Ulrich Windl
 Streeter, Michelle N michelle.n.stree...@boeing.com wrote on 14.08.2015
at 19:17 in message
9a18847a77a9a14da7e0fd240efcafc2502...@xch-phx-501.sw.nos.boeing.com:
 I am getting an average failover for nfs of 76s.   I have set all the start 
 and stop settings to 10s but no change. The Web page is instant but not nfs.

Did you try options -o and -t for crm_mon? I get some timing values then:

e.g.:
+ (70) start: last-rc-change='Thu Jul  9 16:55:35 2015' last-run='Thu Jul  
9 16:55:35 2015' exec-time=5572ms queue-time=0ms rc=0 (ok)
+ (129) monitor: interval=30ms last-rc-change='Fri Jul 10 12:55:29 
2015' exec-time=16ms queue-time=0ms rc=0 (ok)
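
For reference, a one-shot invocation that prints these per-operation timings might look like this (the -o and -t options mentioned above, plus -1 to exit after one status dump):

```
# show status once, including operation history (-o) and timing details (-t)
crm_mon -1 -o -t
```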

The other thing is to watch syslog for timing of events.

 
 I am running two node cluster on rhel6 with pacemaker 1.1.9
 
 Surely these times are not right?  Any suggestions?
 
 Resources:
 Group: nfsgroup
   Resource: nfsshare (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/sdb1 directory=/data fstype=ext4
Operations: start interval=0s (nfsshare-start-interval-0s)
stop interval=0s (nfsshare-stop-interval-0s)
monitor interval=10s (nfsshare-monitor-interval-10s)
   Resource: nfsServer (class=ocf provider=heartbeat type=nfsserver)
Attributes: nfs_shared_infodir=/data/nfsinfo nfs_no_notify=true
Operations: start interval=0s timeout=10s (nfsServer-start-timeout-10s)
stop interval=0s timeout=10s (nfsServer-stop-timeout-10s)
monitor interval=10 timeout=20s (nfsServer-monitor-interval-10)
   Resource: NAS (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=192.168.56.110 cidr_netmask=24
Operations: start interval=0s timeout=20s (NAS-start-timeout-20s)
stop interval=0s timeout=20s (NAS-stop-timeout-20s)
monitor interval=10s timeout=20s (NAS-monitor-interval-10s)
 
 Michelle Streeter
 ASC2 MCS - SDE/ACL/SDL/EDL OKC Software Engineer
 The Boeing Company





___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org