[ClusterLabs] Antw: Re: Memory leak in crm_mon ?

2015-08-16 Thread Ulrich Windl
>>> Andrew Beekhof wrote on 17.08.2015 at 00:08 in message:

>> On 16 Aug 2015, at 9:41 pm, Attila Megyeri wrote:
>> 
>> Hi Andrew,
>> 
>> I managed to isolate / reproduce the issue. You might want to take a look,
>> as it might be present in 1.1.12 as well.
>> 
>> I monitor my cluster from putty, mainly this way:
>> - I have a putty (Windows client) session, that connects via SSH to the box,
>> authenticates using public key as a non-root user.
>> - It immediately sends a "sudo crm_mon -Af" command, so with a single click
>> I have a nice view of what the cluster is doing.
> 
> Perhaps add -1 to the option list.
> The root cause seems to be that closing the putty window doesn’t actually
> kill the process running inside it.

Sorry, the root cause seems to be that crm_mon happily writes to a closed
filehandle (I guess). If crm_mon handled that error by exiting the loop,
there would be no need for putty to kill any process.
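
For reference, until crm_mon exits on its own in that situation, a leftover
monitor is easy to spot and clean up from another shell (the pattern below
matches the invocation used in this thread; adjust it to yours):

    pgrep -af 'crm_mon -Af'       # list any crm_mon still running after the session died
    sudo pkill -f 'crm_mon -Af'   # stop it before it exhausts the node's memory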

> 
>> 
>> Whenever I close this putty window (terminate the app), crm_mon process gets
>> to 100% cpu usage, starts to leak, in a few hours consumes all memory and
>> then destroys the whole cluster.
>> This does not happen if I leave crm_mon with Ctrl-C.
>> 
>> I can reproduce this 100% with crm_mon 1.1.10, with the mainstream ubuntu
>> trusty packages.
>> This might be related to how sudo executes crm_mon, and what it signals to
>> crm_mon when it gets terminated.
>> 
>> Now I know what I need to pay attention to in order to avoid this problem,
>> but you might want to check whether this issue is still present.
>> 
>> 
>> Thanks,
>> Attila 
>> 
>> 
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
>> Sent: Friday, August 14, 2015 12:40 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
>> 
>> 
>> 
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net] 
>> Sent: Tuesday, August 11, 2015 2:49 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
>> 
>> 
>>> On 10 Aug 2015, at 5:33 pm, Attila Megyeri wrote:
>>> 
>>> Hi!
>>> 
>>> We are building a new cluster on top of pacemaker/corosync and several times
>>> during the past days we noticed that „crm_mon -Af” used up all the
>>> memory+swap and caused high CPU usage. Killing the process solves the issue.
>>> 
>>> We are using the binary package versions available in the latest ubuntu
>>> trusty, namely:
>>> 
>>> crmsh                  1.2.5+hg1034-1ubuntu4
>>> pacemaker              1.1.10+git20130802-1ubuntu2.3
>>> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.3
>>> corosync               2.3.3-1ubuntu1
>>> 
>>> Kernel is 3.13.0-46-generic
>>> 
>>> Looking back some „atop” data, the CPU went to 100% many times during the
>>> last couple of days, at various times, more often around midnight exactly
>>> (strange).
>>> 
>>> 08.05 14:00
>>> 08.06 21:41
>>> 08.07 00:00
>>> 08.07 00:00
>>> 08.08 00:00
>>> 08.09 06:27
>>> 
>>> Checked the corosync log and syslog, but did not find any correlation
>>> between the entries in the logs around the specific times.
>>> For most of the time, the node running the crm_mon was the DC as well – not
>>> running any resources (e.g. a pairless node for quorum).
>>> 
>>> 
>>> We have another running system, where everything works perfectly, whereas it
>>> is almost the same:
>>> 
>>> crmsh                  1.2.5+hg1034-1ubuntu4
>>> pacemaker              1.1.10+git20130802-1ubuntu2.1
>>> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.1
>>> corosync               2.3.3-1ubuntu1
>>> 
>>> Kernel is 3.13.0-8-generic
>>> 
>>> 
>>> Is this perhaps a known issue?
>> 
>> Possibly, that version is over 2 years old.
>> 
>>> Any hints?
>> 
>> Getting something a little more recent would be the best place to start
>> 
>> Thanks Andrew,
>> 
>> I tried to upgrade to 1.1.12 using the packages available at
>> https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a
>> single node, to see how it works out, but I ended up with errors like
>> 
>> Could not establish cib_rw connection: Connection refused (111)
>> 
>> I have disabled the firewall, no changes. The node appears to be running but
>> does not see any of the other nodes. On the other nodes I see this node as an
>> UNCLEAN one. (I assume corosync is fine, but pacemaker not)
>> I use udpu for the transport.
>> 
>> Am I doing something wrong? I tried to look for some howtos on upgrade, but
>> the on

[ClusterLabs] Antw: nfsServer Filesystem Failover average 76s

2015-08-16 Thread Ulrich Windl
>>> "Streeter, Michelle N"  schrieb am 
>>> 14.08.2015
um 19:17 in Nachricht
<9a18847a77a9a14da7e0fd240efcafc2502...@xch-phx-501.sw.nos.boeing.com>:
> I am getting an average failover for nfs of 76s.   I have set all the start 
> and stop settings to 10s but no change. The Web page is instant but not nfs.

Did you try options -o and -t for crm_mon? I get some timing values then:

e.g.:
+ (70) start: last-rc-change='Thu Jul  9 16:55:35 2015' last-run='Thu Jul  9 16:55:35 2015' exec-time=5572ms queue-time=0ms rc=0 (ok)
+ (129) monitor: interval=30ms last-rc-change='Fri Jul 10 12:55:29 2015' exec-time=16ms queue-time=0ms rc=0 (ok)

The other thing is to watch syslog for timing of events.
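
For a one-shot view that includes the operation history and the timing details
shown above, something like this should do (option names as listed by the
crm_mon help; -o prints operations, -t adds timing):

    crm_mon -1 -o -t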

> 
> I am running two node cluster on rhel6 with pacemaker 1.1.9
> 
> Surely these times are not right?  Any suggestions?
> 
> Resources:
> Group: nfsgroup
>   Resource: nfsshare (class=ocf provider=heartbeat type=Filesystem)
>Attributes: device=/dev/sdb1 directory=/data fstype=ext4
>Operations: start interval=0s (nfsshare-start-interval-0s)
>stop interval=0s (nfsshare-stop-interval-0s)
>monitor interval=10s (nfsshare-monitor-interval-10s)
>   Resource: nfsServer (class=ocf provider=heartbeat type=nfsserver)
>Attributes: nfs_shared_infodir=/data/nfsinfo nfs_no_notify=true
>Operations: start interval=0s timeout=10s (nfsServer-start-timeout-10s)
>stop interval=0s timeout=10s (nfsServer-stop-timeout-10s)
>monitor interval=10 timeout=20s (nfsServer-monitor-interval-10)
>   Resource: NAS (class=ocf provider=heartbeat type=IPaddr2)
>Attributes: ip=192.168.56.110 cidr_netmask=24
>Operations: start interval=0s timeout=20s (NAS-start-timeout-20s)
>stop interval=0s timeout=20s (NAS-stop-timeout-20s)
>monitor interval=10s timeout=20s (NAS-monitor-interval-10s)
> 
> Michelle Streeter
> ASC2 MCS - SDE/ACL/SDL/EDL OKC Software Engineer
> The Boeing Company







Re: [ClusterLabs] Antw: Ordering constraint restart second resource group

2015-08-16 Thread Andrei Borzenkov

17.08.2015 02:26, Andrew Beekhof writes:



On 13 Aug 2015, at 7:33 pm, Andrei Borzenkov  wrote:

On Thu, Aug 13, 2015 at 11:25 AM, Ulrich Windl wrote:

And what exactly is your problem?


Real life example. Database resource depends on storage resource(s).
There are multiple filesystems/volumes with database files. Database
admin needs to increase available space. You add new storage,
configure it in cluster ... pooh, your database is restarted.


“configure it in cluster” hmmm

if you’re expanding an existing mount point, then I’d expect you don’t need to 
update the cluster.
if you’re creating a new mount point, wouldn’t you need to take the db down in 
order to point to the new location?



No. The databases I worked with can use multiple storage locations at 
the same time, and those storage locations can be added (and removed) online.
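
If avoiding the restart is the only goal, one workaround sometimes suggested is
to make the ordering advisory rather than mandatory. A sketch in crmsh syntax,
using the group names from the original post, and assuming the dependent group
does not strictly need the new member at start-up:

    crm configure order grpA-then-grpB Optional: resource_group_A resource_group_B
    # kind=Optional only orders actions that fall into the same transition,
    # so later changes to resource_group_A do not force resource_group_B down.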






Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

2015-08-16 Thread renayama19661014
Hi Andrew,

Thank you for comments.


I will confirm it tomorrow.
I am on vacation today.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Andrew Beekhof 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2015/8/17, Mon 09:30
> Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already 
> started is transmitted.
> 
> 
>>  On 4 Aug 2015, at 7:36 pm, renayama19661...@ybb.ne.jp wrote:
>> 
>>  Hi Andrew,
>> 
>>  Thank you for comments.
>> 
  However, a trap of crm_mon is sent to an SNMP manager.
>>>   
>>>  Are you using the built-in SNMP logic or using -E to give crm_mon a script which
>>>  is then producing the trap?
>>>  (I’m trying to figure out who could be turning the monitor action into a start)
>> 
>> 
>>  I used the built-in SNMP.
>>  I started as a daemon with -d option.
> 
> Is it running on both nodes or just snmp1?
> Because there is no logic in crm_mon that would have remapped the monitor to 
> start, so my working theory is that it's a duplicate of an old event.
> Can you tell which node the trap is being sent from?
> 
>> 
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>>  From: Andrew Beekhof 
>>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed 
>>>  Cc: 
>>>  Date: 2015/8/4, Tue 14:15
>>>  Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.
>>> 
>>> 
  On 27 Jul 2015, at 4:18 pm, renayama19661...@ybb.ne.jp wrote:
 
  Hi All,
 
  The transmission of the SNMP trap of crm_mon seems to have a problem.
  I identified a problem on latest Pacemaker and Pacemaker1.1.13.
 
 
  Step 1) I constitute a cluster and send simple CLI file.
 
  [root@snmp1 ~]# crm_mon -1 
  Last updated: Mon Jul 27 14:40:37 2015          Last change: Mon Jul 27 14:40:29 2015 by root via cibadmin on snmp1
  Stack: corosync
  Current DC: snmp1 (version 1.1.13-3d781d3) - partition with quorum
  2 nodes and 1 resource configured
 
  Online: [ snmp1 snmp2 ]
 
    prmDummy       (ocf::heartbeat:Dummy): Started snmp1
 
  Step 2) I stop a node of the standby once.
 
  [root@snmp2 ~]# stop pacemaker
  pacemaker stop/waiting
 
 
  Step 3) I start a node of the standby again.
  [root@snmp2 ~]# start pacemaker
  pacemaker start/running, process 2284
 
  Step 4) The indication of crm_mon does not change in particular.
  [root@snmp1 ~]# crm_mon -1
  Last updated: Mon Jul 27 14:45:12 2015          Last change: Mon Jul 27 14:40:29 2015 by root via cibadmin on snmp1
  Stack: corosync
  Current DC: snmp1 (version 1.1.13-3d781d3) - partition with quorum
  2 nodes and 1 resource configured
 
  Online: [ snmp1 snmp2 ]
 
    prmDummy       (ocf::heartbeat:Dummy): Started snmp1
 
 
  In addition, as for the resource that started in snmp1 node, nothing changes.
 
  ---
  Jul 27 14:41:39 snmp1 crmd[29116]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
  Jul 27 14:41:39 snmp1 cib[29111]:     info: Completed cib_modify operation for section status: OK (rc=0, origin=snmp1/attrd/11, version=0.4.20)
  Jul 27 14:41:39 snmp1 attrd[29114]:     info: Update 11 for probe_complete: OK (0)
  Jul 27 14:41:39 snmp1 attrd[29114]:     info: Update 11 for probe_complete[snmp1]=true: OK (0)
  Jul 27 14:41:39 snmp1 attrd[29114]:     info: Update 11 for probe_complete[snmp2]=true: OK (0)
  Jul 27 14:41:39 snmp1 cib[29202]:     info: Wrote version 0.4.0 of the CIB to disk (digest: a1f1920279fe0b1466a79cab09fa77d6)
  Jul 27 14:41:39 snmp1 pengine[29115]:   notice: On loss of CCM Quorum: Ignore
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: Node snmp2 is online
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: Node snmp1 is online
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: prmDummy#011(ocf::heartbeat:Dummy):#011Started snmp1
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: Leave  prmDummy#011(Started snmp1)
  ---
 
  However, a trap of crm_mon is sent to an SNMP manager.
>>> 
>>>  Are you using the built-in SNMP logic or using -E to give crm_mon a script which
>>>  is then producing the trap?
>>>  (I’m trying to figure out who could be turning the monitor action into a start)
>>> 
  The resource does not reboot, but the SNMP trap which a resource started is sent.
 
  ---
  Jul 27 14:41:39 SNMP-MANAGER snmptrapd[4521]: 2015-07-27 14:41:39 snmp1 [UDP: [192.168.40.100]:35265->[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance

Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

2015-08-16 Thread Andrew Beekhof

> On 4 Aug 2015, at 7:36 pm, renayama19661...@ybb.ne.jp wrote:
> 
> Hi Andrew,
> 
> Thank you for comments.
> 
>>> However, a trap of crm_mon is sent to an SNMP manager.
>>  
>> Are you using the built-in SNMP logic or using -E to give crm_mon a script 
>> which 
>> is then producing the trap?
>> (I’m trying to figure out who could be turning the monitor action into a 
>> start)
> 
> 
> I used the built-in SNMP.
> I started as a daemon with -d option.

Is it running on both nodes or just snmp1?
Because there is no logic in crm_mon that would have remapped the monitor to 
start, so my working theory is that it's a duplicate of an old event.
Can you tell which node the trap is being sent from?
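
One way to narrow that down is to run the external-agent mode next to (or
instead of) the built-in SNMP sender on each node and log what that node thinks
it is reporting. A sketch only: the -d/-E options are the ones already discussed
in this thread, the script path is hypothetical, and the CRM_notify_* variable
names should be checked against your crm_mon documentation:

    crm_mon -d -E /usr/local/bin/crm-notify-log.sh

    # /usr/local/bin/crm-notify-log.sh
    #!/bin/sh
    logger -t crm_mon-notify \
        "node=$CRM_notify_node rsc=$CRM_notify_rsc task=$CRM_notify_task rc=$CRM_notify_rc status=$CRM_notify_status"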

> 
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>> From: Andrew Beekhof 
>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
>> open-source clustering welcomed 
>> Cc: 
>> Date: 2015/8/4, Tue 14:15
>> Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already 
>> started is transmitted.
>> 
>> 
>>> On 27 Jul 2015, at 4:18 pm, renayama19661...@ybb.ne.jp wrote:
>>> 
>>> Hi All,
>>> 
>>> The transmission of the SNMP trap of crm_mon seems to have a problem.
>>> I identified a problem on latest Pacemaker and Pacemaker1.1.13.
>>> 
>>> 
>>> Step 1) I constitute a cluster and send simple CLI file.
>>> 
>>> [root@snmp1 ~]# crm_mon -1 
>>> Last updated: Mon Jul 27 14:40:37 2015  Last change: Mon Jul 27 14:40:29 2015 by root via cibadmin on snmp1
>>> Stack: corosync
>>> Current DC: snmp1 (version 1.1.13-3d781d3) - partition with quorum
>>> 2 nodes and 1 resource configured
>>> 
>>> Online: [ snmp1 snmp2 ]
>>> 
>>>   prmDummy   (ocf::heartbeat:Dummy): Started snmp1
>>> 
>>> Step 2) I stop a node of the standby once.
>>> 
>>> [root@snmp2 ~]# stop pacemaker
>>> pacemaker stop/waiting
>>> 
>>> 
>>> Step 3) I start a node of the standby again.
>>> [root@snmp2 ~]# start pacemaker
>>> pacemaker start/running, process 2284
>>> 
>>> Step 4) The indication of crm_mon does not change in particular.
>>> [root@snmp1 ~]# crm_mon -1
>>> Last updated: Mon Jul 27 14:45:12 2015  Last change: Mon Jul 27 14:40:29 2015 by root via cibadmin on snmp1
>>> Stack: corosync
>>> Current DC: snmp1 (version 1.1.13-3d781d3) - partition with quorum
>>> 2 nodes and 1 resource configured
>>> 
>>> Online: [ snmp1 snmp2 ]
>>> 
>>>   prmDummy   (ocf::heartbeat:Dummy): Started snmp1
>>> 
>>> 
>>> In addition, as for the resource that started in snmp1 node, nothing changes.
>>> 
>>> ---
>>> Jul 27 14:41:39 snmp1 crmd[29116]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>> Jul 27 14:41:39 snmp1 cib[29111]: info: Completed cib_modify operation for section status: OK (rc=0, origin=snmp1/attrd/11, version=0.4.20)
>>> Jul 27 14:41:39 snmp1 attrd[29114]: info: Update 11 for probe_complete: OK (0)
>>> Jul 27 14:41:39 snmp1 attrd[29114]: info: Update 11 for probe_complete[snmp1]=true: OK (0)
>>> Jul 27 14:41:39 snmp1 attrd[29114]: info: Update 11 for probe_complete[snmp2]=true: OK (0)
>>> Jul 27 14:41:39 snmp1 cib[29202]: info: Wrote version 0.4.0 of the CIB to disk (digest: a1f1920279fe0b1466a79cab09fa77d6)
>>> Jul 27 14:41:39 snmp1 pengine[29115]:   notice: On loss of CCM Quorum: Ignore
>>> Jul 27 14:41:39 snmp1 pengine[29115]: info: Node snmp2 is online
>>> Jul 27 14:41:39 snmp1 pengine[29115]: info: Node snmp1 is online
>>> Jul 27 14:41:39 snmp1 pengine[29115]: info: prmDummy#011(ocf::heartbeat:Dummy):#011Started snmp1
>>> Jul 27 14:41:39 snmp1 pengine[29115]: info: Leave  prmDummy#011(Started snmp1)
>>> ---
>>> 
>>> However, a trap of crm_mon is sent to an SNMP manager.
>> 
>> Are you using the built-in SNMP logic or using -E to give crm_mon a script which
>> is then producing the trap?
>> (I’m trying to figure out who could be turning the monitor action into a start)
>> 
>>> The resource does not reboot, but the SNMP trap which a resource started is sent.
>>> 
>>> ---
>>> Jul 27 14:41:39 SNMP-MANAGER snmptrapd[4521]: 2015-07-27 14:41:39 snmp1 [UDP: [192.168.40.100]:35265->[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1437975699) 166 days, 10:22:36.99#011SNMPv2-MIB::snmpTrapOID.0 = OID: PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: "prmDummy"#011PACEMAKER-MIB::pacemakerNotificationNode = STRING: "snmp1"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: "start"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: "OK"#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 0
>>> Jul 27 14:41:39 SNMP-MANAGER snmptrapd

Re: [ClusterLabs] Phantom Node

2015-08-16 Thread Andrew Beekhof

> On 14 Aug 2015, at 7:53 am, Allan Brand  wrote:
> 
> I can't seem to track this down and am hoping someone has seen this or can 
> tell me what's happening.

Try this:

- shut down the cluster
- remove the stray node entry from the cib (/var/lib/pacemaker/cib/cib.xml)
- delete the .sig file (/var/lib/pacemaker/cib/cib.xml.sig)
- clear the logs
- start the cluster

if you see the node come back, send us the logs and we should be able to 
determine where it's coming from :)

possibility… does uname -n return node01 or node01.private ? same for node02?
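
Roughly, in shell form (a sketch only; paths assume the default layout and a
cman-based cluster like the one below):

    service pacemaker stop && service cman stop     # on every node
    vi /var/lib/pacemaker/cib/cib.xml               # delete the stray <node .../> entry for "node01"
    rm -f /var/lib/pacemaker/cib/cib.xml.sig
    > /var/log/cluster/corosync.log                 # or wherever this system keeps the cluster logs
    service cman start && service pacemaker start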

> 
> I have a 2 node test cluster, node01.private and node02.private.
> 
> [root@node01 ~]# cat /etc/hosts
> 127.0.0.1   localhost
> ::1 localhost
> 
> 192.168.168.9   node01.private
> 192.168.168.10  node02.private
> 192.168.168.14  cluster.private
> 
> The issue is when I run 'pcs status' it shows both nodes online but a 3rd 
> node, node01, to be offline:
> 
> [root@node01 ~]# pcs status
> Cluster name: cluster.private
> Last updated: Thu Aug 13 16:41:54 2015
> Last change: Wed Aug 12 18:23:22 2015
> Stack: cman
> Current DC: node01.private - partition with quorum
> Version: 1.1.11-97629de
> 3 Nodes configured
> 1 Resources configured
> 
> 
> Online: [ node01.private node02.private ]
> OFFLINE: [ node01 ]
> 
> Full list of resources:
> 
>  privateIP  (ocf::heartbeat:IPaddr2):   Started node01.private
> 
> [root@node01 ~]#
> [root@node01 ~]# pcs config
> Cluster Name: cluster.private
> Corosync Nodes:
>  node01.private node02.private
> Pacemaker Nodes:
>  node01 node01.private node02.private
> 
> Resources:
>  Resource: privateIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=192.168.168.14 cidr_netmask=32
>   Operations: start interval=0s timeout=20s (privateIP-start-interval-0s)
>   stop interval=0s timeout=20s (privateIP-stop-interval-0s)
>   monitor interval=30s (privateIP-monitor-interval-30s)
> 
> Stonith Devices:
> Fencing Levels:
> 
> Location Constraints:
>   Resource: privateIP
> Enabled on: node01.private (score:INFINITY) 
> (id:location-privateIP-node01.private-INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> 
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: cman
>  dc-version: 1.1.11-97629de
>  expected-quorum-votes: 2
>  no-quorum-policy: ignore
>  stonith-enabled: false
> [root@node01 ~]#
> [root@node01 ~]# cat /etc/cluster/cluster.conf
> [cluster.conf contents not preserved by the list archive]
> 
> [root@node01 ~]#
> 
> 
> Everything appears to be working correctly, just that phantom offline node 
> shows up.
> 
> Thanks,
> Allan




Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: Antw: pacemaker doesn't correctly handle a resource after time/date change

2015-08-16 Thread Andrew Beekhof

> On 8 Aug 2015, at 12:43 am, Kostiantyn Ponomarenko wrote:
> 
> Hi Andrew,
> 
> So the issue is:
> 
> Having one node up and running, set time on the node backward to, say, 15 min 
> (generally more than 10 min), then do "stop" for a resource.
> That leads to the following - the cluster fails the resource once, then shows it 
> as "started", but the resource actually remains "stopped".
> 
> Do you need more input from me on the issue?

I think “why” :)

I’m struggling to imagine why this would need to happen.

> 
> Thank you,
> Kostya
> 
> On Wed, Aug 5, 2015 at 3:01 AM, Andrew Beekhof  wrote:
> 
> > On 4 Aug 2015, at 7:31 pm, Kostiantyn Ponomarenko wrote:
> >
> >
> > On Tue, Aug 4, 2015 at 3:57 AM, Andrew Beekhof  wrote:
> > Github might be another.
> >
> > I am not able to open an issue/bug here 
> > https://github.com/ClusterLabs/pacemaker
> 
> Oh, for pacemaker bugs see http://clusterlabs.org/help.html
> Can someone clearly state what the issue is?  The thread was quite fractured 
> and hard to follow.
> 
> >
> > Thank you,
> > Kostya
> 
> 
> 




Re: [ClusterLabs] Delayed first monitoring

2015-08-16 Thread Andrew Beekhof

> On 13 Aug 2015, at 2:20 am, Ken Gaillot  wrote:
> 
> On 08/12/2015 10:45 AM, Miloš Kozák wrote:
>> Thank you for your answer, but.
>> 
>> 1) This sounds ok, but in other words it means a delayed first check
>> is not possible.
>> 
>> 2) Start of init script? I follow lsb scripts from the distribution, so
>> there is no way to change them (I can change them, but with package
>> upgrades they will go void). This is quite a typical approach; how can I do
>> HA for atlassian for example? Jira loads 5 minutes..
> 
> I think your situation involves multiple issues which are worth
> separating for clarity:
> 
> 1. As Alexander mentioned, Pacemaker will do a monitor BEFORE trying to
> start a service, to make sure it's not already running. So these don't
> need any delay and are expected to "fail".
> 
> 2. Resource agents MUST NOT return success for "start" until the service
> is fully up and running, so the next monitor should succeed, again
> without needing any delay. If that's not the case, it's a bug in the agent.

Consider the ordering constraint “start A then B”.

Regardless of whether you delay A’s monitor operation, B is going to expect A 
is up when “start A” completes.
So it should only indicate completion once it's actually usable.

> 
> 3. It's generally better to use OCF resource agents whenever available,
> as they have better integration with pacemaker than lsb/systemd/upstart.
> In this case, take a look at ocf:heartbeat:apache.
> 
> 4. You can configure the timeout used with each action (stop, start,
> monitor, restart) on a given resource. The default is 20 seconds. For
> example, if a "start" action is expected to take 5 minutes, you would
> define a start operation on the resource with timeout=300s. How you do
> that depends on your management tool (pcs, crmsh, or cibadmin).
> 
> Bottom line, you should never need a delay on the monitor, instead set
> appropriate timeouts for each action, and make sure that the agent does
> not return from "start" until the service is fully up.
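
As an illustration of point 4 above (pcs syntax; 300s is just the example
figure, and the resource name comes from this thread):

    pcs resource op add httpd start interval=0s timeout=300s
    # or set it at creation time:
    pcs resource create httpd ocf:heartbeat:apache \
        op start timeout=300s op monitor interval=10s timeout=20s
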
> 
>>> On 12.8.2015 at 16:14, Nekrasov, Alexander wrote:
>>> 1. Pacemaker will/may call a monitor before starting a resource, in
>>> which case it expects a NOT_RUNNING response. It's just checking
>>> assumptions at that point.
>>> 
>>> 2. A resource::start must only return when resource::monitor is
>>> successful. Basically the logic of a start() must follow this:
>>> 
>>> start() {
>>>   start_daemon                # launch the service
>>>   while ! monitor ; do        # only return once monitor reports success
>>>       sleep 1
>>>   done
>>>   return $OCF_SUCCESS
>>> }
>>> 
 -Original Message-
 From: Miloš Kozák [mailto:milos.ko...@lejmr.com]
 Sent: Wednesday, August 12, 2015 10:03 AM
 To: users@clusterlabs.org
 Subject: [ClusterLabs] Delayed first monitoring
 
 Hi,
 
 I have set up CoroSync+CMAN+Pacemaker on CentOS 6.5 in order to
 provide high-availability of opennebula. However, I am facing a
 strange problem which arises from my lack of knowledge..
 
 In the log I can see that when I create a resource based on an init
 script, typically:
 
 pcs resource create httpd lsb:httpd
 
 The httpd daemon gets started, but monitor is initiated at the same time
 and the resource is identified as not running. This behaviour makes
 sense since we realize that the daemon starting takes some time. In this
 particular case, I get error code 2 which means that process is running,
 but environment is not locked. The effect of this is that httpd resource
 gets restarted.
 
 My workaround is an extra sleep in the status function of the init script, but
 I don't like this solution at all! Do you have an idea how to tackle this
 problem in a proper way? I expected an op attribute which would specify a
 delay between service start and first monitoring, but I could not find
 it..
 
 Thank you, Milos
> 
> 




Re: [ClusterLabs] [ClusterLabs Developers] Resource Agent language discussion

2015-08-16 Thread Andrew Beekhof

> On 11 Aug 2015, at 5:34 pm, Jehan-Guillaume de Rorthais  
> wrote:
> 
> On Tue, 11 Aug 2015 11:30:03 +1000
> Andrew Beekhof  wrote:
> 
>> 
>>> On 8 Aug 2015, at 1:14 am, Jehan-Guillaume de Rorthais 
>>> wrote:
>>> 
>>> Hi Jan,
>>> 
>>> On Fri, 7 Aug 2015 15:36:57 +0200
>>> Jan Pokorný  wrote:
>>> 
 On 07/08/15 12:09 +0200, Jehan-Guillaume de Rorthais wrote:
> Now, I would like to discuss about the language used to write a RA in
> Pacemaker. I never seen discussion or page about this so far.
 
 it wasn't in such a "heretic :)" tone, but I tried to show that it
 is extremely hard (if not impossible in some instances) to write
 bullet-proof code in bash (or POSIX shell, for that matter) because
 it's so cumbersome to move from "whitespace-delimited words as
 a single argument" and "words as standalone arguments" back and forth,
 connected with quotation-desired/-counterproductive madness
 (what if one wants to indeed pass quotation marks as legitimate
 characters within the passed value, etc.) few months back:
 
 http://clusterlabs.org/pipermail/users/2015-May/000403.html
 (even on developers list, but with fewer replies and broken threading:
 http://clusterlabs.org/pipermail/developers/2015-May/23.html).
>>> 
>>> Thanks for the links and history. You add some more argument to my points :)
>>> 
> HINT: I don't want to discuss (neither troll about) what is the best
> language. I would like to know why **ALL** the RA are written in
> bash
 
 I would expect the original influence were the init scripts (as RAs
 are mostly just enriched variants to support more flexible
 configuration and better diagnostics back to the cluster stack),
 which in turn were born having simplicity and ease of debugging
 (maintainability) in mind.
>>> 
>>> That sounds legitimate. And bash is still appropriate for some simple RA.
>>> 
>>> But for the same ease of code debugging and maintainability arguments (and
>>> many others), complex RAs shouldn't use shell as the language.
>> 
>> You can and should use whatever language you like for your own private RAs.
>> But if you want it accepted and maintained by the resource-agents project,
>> you would be advised to use the language they have standardised on.
> 
> Well, let's imagine our RA was written in bash (in fact, we have a bash 
> version
> pretty close to the current perl version we abandoned). I wonder if it would 
> be
> accepted in the resource-agents project anyway as another one already exists
> there. I can easily list the reasons we rewrote a new one, but this is not the
> subject here.
> 
> The discussion here is more about the language, if I should extract a
> ocf-perl-module from my RA and if there is any chance the resource-agents
> project would accept it.

Well, it depends on the reasons you didn’t list :-)

The first questions any maintainer is going to ask are:
- why did you write a new one?
- can we merge this with the old one?
- can the new one replace the old one? (ie. full superset)

Because if both are included, then they will forevermore be answering the 
question “which one should I use?”.

Basically, if you want it accepted upstream, then yes, you probably want to 
ditch the perl bit.
But not having seen the agent or knowing why it exists, it's hard to say.


Re: [ClusterLabs] circumstances under which resources become unmanaged

2015-08-16 Thread Andrew Beekhof

> On 13 Aug 2015, at 2:27 pm, N, Ravikiran  wrote:
> 
> Thanks for reply Andrei. What happens to the resources added with a 
> COLOCATION or an ORDER constraint with this resource (unmanaged FAILED 
> resource).. ? will the constraint be removed.. ?

the resource is considered stopped for the purposes of colocation and ordering
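
Once the underlying stop failure is fixed, recovery is usually just a matter of
clearing the failure so the state is re-probed (pcs syntax, resource name taken
from this thread):

    pcs resource cleanup cmsd    # forget the failed stop and re-probe the resource
    pcs resource manage cmsd     # only needed if it was also explicitly unmanaged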

> 
> Also please point me to any resource to understand this in detail.
> 
> Regards
> Ravikiran
> 
> -Original Message-
> From: Andrei Borzenkov [mailto:arvidj...@gmail.com] 
> Sent: Thursday, August 13, 2015 9:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] circumstances under which resources become 
> unmanaged
> 
> 
> 
> On 12.08.2015 20:46, N, Ravikiran wrote:
>> Hi All,
>> 
>> I have a resource added to pacemaker called 'cmsd' whose state is getting to 
>> 'unmanaged FAILED' state.
>> 
>> Apart from manually changing the resource to unmanaged using "pcs resource 
>> unmanage cmsd" , I'm trying to understand under what all circumstances a 
>> resource can become unmanaged.. ?
>> I have not set any value for the "multiple-active" field, which means by 
>> default it is set to "stop-start", and hence I believe the resource can 
>> never go to unmanaged if it finds the resource active on more than one node.
>> 
> 
> unmanaged FAILED means pacemaker (or rather the resource agent) failed to stop 
> the resource. At this point the resource state is undefined, so pacemaker won't do 
> anything with it.
> 
>> Also, it would be more helpful if anyone can point out to specific sections 
>> of the pacemaker manuals for the answer.
>> 
>> Regards,
>> Ravikiran
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 




Re: [ClusterLabs] Antw: Ordering constraint restart second resource group

2015-08-16 Thread Andrew Beekhof

> On 13 Aug 2015, at 7:33 pm, Andrei Borzenkov  wrote:
> 
> On Thu, Aug 13, 2015 at 11:25 AM, Ulrich Windl wrote:
>> And what exactly is your problem?
> 
> Real life example. Database resource depends on storage resource(s).
> There are multiple filesystems/volumes with database files. Database
> admin needs to increase available space. You add new storage,
> configure it in cluster ... pooh, your database is restarted.

“configure it in cluster” hmmm

if you’re expanding an existing mount point, then I’d expect you don’t need to 
update the cluster.
if you’re creating a new mount point, wouldn’t you need to take the db down in 
order to point to the new location?


> There is
> zero need to restart database because it does not even use new
> resource yet.
> 
> I do the above routinely with other cluster implementation without any
> visible impact.
> 
>> If you change a resource, it will be
>> restarted, and if a resource is restarted, constraints will be followed...
>> 
>> Despite of that: If I understand your configuration correctly, it's very much
>> the same as
>> 
>> resource_group
>>  ip1
>>  ip2
>>  apache1
>> 
>> Regards,
>> Ulrich
>> 
>>> John Gogu wrote on 12.08.2015 at 18:35 in message:
>>> Hello,
>>> in my cluster configuration I have following situation:
>>> 
>>> resource_group_A
>>>   ip1
>>>   ip2
>>> resource_group_B
>>>   apache1
>>> 
>>> ordering constraint resource_group_A then resource_group_B symmetrical=true
>>> 
>>> When I add a new resource from group_A, resources from group_B are
>>> restarted. If I remove constraint all ok but I need to keep this ordering
>>> constraint.
>>> 
>>> 
>>> John
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 




Re: [ClusterLabs] systemd: xxxx.service start request repeated too quickly

2015-08-16 Thread Andrew Beekhof

> On 6 Aug 2015, at 11:59 pm, Juha Heinanen  wrote:
> 
> Ken Gaillot writes:
> 
 Also, I want to add some delay to the restart attempts so that systemd
 does not complain about too quick restarts.
>>> 
>>> This is outside of pacemaker control. "Service respawning too rapidly"
>>> means systemd itself attempts to restart it. You need to modify
>>> service definition in systemd to either disable restart on failure
>>> completely and let pacemaker manage it or at least add delay before
>>> restarts. See man systemd.service, specifically RestartSec and Restart
>>> parameters.
> 
> The service in question only has an old-style init.d file inherited from
> Debian Wheezy and I don't have any restart definition in it.  Based on the
> systemd.service man page, Restart value defaults to 'no'.  So I'm not
> sure if it is systemd that is automatically restarting the service too
> rapidly.


Well it's systemd that's printing the message, so it's involved somehow.
What does the resource definition for that resource look like in pacemaker?
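
For completeness, if systemd's own rate limiting is what gets in the way, a
drop-in along these lines hands restarting entirely to Pacemaker (the unit name
is hypothetical; see systemd.service(5) for the Restart and StartLimit*
settings, which older releases keep in the [Service] section):

    mkdir -p /etc/systemd/system/foo.service.d
    printf '[Service]\nRestart=no\nStartLimitInterval=0\n' \
        > /etc/systemd/system/foo.service.d/pacemaker.conf
    systemctl daemon-reload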



Re: [ClusterLabs] Antw: Delayed first monitoring

2015-08-16 Thread Andrew Beekhof

> On 13 Aug 2015, at 5:01 pm, Miloš Kozák  wrote:
> 
> However,
> this does not make sense at all. Presumably, pacemaker should get along 
> with lsb scripts which come from the system repository, right?

Explicitly no.
We get along only with /LSB compliant/ init scripts.  

Not all meet this criteria.  
Debian’s init scripts were some of the biggest offenders for many many years. 

A program such as Pacemaker needs (for example) sane return codes, for start to 
actually complete before returning, and for starting something that's already 
started not to be an error. 
A human can gloss over these things; Pacemaker is not quite smart enough to 
know when these kinds of errors are ok.
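
As a sketch of what that means for a start action (illustrative only;
daemon_start and status stand in for whatever the real script provides, and a
proper agent should follow the LSB/OCF specs):

    start() {
        status >/dev/null 2>&1 && return 0   # already running is success, not an error
        daemon_start                         # hypothetical: launch the service
        while ! status >/dev/null 2>&1; do   # do not return until it is really up
            sleep 1
        done
        return 0
    }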


> 
> Therefore, there is no way to modify the lsb script, because changes to the lsb 
> script are erased after every package update.
> 
> 
> I believe the systematic approach is to introduce delayed monitoring 
> or something like this into Pacemaker. I rather wonder that nobody has run 
> into this problem already?
> 
> 
> Milos
> 
> 
> 
> 
> 
> On 13.8.2015 at 08:44, Ulrich Windl wrote:
>> I think the start script has to be fixed to return success when httpd is
>> actually running.
>> 
>>> Miloš Kozák wrote on 12.08.2015 at 16:03 in message
>>> <55cb521a.8090...@lejmr.com>:
>>> Hi,
>>> 
>>> I have set up CoroSync+CMAN+Pacemaker on CentOS 6.5 in order to
>>> provide high-availability of opennebula. However, I am facing a
>>> strange problem which arises from my lack of knowledge..
>>> 
>>> In the log I can see that when I create a resource based on an init
>>> script, typically:
>>> 
>>> pcs resource create httpd lsb:httpd
>>> 
>>> The httpd daemon gets started, but monitor is initiated at the same time
>>> and the resource is identified as not running. This behaviour makes
>>> sense since we realize that the daemon starting takes some time. In this
>>> particular case, I get error code 2 which means that process is running,
>>> but environment is not locked. The effect of this is that httpd resource
>>> gets restarted.
>>> 
>>> My workaround is an extra sleep in the status function of the init script, but
>>> I don't like this solution at all! Do you have an idea how to tackle this
>>> problem in a proper way? I expected an op attribute which would specify a
>>> delay between service start and first monitoring, but I could not find it..
>>> 
>>> Thank you, Milos
>>> 
>>> 
>> 
>> 
>> 
> 
> 




Re: [ClusterLabs] stonithd: stonith_choose_peer: Couldn't find anyone to fence with

2015-08-16 Thread Andrew Beekhof

> On 13 Aug 2015, at 9:39 pm, Kostiantyn Ponomarenko wrote:
> 
> Hi,
> 
> Brief description of the STONITH problem: 
> 
> I see two different behaviors with two different STONITH configurations. If 
> Pacemaker cannot find a device that can STONITH a problematic node, the node 
> remains up and running. Which is bad, because it must be STONITHed.
> As opposite to it, if Pacemaker finds a device that, it thinks, can STONITH a 
> problematic node, even if the device actually cannot,

You left out “but the device reports that it did”.  Your fencing agent needs 
to report the truth. 

> Pacemaker goes down after STONITH returns false positive. The Pacemaker 
> shutdowns itself right after STONITH.
> Is it the expected behavior?

Yes, it's a safety check:

Aug 11 16:09:53 [9009] A6-4U24-402-T   crmd: crit: tengine_stonith_notify:  We were alegedly just fenced by node-0 for node-0!

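For reference, the "runs only on the other node" placement described in the
set-up below is usually expressed with location constraints, so that a node
never has to rely on fencing itself (crmsh syntax, names from this thread):

    crm configure location l-fence-node-0 STONITH_node-0 -inf: node-0
    crm configure location l-fence-node-1 STONITH_node-1 -inf: node-1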

> Do I need to configure a two more STONITH agents for just rebooting nodes on 
> which they are running (e.g. with # reboot -f)?
> 
> 
> 
> +-
> + Set-up:
> +-
> - two node cluster (node-0 and node-1);
> - two fencing (STONITH) agents are configured (STONITH_node-0 and 
> STONITH_node-1).
> - "STONITH_node-0" runs only on "node-1" // this fencing agent can only fence 
> node-0
> - "STONITH_node-1" runs only on "node-0" // this fencing agent can only fence 
> node-1
> 
> +-
> + Environment:
> +-
> - one node - "node-0" - is up and running;
> - one STONITH agent - "STONITH_node-1" - is up and running
> 
> +-
> + Test case:
> +-
> Simulate error of stopping a resource.
> 1. start cluster
> 2. change a RA's script to return "$OCF_ERR_GENERIC" from "Stop" function.
> 3. stop the resource by "# crm resource stop "
> 
> +-
> + Actual behavior:
> +-
> 
> CASE 1:
> STONITH is configured with:
> # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
> params pcmk_host_list="node-1" pcmk_host_check="static-list"
> 
> After issuing a "stop" command:
> - the resource changes its state to "FAILED"
> - Pacemaker remains working
> 
> See below LOG_snippet_1 section. 
> 
> 
> CASE 2:
> STONITH is configured with:
> # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw
> 
> After issuing a "stop" command:
> - the resource changes its state to "FAILED"
> - Pacemaker stops working
> 
> See below LOG_snippet_2 section.
> 
> 
> +-
> + LOG_snippet_1:
> +-
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: handle_request:   
>   Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device 
> '(any)'
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: 
> initiate_remote_stonith_op: Initiating remote operation reboot for 
> node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
> 
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: 
> can_fence_host_with_device: STONITH_node-1 can not fence (reboot) node-0: 
> static-list
> 
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: 
> stonith_choose_peer:Couldn't find anyone to fence node-0 with 
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd: info: 
> call_remote_stonith:Total remote op timeout set to 60 for fencing of node 
> node-0 for crmd.39210.18cc29db
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd: info: 
> call_remote_stonith:None of the 1 peers have devices capable of 
> terminating node-0 for crmd.39210 (0)
> 
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:  warning: get_xpath_object: 
>   No match for //@st_delegate in /st-reply
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:error: remote_op_done:   
>   Operation reboot of node-0 by node-0 for crmd.39210@node-0.18cc29db: No 
> such device
> 
> Aug 12 16:42:47 [39210] A6-4U24-402-T   crmd:   notice: 
> tengine_stonith_callback:   Stonith operation 
> 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
> Aug 12 16:42:47 [39210] A6-4U24-402-T   crmd:   notice: 
> tengine_stonith_callback:   Stonith operation 3 for node-0 failed (No such 
> device): aborting transition.
> Aug 12 16:42:47 [39210] A6-4U24-402-T   crmd: info: 
> abort_transition_graph: Transition aborted: Stonith failed 
> (source=tengine_stonith_callback:697, 0)
> Aug 12 16:42:47 [39210] A6-4U24-402-T   crmd:   notice: 
> tengine_stonith_notify: Peer node-0 was not terminated (reboot) by node-0 
> for node-0: No such device
> 
> 
> +-
> + LOG_snippet_2:
> +-
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: handle_request:  
> Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device '(any)'
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: 
> initiate_rem

Re: [ClusterLabs] stonithd: stonith_choose_peer: Couldn't find anyone to fence with

2015-08-16 Thread Andrew Beekhof

> On 13 Aug 2015, at 10:36 pm, Kostiantyn Ponomarenko wrote:
> 
> > Then make sure it can be stonithd. Add additional stonith agent using
> > independent communication channel.
> 
> Not possible. Only one node up and running in the cluster and I am wondering 
> - can it STONITH itself?

Recent versions allow this depending on what the configured fencing devices 
report.

> Because most likely, after reboot, the problem can be gone.
> 
> > I have no idea what fence_sbb_hw is or does
> 
> That just reboots the peer. It is our specific STONITH agent.
> 
> > What this node does by itself really does not matter.
> 
> What if at some point there is only one node in the cluster?
> In the solution am I working on there are two nodes form the cluster.
> And it is possible to use this solution even with only one node.
> 
> I am satisfied with the "CASE 2" where Pacemaker shuts itself down after 
> calling STONITH, despite the fact that the stonith agent didn't reboot "the needed node" 
> but returned a false positive.
> The only question is why this doesn't happen in "CASE 1"?
>  
> 
> Thank you,
> Kostya




Re: [ClusterLabs] Memory leak in crm_mon ?

2015-08-16 Thread Andrew Beekhof

> On 16 Aug 2015, at 9:41 pm, Attila Megyeri  wrote:
> 
> Hi Andrew,
> 
> I managed to isolate / reproduce the issue. You might want to take a look, as 
> it might be present in 1.1.12 as well.
> 
> I monitor my cluster from putty, mainly this way:
> - I have a putty (Windows client) session, that connects via SSH to the box, 
> authenticates using public key as a non-root user.
> - It immediately sends a "sudo crm_mon -Af" command, so with a single click I 
> have a nice view of what the cluster is doing.

Perhaps add -1 to the option list.
The root cause seems to be that closing the putty window doesn’t actually kill 
the process running inside it.
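
For example (same -A/-f output, but the process exits after printing, so nothing
is left behind when the window goes away; wrap it in watch for a refreshing
view, assuming passwordless sudo as in the setup described):

    sudo crm_mon -1 -Af
    watch -n5 'sudo crm_mon -1 -Af'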

> 
> Whenever I close this putty window (terminate the app), crm_mon process gets 
> to 100% cpu usage, starts to leak, in a few hours consumes all memory and 
> then destroys the whole cluster.
> This does not happen if I leave crm_mon with Ctrl-C.
> 
> I can reproduce this 100% with crm_mon 1.1.10, with the mainstream ubuntu 
> trusty packages.
> This might be related to how sudo executes crm_mon, and what it signals to 
> crm_mon when it gets terminated.
> 
> Now I know what I need to pay attention to in order to avoid this problem, 
> but you might want to check whether this issue is still present.
> 
> 
> Thanks,
> Attila 
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
> Sent: Friday, August 14, 2015 12:40 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
> 
> 
> 
> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net] 
> Sent: Tuesday, August 11, 2015 2:49 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Subject: Re: [ClusterLabs] Memory leak in crm_mon ?
> 
> 
>> On 10 Aug 2015, at 5:33 pm, Attila Megyeri  wrote:
>> 
>> Hi!
>> 
>> We are building a new cluster on top of pacemaker/corosync and several times 
>> during the past days we noticed that „crm_mon -Af” used up all the 
>> memory+swap and caused high CPU usage. Killing the process solves the issue.
>> 
>> We are using the binary package versions available in the latest ubuntu 
>> trusty, namely:
>> 
>> crmsh                  1.2.5+hg1034-1ubuntu4
>> pacemaker              1.1.10+git20130802-1ubuntu2.3
>> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.3
>> corosync               2.3.3-1ubuntu1
>> 
>> Kernel is 3.13.0-46-generic
>> 
>> Looking back some „atop” data, the CPU went to 100% many times during the 
>> last couple of days, at various times, more often around midnight exactly 
>> (strange).
>> 
>> 08.05 14:00
>> 08.06 21:41
>> 08.07 00:00
>> 08.07 00:00
>> 08.08 00:00
>> 08.09 06:27
>> 
>> Checked the corosync log and syslog, but did not find any correlation 
>> between the entries in the logs around the specific times.
>> For most of the time, the node running the crm_mon was the DC as well – not 
>> running any resources (e.g. a pairless node for quorum).
>> 
>> 
>> We have another running system, where everything works perfectly, whereas it 
>> is almost the same:
>> 
>> crmsh                  1.2.5+hg1034-1ubuntu4
>> pacemaker              1.1.10+git20130802-1ubuntu2.1
>> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.1
>> corosync               2.3.3-1ubuntu1
>> 
>> Kernel is 3.13.0-8-generic
>> 
>> 
>> Is this perhaps a known issue?
> 
> Possibly, that version is over 2 years old.
> 
>> Any hints?
> 
> Getting something a little more recent would be the best place to start
> 
> Thanks Andrew,
> 
> I tried to upgrade to 1.1.12 using the packages available at 
> https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a 
> single node, to see how it works out, but I ended up with errors like
> 
> Could not establish cib_rw connection: Connection refused (111)
> 
> I have disabled the firewall, no changes. The node appears to be running but 
> does not see any of the other nodes. On the other nodes I see this node as an 
> UNCLEAN one. (I assume corosync is fine, but pacemaker not)
> I use udpu for the transport.
> 
> Am I doing something wrong? I tried to look for some howtos on upgrade, but 
> the only thing I found was the rather outdated   
> http://clusterlabs.org/wiki/Upgrade
> 
> Could you please direct me to some howto/guide on how to perform the upgrade?
> 
> Or am I facing some compatibility issue, so I should extract the whole cib, 
> upgrade all nodes and reconfigure the cluster from the scratch? (The cluster 
> is meant to go live in 2 days... :) )
> 

Re: [ClusterLabs] Memory leak in crm_mon ?

2015-08-16 Thread Attila Megyeri
Hi Andrew,

I managed to isolate / reproduce the issue. You might want to take a look, as 
it might be present in 1.1.12 as well.

I monitor my cluster from putty, mainly this way:
- I have a putty (Windows client) session, that connects via SSH to the box, 
authenticates using public key as a non-root user.
- It immediately sends a "sudo crm_mon -Af" command, so with a single click I 
have a nice view of what the cluster is doing.

Whenever I close this putty window (terminate the app), crm_mon process gets to 
100% cpu usage, starts to leak, in a few hours consumes all memory and then 
destroys the whole cluster.
This does not happen if I leave crm_mon with Ctrl-C.

I can reproduce this 100% with crm_mon 1.1.10, with the mainstream ubuntu 
trusty packages.
This might be related to how sudo executes crm_mon, and what it signals to 
crm_mon when it gets terminated.

Now I know what I need to pay attention to in order to avoid this problem, but 
you might want to check whether this issue is still present.


Thanks,
Attila 






-Original Message-
From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
Sent: Friday, August 14, 2015 12:40 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Memory leak in crm_mon ?



-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: Tuesday, August 11, 2015 2:49 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Memory leak in crm_mon ?


> On 10 Aug 2015, at 5:33 pm, Attila Megyeri  wrote:
> 
> Hi!
>  
> We are building a new cluster on top of pacemaker/corosync and several times 
> during the past days we noticed that „crm_mon -Af” used up all the 
> memory+swap and caused high CPU usage. Killing the process solves the issue.
>  
> We are using the binary package versions available in the latest ubuntu 
> trusty, namely:
>  
> crmsh                  1.2.5+hg1034-1ubuntu4
> pacemaker              1.1.10+git20130802-1ubuntu2.3
> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.3
> corosync               2.3.3-1ubuntu1
>  
> Kernel is 3.13.0-46-generic
>  
> Looking back some „atop” data, the CPU went to 100% many times during the 
> last couple of days, at various times, more often around midnight exactly 
> (strange).
>  
> 08.05 14:00
> 08.06 21:41
> 08.07 00:00
> 08.07 00:00
> 08.08 00:00
> 08.09 06:27
>  
> Checked the corosync log and syslog, but did not find any correlation between 
> the entries in the logs around the specific times.
> For most of the time, the node running the crm_mon was the DC as well – not 
> running any resources (e.g. a pairless node for quorum).
>  
>  
> We have another running system, where everything works perfectly, whereas it 
> is almost the same:
>  
> crmsh                  1.2.5+hg1034-1ubuntu4
> pacemaker              1.1.10+git20130802-1ubuntu2.1
> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.1
> corosync               2.3.3-1ubuntu1
>  
> Kernel is 3.13.0-8-generic
>  
>  
> Is this perhaps a known issue?

Possibly, that version is over 2 years old.

> Any hints?

Getting something a little more recent would be the best place to start

Thanks Andrew,

I tried to upgrade to 1.1.12 using the packages available at 
https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a 
single node, to see how it works out, but I ended up with errors like

Could not establish cib_rw connection: Connection refused (111)

I have disabled the firewall, no changes. The node appears to be running but 
does not see any of the other nodes. On the other nodes I see this node as an 
UNCLEAN one. (I assume corosync is fine, but pacemaker not)
I use udpu for the transport.

Am I doing something wrong? I tried to look for some howtos on upgrade, but the 
only thing I found was the rather outdated   http://clusterlabs.org/wiki/Upgrade

Could you please direct me to some howto/guide on how to perform the upgrade?

Or am I facing some compatibility issue, so I should extract the whole cib, 
upgrade all nodes and reconfigure the cluster from the scratch? (The cluster is 
meant to go live in 2 days... :) )

Thanks a lot in advance




>  
> Thanks!

