Re: [Pacemaker] crm resource doesn't move after hardware crash

2014-04-01 Thread Beo Banks
hi,

The KVM guests are on different KVM hosts.


2014-03-24 0:30 GMT+01:00 Andrew Beekhof and...@beekhof.net:


 On 21 Mar 2014, at 11:11 pm, Beo Banks beo.ba...@googlemail.com wrote:

  Yap, and that's my issue.
 
  Stonith is very powerful, but how can the cluster handle a hardware failure?

 by connecting to the switch that supplies power to said hardware. This is
 exactly the reason devices like fence_virsh and external/ssh are not
 considered reliable.

 are both these VMs running on the same physical hardware?

 
  primitive stonith-linux01 stonith:fence_virsh \
  params pcmk_host_list=linux01 pcmk_host_check=dynamic-list \
    pcmk_host_map=linux01:linux01 action=reboot ipaddr=XX \
    secure=true login=root identity_file=/root/.ssh/id_rsa \
    debug=/var/log/stonith.log verbose=false \

 you don't need the host map if the name and value (name:value) are the same

  op monitor interval=300s \
  op start interval=0 timeout=60s \
  meta failure-timeout=180s
  primitive stonith-linux02 stonith:fence_virsh \
  params pcmk_host_list=linux02 pcmk_host_check=dynamic-list \
    pcmk_host_map=linux02:linux02 action=reboot ipaddr=X \
    secure=true login=root identity_file=/root/.ssh/id_rsa delay=5 \
    debug=/var/log/stonith.log verbose=false \
  op monitor interval=60s \
  op start interval=0 timeout=60s \
  meta failure-timeout=180s
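
Following Andrew's note above, a minimal sketch of the first stonith primitive
with the redundant pcmk_host_map dropped (everything else unchanged from the
poster's config, the ipaddr still redacted) might look like:

    primitive stonith-linux01 stonith:fence_virsh \
      params pcmk_host_list=linux01 pcmk_host_check=dynamic-list \
        action=reboot ipaddr=XX secure=true login=root \
        identity_file=/root/.ssh/id_rsa debug=/var/log/stonith.log verbose=false \
      op monitor interval=300s \
      op start interval=0 timeout=60s \
      meta failure-timeout=180s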
 
 
 
 
  2014-03-18 13:54 GMT+01:00 emmanuel segura emi2f...@gmail.com:
  do you have stonith configured?
 
 
  2014-03-18 13:07 GMT+01:00 Alex Samad - Yieldbroker 
 alex.sa...@yieldbroker.com:
  I'm no expert, but
 
 
 
  Current DC: linux02 - partition WITHOUT quorum
  Version: 1.1.10-14.el6_5.2-368c726
  2 Nodes configured, 2 expected votes
 
 
 
 
  I think your 2nd node can't make quorum; there is some special config for a
 2-node cluster to allow the nodes to make quorum with 1 vote.
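
For reference, there are two common ways to let a two-node cluster run without
quorum; which one applies depends on the stack, and neither is taken from the
poster's configuration, so treat this as a sketch:

    # On the plugin-based stack shown in the status output, tell Pacemaker
    # to keep running when quorum is lost:
    crm configure property no-quorum-policy=ignore

    # On corosync 2.x with votequorum, the equivalent lives in corosync.conf:
    quorum {
        provider: corosync_votequorum
        two_node: 1
    }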
 
 
 
  A
 
 
 
  From: Beo Banks [mailto:beo.ba...@googlemail.com]
  Sent: Tuesday, 18 March 2014 10:06 PM
  To: pacemaker@oss.clusterlabs.org
  Subject: [Pacemaker] crm resource doesn't move after hardware crash
 
 
 
  hi,
 
  I have a hardware crash in a two-node DRBD cluster.
 
  The active node has a hardware failure and is currently down.
 
  I am wondering why my 2nd node doesn't migrate/move the resources.
 
  The 2nd node wants to fence the failed node, but that's not possible (it's down).
 
 
  How can I enable the services on the last good node?
 
  And how can I optimize my config to handle that kind of error?
 
  crm status
 
  Last updated: Tue Mar 18 12:01:07 2014
  Last change: Tue Mar 18 11:28:22 2014 via crmd on linux02
  Stack: classic openais (with plugin)
  Current DC: linux02 - partition WITHOUT quorum
  Version: 1.1.10-14.el6_5.2-368c726
  2 Nodes configured, 2 expected votes
  21 Resources configured
 
 
  Node linux01: UNCLEAN (offline)
  Online: [ linux02 ]
 
   Resource Group: mysql
   mysql_fs   (ocf::heartbeat:Filesystem):    Started linux01
   mysql_ip   (ocf::heartbeat:IPaddr2):       Started linux01
 
   and so on
 
 
 
  cluster.log
 
 
  Mar 18 11:54:43 [2234] linux02   crmd:   notice:
 tengine_stonith_callback:  Stonith operation 17 for linux01 failed
 (Timer expired): aborting transition.
  Mar 18 11:54:43 [2234] linux02   crmd: info:
 abort_transition_graph:tengine_stonith_callback:463 - Triggered
 transition abort (complete=0) : Stonith failed
  Mar 18 11:54:43 [2234] linux02   crmd:   notice: run_graph:
 Transition 15 (Complete=9, Pending=0, Fired=0, Skipped=36, Incomplete=19,
 Source=/var/lib/pacemaker/pengine/pe-warn-63.bz2): Stopped
  Mar 18 11:54:43 [2234] linux02   crmd:   notice:
 too_many_st_failures:  Too many failures to fence linux01 (16), giving up
  Mar 18 11:54:43 [2234] linux02   crmd: info: do_log:FSA:
 Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
  Mar 18 11:54:43 [2234] linux02   crmd:   notice:
 do_state_transition:   State transition S_TRANSITION_ENGINE - S_IDLE [
 input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
  Mar 18 11:54:43 [2230] linux02 stonith-ng: info: stonith_command:
 Processed st_notify reply from linux02: OK (0)
  Mar 18 11:54:43 [2234] linux02   crmd:   notice:
 tengine_stonith_notify:Peer linux01 was not terminated (reboot) by
 linux02 for linux02: Timer expired
 (ref=7939b264-699c-4d00-a89c-07e7e0193a80) by client crmd.2234
  Mar 18 11:54:44 [2229] linux02cib: info: crm_client_new:
Connecting 0x155ac00 for uid=0 gid=0 pid=23360
 id=b88b2690-0c3f-48ac-b8b4-3a47b7f9114a
  Mar 18 11:54:44 [2229] linux02cib: info:
 cib_process_request:   Completed cib_query operation for section 'all': OK
 (rc=0, origin=local/crm_mon/2, version=0.125.2)
  Mar 18 11:54:44 [2229] linux02cib: info: crm_client_destroy:
Destroying 0 events
  Mar 18 11:55:03 [2229] linux02cib: info: crm_client_new:
Connecting 0x155ac00 for uid=0 gid=0 pid=23415
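
Given the "Too many failures to fence linux01" messages above, one commonly
suggested way to get the surviving node moving again, assuming you have
physically verified that linux01 is powered off, is to acknowledge the fencing
by hand so the cluster stops waiting for it; a sketch, not taken from this
thread:

    # Tell the cluster that linux01 is already safely down
    # (only do this if you are certain it is):
    stonith_admin --confirm linux01

    # Or, with the crm shell:
    crm node clearstate linux01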
 

[Pacemaker] failed actions are not removed

2014-04-01 Thread Attila Megyeri
Hi Andrew, all,

We use Pacemaker 1.1.10 with corosync 2.2.3, and we notice that failed actions
are not reset after the cluster-recheck interval has elapsed.
Is this a known issue, or shall I provide some more details?

It worked properly in previous setups; we have no idea what could be causing
the issue here.


Some background:

In properties:

cluster-recheck-interval=2m \

In the relevant resources:

primitive jboss_imssrv2 ocf:heartbeat:jboss \
    params shutdown_timeout=10 user=jboss \
    op start interval=0 timeout=60s on-fail=restart \
    op monitor interval=10s timeout=90s on-fail=restart \
    op stop interval=0 timeout=120s on-fail=block \
    meta migration-threshold=5 failure-timeout=2m target-role=Started
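
To see whether such failures are merely still being displayed or still count
against migration-threshold, and to clear them by hand, something like the
following can be used (a sketch; the resource name is taken from the config
above):

    # Show per-resource fail counts alongside the cluster status:
    crm_mon -1 --failcounts

    # Clear the resource's failure history on all nodes:
    crm resource cleanup jboss_imssrv2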


Thanks!

Attila


Re: [Pacemaker] failed actions are not removed

2014-04-01 Thread Lars Marowsky-Bree
On 2014-04-01T14:41:11, Attila Megyeri amegy...@minerva-soft.com wrote:

 Hi Andrew, all,
 
 We use Pacemaker 1.1.10 with corosync 2.2.3, and we notice that failed actions
 are not reset after the cluster-recheck interval has elapsed.
 Is this a known issue, or shall I provide some more details?

What have you set failure-timeout to?

Are they just still being shown, or are they still having an impact on your
resource placement too?

If you can provide a CIB for this scenario, it's easier to answer.
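
A CIB dump for the scenario can be captured with something like the following
(a sketch; cib.xml is an arbitrary file name):

    # Dump the live CIB to a file:
    cibadmin --query > cib.xml

    # Or collect a full report including logs around the failure window:
    crm_report --from "2014-04-01 00:00:00"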


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

