Re: [Pacemaker] Pacemaker Corosync Issue

2014-10-16 Thread Andrew Beekhof

On 16 Oct 2014, at 6:33 pm, Sahil Aggarwal sahilaggarw...@gmail.com wrote:

 Hello , 
 
 Yes, that log might be due to that reason, but it should not ignore the
 resource, because it is then taking no action for that resource, i.e. it is
 not starting the resource.

it doesn't know that at the time

 
 And a second thing:
 
 Generally, the "ignoring expired failure" log comes as
  notice: unpack_rsc_op: Ignoring expired failure Server_last_failure_0
 
 but in the case where the service is ignored, the log comes as
  notice: unpack_rsc_op: Ignoring expired failure (calculated) Server_last_failure_0
 
 This might be a different case.

possibly in the old code, but the latest has them combined
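
For anyone grepping logs for either variant, here is a minimal sketch of a pattern that accepts the message with or without the "(calculated)" marker. The regex and the function name parse_expired_failure are illustrative assumptions built only from the two sample lines quoted in this thread; they are not taken from the Pacemaker source and may not cover other versions' wording.

```python
import re

# Matches both forms seen in this thread:
#   Ignoring expired failure Server_last_failure_0
#   Ignoring expired failure (calculated) Server_last_failure_0 (rc=7, ...)
EXPIRED_RE = re.compile(
    r"Ignoring expired failure\s+(?P<calculated>\(calculated\)\s+)?"
    r"(?P<op_key>\S+)(?:\s+\(rc=(?P<rc>\d+))?"
)


def parse_expired_failure(line):
    """Return a dict describing the message, or None if the line does not match."""
    m = EXPIRED_RE.search(line)
    if m is None:
        return None
    return {
        "op_key": m.group("op_key"),
        "calculated": m.group("calculated") is not None,
        "rc": int(m.group("rc")) if m.group("rc") else None,
    }


if __name__ == "__main__":
    print(parse_expired_failure(
        "notice: unpack_rsc_op: Ignoring expired failure (calculated) "
        "Server_last_failure_0 (rc=7, "
        "magic=0:7;14:5699:0:459093cc-f3a1-483b-b853-53a1d9791361)"
    ))
```

In the OCF return-code convention, rc=7 means "not running", which fits the service being down when the failure was recorded.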

 
 Please suggest.
 
  
 
 On Thu, Oct 16, 2014 at 2:38 AM, Andrew Beekhof and...@beekhof.net wrote:
 You don't think that might be a little short?
 Any failure that happened more than 10s ago is going to be ignored, leading to
 the pengine message you saw.
 
 On 16 Oct 2014, at 12:21 am, Sahil Aggarwal sahilaggarw...@gmail.com wrote:
 
  failure timeout for resource is 10s.
 
  On Wed, Oct 15, 2014 at 2:51 AM, Andrew Beekhof and...@beekhof.net wrote:
 
  On 15 Oct 2014, at 4:23 am, Sahil Aggarwal sahilaggarw...@gmail.com wrote:
 
  
   Hello Team Pacemaker,
  
   I am facing a constant issue with Pacemaker: it does not restart the
   service even when it knows that the service is down. It generates a
   message saying "Ignoring expired failure" for the service.
 
  What is the failure timeout set to?
 
   Pacemaker and Corosync versions are given below. The OS is CentOS 6.2.
  
   corosync-1.4.1-4.el6_2.2.x86_64 pacemaker-1.1.9-2.el6.x86_64
  
   The log that pengine provides is:
  
   pengine[45232]:   notice: unpack_rsc_op: Ignoring expired failure (calculated) Server_last_failure_0 (rc=7, magic=0:7;14:5699:0:459093cc-f3a1-483b-b853-53a1d9791361)
  
   Some more info:
  
   1. This is a two-node cluster. There is a time difference of 10 minutes between the
   two nodes.
  
  
   --
   Regards,
   Sahil
   Mobile - 09467607999
   fbAddress-www.facebook.com/SahilAggarwalg
 
 
 
 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker Corosync Issue

2014-10-16 Thread Andrew Beekhof

On 16 Oct 2014, at 7:56 pm, Sahil Aggarwal sahilaggarw...@gmail.com wrote:

 Sorry, I didn't get your point, so I am restating the problem:
 
 Two-node cluster: Node A, Node B.
 
 Service X is running on Node A; Node B is the DC.
 
 We are using the corosync stack with Pacemaker.
 Failure timeout is 10 sec.
 Target-role is Started.
 
 Events happen like this:
   • Node A sends an event to Node B: Service X is down
   • Node B prints "Ignoring expired failure" for Service X
   • After this, Service X is never restarted by the cluster.
 
 
 Now questions are:
 
   • Why is Node B (DC) ignoring the expired failure?

Because you told it to

   • Even if the DC ignored the failure this time, Service X is still down, so Node A
 should monitor the service and send the failure status to Node B again, and at
 that point Node B should restart the service. Why is this not happening?
 
 
 For the failure timeout, my understanding is:
 
   • Node A sends the failure event for Service X to Node B (DC) at time T, the
 failcount of Service X on Node A reaches infinity, and Node A is the only node
 where Service X can run.
   • At T + failure-timeout seconds, Node B (DC) will set the
 failcount of Service X on Node A to zero and restart Service X on
 Node A.
 
 
 According to you, Node B will ignore the Service X failure on Node A after the failure
 timeout has elapsed. From which point does Node B start counting those seconds?
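
To make the timing question concrete, here is a minimal sketch of an "is this failure older than the timeout" check, using the 10 s failure timeout and the 10 minute clock difference mentioned in this thread. It is only an illustration: the function failure_expired is hypothetical, and the idea that a timestamp written by one node is compared against another node's clock is an assumption that nothing in this thread confirms.

```python
def failure_expired(failure_time, now, failure_timeout):
    """True once more than failure_timeout seconds have passed since the failure."""
    return (now - failure_time) > failure_timeout


FAILURE_TIMEOUT = 10   # seconds, as configured in this thread
CLOCK_SKEW = 10 * 60   # seconds, the 10 minute difference between the two nodes

# Hypothetical scenario: Node A stamps the failure with its own clock, and
# Node B's clock runs 10 minutes ahead of Node A's.
failure_time_on_a = 1000.0
now_on_b = failure_time_on_a + CLOCK_SKEW + 1.0   # one real second later

if __name__ == "__main__":
    # With skewed clocks the failure already looks older than the timeout.
    print(failure_expired(failure_time_on_a, now_on_b, FAILURE_TIMEOUT))
    # With synchronised clocks the same failure is still fresh after one second.
    print(failure_expired(failure_time_on_a, failure_time_on_a + 1.0, FAILURE_TIMEOUT))
```

Under that assumption, a timeout much shorter than the clock difference would make every failure look expired as soon as it is reported, so both the timeout value and the time synchronisation are worth checking.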
 
 
 


Re: [Pacemaker] Linux HA setup for CentOS 6.5

2014-10-16 Thread Sihan Goi
Thanks!

OK, so I've followed the DRBD steps in the guide all the way till cib
commit fs in Section 7.4, right before Testing Migration. However, when
I do a crm_mon, I get the following failed actions.

Last updated: Thu Oct 16 17:28:34 2014
Last change: Thu Oct 16 17:26:04 2014 via crm_shadow on node01
Stack: cman
Current DC: node02 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
5 Resources configured


Online: [ node01 node02 ]

ClusterIP       (ocf::heartbeat:IPaddr2):       Started node02
 Master/Slave Set: WebDataClone [WebData]
     Masters: [ node02 ]
     Slaves: [ node01 ]
WebFS   (ocf::heartbeat:Filesystem):    Started node02

Failed actions:
    WebSite_start_0 on node02 'unknown error' (1): call=278, status=Timed Out, last-rc-change='Thu Oct 16 17:26:28 2014', queued=2ms, exec=0ms
    WebSite_start_0 on node01 'unknown error' (1): call=203, status=Timed Out, last-rc-change='Thu Oct 16 17:26:09 2014', queued=2ms, exec=0ms

It seems the Apache WebSite resource isn't starting up. Apache was
working just fine before I configured DRBD. What did I do wrong?
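
In case it helps when comparing the two failures, here is a small sketch that pulls the fields out of the "Failed actions" entries. The sample text is the two entries from the crm_mon output above with the wrapped lines rejoined; the regex and the function name parse_failed_actions are illustrative assumptions based only on that output, so other crm_mon versions may format these lines differently.

```python
import re

# One "Failed actions" entry, in the layout shown in this message.
FAILED_RE = re.compile(
    r"(?P<op>\S+) on (?P<node>\S+) '(?P<reason>[^']*)' \((?P<rc>\d+)\): "
    r"call=(?P<call>\d+), status=(?P<status>[^,]+),"
)


def parse_failed_actions(text):
    """Return one dict per failed action found in the crm_mon text."""
    entries = []
    for m in FAILED_RE.finditer(text):
        # An operation key such as WebSite_start_0 is resource_action_interval.
        resource, action, interval = m.group("op").rsplit("_", 2)
        entries.append({
            "resource": resource,
            "action": action,
            "node": m.group("node"),
            "reason": m.group("reason"),
            "rc": int(m.group("rc")),
            "status": m.group("status"),
        })
    return entries


SAMPLE = """\
WebSite_start_0 on node02 'unknown error' (1): call=278, status=Timed Out, last-rc-change='Thu Oct 16 17:26:28 2014', queued=2ms, exec=0ms
WebSite_start_0 on node01 'unknown error' (1): call=203, status=Timed Out, last-rc-change='Thu Oct 16 17:26:09 2014', queued=2ms, exec=0ms
"""

if __name__ == "__main__":
    for entry in parse_failed_actions(SAMPLE):
        print(entry["resource"], entry["action"], entry["node"], entry["status"])
```

Both entries report status=Timed Out for the start action, once on each node.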

On Thu, Oct 16, 2014 at 1:49 PM, Digimer li...@alteeve.ca wrote:

 On 16/10/14 12:14 AM, Sihan Goi wrote:

 After following the guide, I've successfully managed to get Apache
 server up and running in the cluster as an active/passive setup, but
 with some differences. My cluster stack is stated as being cman while
 the guide's is openais. Not sure if that's a problem. Also, some
 commands in the guide don't seem to work.


 If you can provide examples of the issues you're having, I will be happy
 to try and help.

  I'm moving on to DRBD installation now, but when I do a yum install
 drbd-pacemaker drbd-udev, these packages are not available. After some
 googling, it seems that drbd83-utils/kmod-drbd83 or
 drbd84-utils/kmod-drbd84 is available via another repo. Does this work
 with the guide?


 You need to get them from a 3rd party repo (or install from source). I
 personally still use 8.3.16 (consistency during Anvil! generations), but
 I know that 8.4 is fine on EL6 (and EL7, to address an earlier comment). I
 have my own repos with these packages, but you would likely be better
 served using the ELRepo ones.

 https://alteeve.ca/w/AN!Cluster_Tutorial_2#Installing_DRBD

 The only real difference is to s/83/84/:

 + yum install drbd84-utils kmod-drbd84
 - yum install drbd83-utils kmod-drbd83

 If you run into any troubles, please share details and I am sure we'll get
 you sorted out in no time.

 Cheers


 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?





-- 
- Goi Sihan
gois...@gmail.com


[Pacemaker] Stopping/restarting pacemaker without stopping resources?

2014-10-16 Thread Andrei Borzenkov
The primary goal is to transparently update software in a cluster. I
just did an HA suite update using plain RPM and observed that RPM
attempts to restart the stack (rcopenais try-restart). So

a) if it had worked, it would have meant resources being migrated away from
this node, i.e. an interruption

b) it did not work: apparently the new versions of the installed utils were
incompatible with the running pacemaker, so the request to shut down crm failed
and openais hung forever.

The usual workflow with the cluster products I have worked with before was:
stop cluster processes without stopping resources; update; restart
cluster processes. They would detect that resources are started and
return to the same state as before stopping. Is something like this
possible with pacemaker?

TIA

-andrei



[Pacemaker] meta failure-timeout: crashed resource is assumed to be Started?

2014-10-16 Thread Carsten Otto
Dear all,

I configured meta failure-timeout=60sec on all of my resources. For the
sake of simplicity, assume I have a group of two resources FIRST and
SECOND (where SECOND is started after FIRST, surprise!).

If now FIRST crashes, I see a failure, as expected. I also see that
SECOND is stopped, as expected.

Sadly, SECOND needs more than 60 seconds to stop. Thus, it can happen
that the failure-timeout for FIRST is reached, and its failure is
cleaned. This also is expected.

The problem now is that after the 60sec timeout pacemaker assumes that
FIRST is in the Started state. There is no indication about that in the
log files, and the last monitor operation which ran just a few seconds
before also indicated that FIRST is actually not running.

As a consequence of the bug, pacemaker tries to re-start SECOND on the
same system, which fails to start (as it depends on FIRST, which
actually is not running). Only then the resources are started on the
other system.

So, my question is:
Why does pacemaker assume that a previously failed resource is Started
when the meta failure-timeout is triggered? Why is the monitor
operation not invoked to determine the correct state?

The corresponding lines of the log file, about a minute after FIRST
crashed and the stop operation for SECOND was triggered:

Oct 16 16:27:20 [2100] HOSTNAME [...] (monitor operation indicating that FIRST is not running)
[...]
Oct 16 16:27:23 [2104] HOSTNAME lrmd: info: log_finished: finished - rsc:SECOND action:stop call_id:123 pid:29314 exit-code:0 exec-time:62827ms queue-time:0ms
Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: process_lrm_event: LRM operation SECOND_stop_0 (call=123, rc=0, cib-update=225, confirmed=true) ok
Oct 16 16:27:23 [2107] HOSTNAME crmd: info: match_graph_event: Action SECOND_stop_0 (74) confirmed on HOSTNAME (rc=0)
Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: run_graph: Transition 40 (Complete=5, Pending=0, Fired=0, Skipped=31, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-2937.bz2): Stopped
Oct 16 16:27:23 [2107] HOSTNAME crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/225, version=0.1450.89)
Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/226, version=0.1450.89)
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status_fencing: Node HOSTNAME is active
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: determine_online_status: Node HOSTNAME is online
[...]
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full: FIRST has failed 1 times on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Clearing expired failcount for FIRST on HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op: Re-initiated expired calculated failure FIRST_last_failure_0 (rc=7, magic=0:7;68:31:0:28c68203-6990-48fd-96cc-09f86e2b21f9) on HOSTNAME
[...]
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: group_print: Resource Group: GROUP
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print:     FIRST (ocf::heartbeat:xxx): Started HOSTNAME
Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print:     SECOND (ocf::heartbeat:yyy): Stopped
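
As a quick way to spot this race in other logs, here is a rough sketch that flags operations whose exec-time exceeds the failure-timeout. The sample line is the lrmd entry from the log above, rejoined onto one line; the regex and the function name slow_operations are illustrative assumptions based only on that entry, not a documented log format.

```python
import re

# Picks the resource, action and exec-time out of an lrmd "finished" entry.
EXEC_TIME_RE = re.compile(
    r"rsc:(?P<rsc>\S+) action:(?P<action>\S+).*?exec-time:(?P<ms>\d+)ms"
)


def slow_operations(log_text, failure_timeout_s):
    """Return (resource, action, seconds) for operations longer than the timeout."""
    slow = []
    for m in EXEC_TIME_RE.finditer(log_text):
        seconds = int(m.group("ms")) / 1000.0
        if seconds > failure_timeout_s:
            slow.append((m.group("rsc"), m.group("action"), seconds))
    return slow


LOG = ("Oct 16 16:27:23 [2104] HOSTNAME lrmd: info: log_finished: "
       "finished - rsc:SECOND action:stop call_id:123 pid:29314 "
       "exit-code:0 exec-time:62827ms queue-time:0ms")

if __name__ == "__main__":
    # failure-timeout=60sec as configured on all resources in this setup
    print(slow_operations(LOG, failure_timeout_s=60))
```

Here the stop of SECOND ran for 62827 ms, slightly longer than the 60 second failure-timeout, which matches the description of the failure on FIRST expiring while SECOND was still stopping.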

Thank you,
Carsten
-- 
andrena objects ag
Frankfurt office
Clemensstr. 8
60487 Frankfurt

Tel: +49 (0) 69 977 860 38
Fax: +49 (0) 69 977 860 39
http://www.andrena.de

Executive board: Hagen Buchwald, Matthias Grund, Dr. Dieter Kuhn
Chairman of the supervisory board: Rolf Hetzelberger

Registered office: Karlsruhe
Mannheim District Court, HRB 109694
VAT ID: DE174314824

Please also note our upcoming events:
http://www.andrena.de/events



Re: [Pacemaker] Raid RA Changes to Enable ms configuration -- need some assistance plz.

2014-10-16 Thread Errol Neal
Andrew Beekhof andrew@... writes:

 
 Yes. If you want the cluster to start things in a particular order, 
then you need to specify it.

Andrew, but my issue isn't getting the resources to start in a
specific order. My issue is that I can't get the slave resource to be
promoted when the previous master goes offline.


