Re: [Pacemaker] First confused (then enlightened ? :)

2011-02-15 Thread Florian Haas
On 2011-02-14 22:37, Carlos G Mendioroz wrote:
 Andrew Beekhof @ 14/02/2011 05:44 -0300 dixit:
 -Is it still the case that Heartbeat is not to be considered for new
 deployments ? (I read something along that line)

 pretty much

 http://www.clusterlabs.org/wiki/FAQ#Should_I_Run_Pacemaker_on_Heartbeat_or_Coroysnc.3F

 
 That was the place I was referring to. Still, the thing is pretty
 confusing. DRBD talks about heartbeat in its description.

Happy to take a patch for the User's Guide.

 When
 you install pacemaker using package managers, heartbeat is in the
 dependency list. And it goes on...

Only on Debian/Ubuntu, which has a dependency on corosync OR heartbeat
(something that many other distros don't even support). All of this is
there to make rolling upgrades from the previous major release possible --
something that no other distro supports.
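
(If you want to see this on a Debian/Ubuntu box of yours -- a quick
sketch, assuming apt is available:

  apt-cache depends pacemaker | grep -iE 'corosync|heartbeat'

should show the alternative dependency in question.)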

Florian





[Pacemaker] [Problem]post_notify_start_0 is carried out in the node that disappeared.

2011-02-15 Thread renayama19661014
Hi all,

We tested a failure at the start of the Master/Slave resource.

Step1) We start the first node and send cib.


Last updated: Thu Feb 10 16:32:12 2011
Stack: Heartbeat
Current DC: srv01 (c7435833-8bc5-43aa-8195-c666b818677f) - partition with quorum
Version: 1.0.10-b0266dd5ffa9c51377c68b1f29d6bc84367f51dd
1 Nodes configured, unknown expected votes
5 Resources configured.


Online: [ srv01 ]

prmIpPostgreSQLDB  (ocf::heartbeat:IPaddr2):   Started srv01
Resource Group: grpStonith2
 prmStonith2-2  (stonith:external/ssh): Started srv01
 prmStonith2-3  (stonith:meatware): Started srv01
Master/Slave Set: msPostgreSQLDB
 Masters: [ srv01 ]
 Stopped: [ prmApPostgreSQLDB:1 ]
Clone Set: clnPingd
 Started: [ srv01 ]
 Stopped: [ prmPingd:1 ]

Migration summary:
* Node srv01: 


Step2) We modify the Stateful RA on the second node.
(snip)
stateful_start() {
    ocf_log info "Start of Stateful."
    sleep 120    # <- added sleep
    stateful_check_state master
(snip)

Step3) We start the second node.

Step4) We confirm the sleep is running, then reboot the second node.
[root@srv02 ~]# ps -ef |grep sleep

Step5) The first node detects the disappearance of the second node.
* However, STONITH is delayed because post_notify_start_0 of the second
node is carried out.
* On the srv02 node that disappeared, post_notify_start_0 does not need
to be executed.
* STONITH should be carried out immediately. (At present, STONITH is kept
waiting for the time-out of post_notify_start_0.)


--(snip)
Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=3, nodes=1, new=0,
lost=1 n_idx=0, new_idx=1, old_idx=3
Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: CURRENT: srv01 
[nodeid=0, born=3]
Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: LOST:srv02 
[nodeid=1, born=2]
Feb 10 16:33:18 srv01 crmd: [4293]: info: ais_status_callback: status: srv02 is 
now lost (was member)
Feb 10 16:33:18 srv01 crmd: [4293]: info: crm_update_peer: Node srv02: id=1 
state=lost (new)
addr=(null) votes=-1 born=2 seen=2 proc=0200
Feb 10 16:33:18 srv01 crmd: [4293]: info: erase_node_from_join: Removed node 
srv02 from join
calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
Feb 10 16:33:18 srv01 crmd: [4293]: info: populate_cib_nodes_ha: Requesting the 
list of configured
nodes
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 36 
fired and confirmed
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 39 
fired and confirmed
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 75: 
notify
prmApPostgreSQLDB:0_post_notify_start_0 on srv01 (local)
Feb 10 16:33:19 srv01 crmd: [4293]: info: do_lrm_rsc_op: Performing
key=75:7:0:6918f8dc-fe1a-4c28-8aff-e8ac7a5e7143 op=prmApPostgreSQLDB:0_notify_0 
)
Feb 10 16:33:19 srv01 lrmd: [4290]: info: rsc:prmApPostgreSQLDB:0:24: notify
Feb 10 16:33:19 srv01 cib: [4289]: info: cib_process_request: Operation 
complete: op cib_modify for
section nodes (origin=local/crmd/99, version=0.9.22): ok (rc=0)
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 76: 
notify
prmApPostgreSQLDB:1_post_notify_start_0 on srv02
Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: 
(prmApPostgreSQLDB:0:notify:stdout) usage:
/usr/lib/ocf/resource.d//pacemaker/Stateful 
{start|stop|promote|demote|monitor|validate-all|meta-data}
Expects to have a fully populated OCF RA-compliant environment set.
Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation
prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, confirmed=true) ok
Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action
prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0)
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 40 
fired and confirmed
Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer popped 
(timeout=2,
abort_level=100, complete=false)
Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting transition, 
action lost: [Action 76]:
Failed (id: prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, priority: 
100)
Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: 
action_timer_callback:486 -
Triggered transition abort (complete=0) : Action lost
Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76:
prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out
Feb 10 16:34:39 srv01 crmd: [4293]: info: run_graph:

Feb 10 16:34:39 srv01 crmd: [4293]: notice: run_graph: Transition 7 
(Complete=16, Pending=0, Fired=0,
Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-7.bz2): Stopped
Feb 10 16:34:39 srv01 crmd: [4293]: info: 

Re: [Pacemaker] First confused (then enlightened ? :)

2011-02-15 Thread Dan Frincu
Hi,

snip

Is there a searchable repository of the list content so I may find out
 if some of my doubts are already explained ?


 Answering myself, I found that this (and some related lists) are archived
 and indexed at GossamerThreads:
 http://www.gossamer-threads.com/lists/linuxha
 I usually find that indexing a list like this is an invaluable tool, so
 here it is for the record.


For future reference, maybe this method will help someone else. From
http://www.clusterlabs.org/wiki/Mailing_lists there are 3 main archives:
- http://oss.clusterlabs.org/pipermail/pacemaker
- http://lists.linux-ha.org/pipermail/linux-ha
- http://lists.linux-foundation.org/pipermail/openais
+ 1 for drbd
- http://lists.linbit.com/pipermail/drbd-user/

What I do is take the gzipped archives from all of the above, extract
them as text and index them with Google Desktop for quick reference.

Here's the one-liner to do that:

for i in http://oss.clusterlabs.org/pipermail/pacemaker \
         http://lists.linux-ha.org/pipermail/linux-ha \
         http://lists.linux-foundation.org/pipermail/openais \
         http://lists.linbit.com/pipermail/drbd-user ; do
    mkdir -p $(pwd)/${i##*/}
    # scrape the archive index page for the gzipped monthly files
    for j in $(wget $i -O - 2>/dev/null | \
        awk -F '"' -v var=$i '/\.gz/ {print var "/" $2}') ; do
        wget $j -P $(pwd)/${i##*/} 2>/dev/null
    done
    gunzip $(pwd)/${i##*/}/*.gz 2>/dev/null
done

Regards,
Dan

-- 
Dan Frincu
CCNA, RHCE


Re: [Pacemaker] First confused (then enlightened ? :)

2011-02-15 Thread Andrew Beekhof
On Mon, Feb 14, 2011 at 10:37 PM, Carlos G Mendioroz t...@huapi.ba.ar wrote:
 Andrew Beekhof @ 14/02/2011 05:44 -0300 dixit:

 -Is it still the case that Heartbeat is not to be considered for new
 deployments ? (I read something along that line)

 pretty much


 http://www.clusterlabs.org/wiki/FAQ#Should_I_Run_Pacemaker_on_Heartbeat_or_Coroysnc.3F

 That was the place I was referring to.

That section is 100% accurate.

 Still, the thing is pretty confusing.
 DRBD talks about heartbeat in its description. When
 you install pacemaker using package managers, heartbeat is in the
 dependency list. And it goes on...

Well, it's still completely supported, but as a developer community
we're moving away from it.
So particularly for people coming to clustering for the first time, it
doesn't make much sense to learn a deprecated/dead technology.


 also have a look at clusters from scratch:
   http://www.clusterlabs.org/doc

 Reading now. Nice doc. Would you accept errata items ?

Of course.
Well, actually, only for the 1.1 version - that's the only version we
generate from docbook format.

 Maybe I should PM you, but on page 10, pcmk-2 is supposed to be
 19.168.9.42 . Typo^2 ? 192.168.122.102 ?

That's been fixed in the 1.1 version

 Also, ...add additional entries for the three machines. Three ?

Yeah, adding a third machine was going to be part of the guide.
But I never got to that part.  Fixed.



 --
 Carlos G Mendioroz  t...@huapi.ba.ar  LW7 EQI  Argentina





Re: [Pacemaker] [Problem]post_notify_start_0 is carried out in the node that disappeared.

2011-02-15 Thread Andrew Beekhof
On Tue, Feb 15, 2011 at 9:32 AM,  renayama19661...@ybb.ne.jp wrote:
 Hi all,

 We tested a failure at the start of the Master/Slave resource.

 Step1) We start the first node and send cib.

 
 Last updated: Thu Feb 10 16:32:12 2011
 Stack: Heartbeat
 Current DC: srv01 (c7435833-8bc5-43aa-8195-c666b818677f) - partition with 
 quorum
 Version: 1.0.10-b0266dd5ffa9c51377c68b1f29d6bc84367f51dd
 1 Nodes configured, unknown expected votes
 5 Resources configured.
 

 Online: [ srv01 ]

 prmIpPostgreSQLDB      (ocf::heartbeat:IPaddr2):       Started srv01
 Resource Group: grpStonith2
     prmStonith2-2      (stonith:external/ssh): Started srv01
     prmStonith2-3      (stonith:meatware):     Started srv01
 Master/Slave Set: msPostgreSQLDB
     Masters: [ srv01 ]
     Stopped: [ prmApPostgreSQLDB:1 ]
 Clone Set: clnPingd
     Started: [ srv01 ]
     Stopped: [ prmPingd:1 ]

 Migration summary:
 * Node srv01:


 Step2) We modify the Stateful RA on the second node.
 (snip)
 stateful_start() {
     ocf_log info "Start of Stateful."
     sleep 120                             # <- added sleep
     stateful_check_state master
 (snip)

 Step3) We start the second node.

 Step4) We confirm the sleep is running, then reboot the second node.
 [root@srv02 ~]# ps -ef |grep sleep

 Step5) The first node detects the disappearance of the second node.
 * However, STONITH is delayed because post_notify_start_0 of the second
 node is carried out.

Wait, what?
Why would post_notify_start_0 of prmApPostgreSQLDB block stonith?

You didn't put a stonith resource in an ordering constraint did you?
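
(For anyone wanting to check: ordering constraints show up as "order"
lines in the configuration. A sketch follows -- the first line is a
made-up example of exactly what NOT to do, in crm shell syntax:

  # wrong: makes cluster transitions wait on the stonith group
  order bad-order inf: grpStonith2 msPostgreSQLDB

  # list any ordering constraints in the live configuration
  crm configure show | grep '^order'
)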


 * On the srv02 node that disappeared, post_notify_start_0 does not need
 to be executed.
 * STONITH should be carried out immediately. (At present, STONITH is kept
 waiting for the time-out of post_notify_start_0.)


 --(snip)
 Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: NEW MEMBERSHIP: 
 trans=3, nodes=1, new=0,
 lost=1 n_idx=0, new_idx=1, old_idx=3
 Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail:     CURRENT: 
 srv01 [nodeid=0, born=3]
 Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail:     LOST:    
 srv02 [nodeid=1, born=2]
 Feb 10 16:33:18 srv01 crmd: [4293]: info: ais_status_callback: status: srv02 
 is now lost (was member)
 Feb 10 16:33:18 srv01 crmd: [4293]: info: crm_update_peer: Node srv02: id=1 
 state=lost (new)
 addr=(null) votes=-1 born=2 seen=2 proc=0200
 Feb 10 16:33:18 srv01 crmd: [4293]: info: erase_node_from_join: Removed node 
 srv02 from join
 calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
 Feb 10 16:33:18 srv01 crmd: [4293]: info: populate_cib_nodes_ha: Requesting 
 the list of configured
 nodes
 Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 36 
 fired and confirmed
 Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 39 
 fired and confirmed
 Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 
 75: notify
 prmApPostgreSQLDB:0_post_notify_start_0 on srv01 (local)
 Feb 10 16:33:19 srv01 crmd: [4293]: info: do_lrm_rsc_op: Performing
 key=75:7:0:6918f8dc-fe1a-4c28-8aff-e8ac7a5e7143 
 op=prmApPostgreSQLDB:0_notify_0 )
 Feb 10 16:33:19 srv01 lrmd: [4290]: info: rsc:prmApPostgreSQLDB:0:24: notify
 Feb 10 16:33:19 srv01 cib: [4289]: info: cib_process_request: Operation 
 complete: op cib_modify for
 section nodes (origin=local/crmd/99, version=0.9.22): ok (rc=0)
 Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 
 76: notify
 prmApPostgreSQLDB:1_post_notify_start_0 on srv02
 Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: 
 (prmApPostgreSQLDB:0:notify:stdout) usage:
 /usr/lib/ocf/resource.d//pacemaker/Stateful 
 {start|stop|promote|demote|monitor|validate-all|meta-data}
 Expects to have a fully populated OCF RA-compliant environment set.
 Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation
 prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, confirmed=true) 
 ok
 Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action
 prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0)
 Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 40 
 fired and confirmed
 Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer popped 
 (timeout=2,
 abort_level=100, complete=false)
 Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting transition, 
 action lost: [Action 76]:
 Failed (id: prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, priority: 
 100)
 Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: 
 action_timer_callback:486 -
 Triggered transition abort (complete=0) : Action lost
 Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76:
 prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out
 Feb 10 

Re: [Pacemaker] First confused (then enlightened ? :)

2011-02-15 Thread Carlos G Mendioroz

Andrew Beekhof @ 15/02/2011 04:25 -0300 dixit:

From what I understand, you want the brains of the action in pacemaker,
so VRRP, HSRP or (U)CARP seem more trouble than a solution
(i.e. twin head), right ?

In other words, it seems to align better with the idea of the solution to
have pacemaker decide and some script set do the changing.


What you typically want to avoid is having two isolated entities
trying to make decisions in the cluster - pulling it to pieces in the
process.

Right, makes a lot of sense, only one boss in the office and one place
to define policy.
But to integrate with protocols conceived as independent, like VRRP
or (U)CARP, the dependency has to be implemented.


Something like DRBD solves this by using crm_master to tell Pacemaker
which instance it would like promoted, but not actually doing the
promotion itself.

I don't know if this is feasible for your application.
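
(For reference, that mechanism is just a small call inside the RA -- a
sketch, not DRBD's actual code:

  # bid for the master role on this node (higher value = preferred),
  # keeping the preference only until the next reboot
  crm_master -l reboot -v 100

  # withdraw the bid, e.g. when replication falls behind
  crm_master -l reboot -D
)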


In my case, it seems better to get rid of VRRP and rely on pacemaker's
more comprehensive view.


Nevertheless, I don't see the concerns about MAC mutation being addressed
anywhere. And I have my suspicions about ARP caches too.


Both would be properties of the RA itself rather than Pacemaker or Heartbeat.
So if you can script MAC mutation, you can also create an RA for it
(or add it to an existing one).
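
(A rough sketch of what such an RA's start action might do -- the
interface, MAC and address below are made up:

  ip link set dev eth0 down
  ip link set dev eth0 address 02:00:00:00:00:01
  ip link set dev eth0 up
  # nudge the neighbours' ARP caches with gratuitous ARP
  arping -U -I eth0 -c 3 192.168.122.200
)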


Is there a guide to implementing RAs ?
I've seen that the shell can list them. Are they embedded, or is it just
showing a directory of entities found in some predefined places ?


I'm currently thinking about a couple of ideas:
-using mac-vlan to move an active mac from one server to another
-using bonding to have something like a MEC, multichassis ether channel.
(i.e. a way to not only migrate the MAC but also to signal the migration
to the attachment switch using 802.1ad)

Are there any statistics on how much time it takes to migrate
an IP address with the current resource ? (IPaddr2 I guess)
I'm looking for a subsecond delay from failure detection,
and, I guess it's obvious, an active-standby setup.


I've not done any measurements lately.
Mostly it's dependent on how long the RA takes.
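
(For concreteness, a typical definition of such a resource in crm shell
syntax -- address and timings made up:

  primitive vip ocf:heartbeat:IPaddr2 \
      params ip=192.168.122.200 cidr_netmask=24 \
      op monitor interval=1s timeout=20s

Total failover time is then roughly detection time plus the stop on one
node plus the start on the other.)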


Ok, now I'm getting into the RA arena, I guess.
For speedy failover, I would need a hot standby approach. Is that a
state known to pacemaker ?


--
Carlos G Mendioroz  t...@huapi.ba.ar  LW7 EQI  Argentina



Re: [Pacemaker] [Problem]post_notify_start_0 is carried out in the node that disappeared.

2011-02-15 Thread renayama19661014
Hi Andrew,

Thank you for comment.

 Perhaps I misunderstood - does the node fail _while_ we're running 
 post_notify_start_0?
 Is that the ordering you're talking about?

Yes.
I think that STONITH does not have to wait for post_notify_start_0 of the
inoperative node.

 If so, then the crmd is already supposed to be smart enough not to bother 
 waiting for those actions - perhaps the logic got broken at some point.

If you need detailed information, please contact me.

Best Regards,
Hideo Yamauchi.


--- On Tue, 2011/2/15, Andrew Beekhof and...@beekhof.net wrote:

 
 On Feb 15, 2011, at 12:10 PM, renayama19661...@ybb.ne.jp wrote:
 
  Hi Andrew,
  
  Thank you for comment.
  
  Sorry... I may have misunderstood your opinion.
  
  Wait, what?
  Why would post_notify_start_0 of prmApPostgreSQLDB block stonith?
  
  Yes.
  That is how it looked to me.
  crmd seems to keep processing waiting for post_notify_start_0 to time out.
  
  Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 
  76: notify prmApPostgreSQLDB:1_post_notify_start_0 on srv02
  Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: 
  (prmApPostgreSQLDB:0:notify:stdout) usage: 
  /usr/lib/ocf/resource.d//pacemaker/Stateful 
  {start|stop|promote|demote|monitor|validate-all|meta-data}  Expects to have 
  a fully populated OCF RA-compliant environment set. 
  Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation 
  prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, 
  confirmed=true) ok
  Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action 
  prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0)
  Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 
  40 fired and confirmed
  Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer 
  popped (timeout=2, abort_level=100, complete=false)
  Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting transition, 
  action lost: [Action 76]: Failed (id: 
  prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, priority: 100)
  Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: 
  action_timer_callback:486 - Triggered transition abort (complete=0) : 
  Action lost
  Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76: 
  prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out
  
  
  You didn't put a stonith resource in an ordering constraint did you?
  I did not set an ordering constraint for stonith.
  We want to carry out STONITH without waiting for the time-out of
  post_notify_start_0.
  Is there a method to solve this problem?
 
 Perhaps I misunderstood - does the node fail _while_ we're running 
 post_notify_start_0?
 Is that the ordering you're talking about?
 
 If so, then the crmd is already supposed to be smart enough not to bother 
 waiting for those actions - perhaps the logic got broken at some point.
 
  
  Best Regards,
  Hideo Yamauchi.
  
  
  
  --- On Tue, 2011/2/15, Andrew Beekhof and...@beekhof.net wrote:
  
  On Tue, Feb 15, 2011 at 9:32 AM,  renayama19661...@ybb.ne.jp wrote:
  Hi all,
  
  We tested a failure at the start of the Master/Slave resource.
  
  Step1) We start the first node and send cib.
  
  
  Last updated: Thu Feb 10 16:32:12 2011
  Stack: Heartbeat
  Current DC: srv01 (c7435833-8bc5-43aa-8195-c666b818677f) - partition with 
  quorum
  Version: 1.0.10-b0266dd5ffa9c51377c68b1f29d6bc84367f51dd
  1 Nodes configured, unknown expected votes
  5 Resources configured.
  
  
  Online: [ srv01 ]
  
  prmIpPostgreSQLDB      (ocf::heartbeat:IPaddr2):       Started srv01
  Resource Group: grpStonith2
      prmStonith2-2      (stonith:external/ssh): Started srv01
      prmStonith2-3      (stonith:meatware):     Started srv01
  Master/Slave Set: msPostgreSQLDB
      Masters: [ srv01 ]
      Stopped: [ prmApPostgreSQLDB:1 ]
  Clone Set: clnPingd
      Started: [ srv01 ]
      Stopped: [ prmPingd:1 ]
  
  Migration summary:
  * Node srv01:
  
  
  Step2) We modify the Stateful RA on the second node.
  (snip)
  stateful_start() {
      ocf_log info "Start of Stateful."
      sleep 120                             # <- added sleep
      stateful_check_state master
  (snip)
  
  Step3) We start the second node.
  
  Step4) We confirm the sleep is running, then reboot the second node.
  [root@srv02 ~]# ps -ef |grep sleep
  
  Step5) The first node detects the disappearance of the second node.
  * However, STONITH is delayed because post_notify_start_0 of the second
  node is carried out.
  
  Wait, what?
  Why would post_notify_start_0 of prmApPostgreSQLDB block stonith?
  
  You didn't put a stonith resource in an ordering constraint did you?
  
  
  * On the srv02 node that disappeared, post_notify_start_0 does not need
  to be executed.
  * STONITH should be carried out immediately. (STONITH is kept waiting for the
  time-out of
  

Re: [Pacemaker] [Problem]post_notify_start_0 is carried out in the node that disappeared.

2011-02-15 Thread Andrew Beekhof
On Tue, Feb 15, 2011 at 3:01 PM,  renayama19661...@ybb.ne.jp wrote:
 Hi Andrew,

 Thank you for comment.

 Perhaps I misunderstood - does the node fail _while_ we're running 
 post_notify_start_0?
 Is that the ordering you're talking about?

 Yes.
 I think that STONITH does not have to wait for post_notify_start_0 of the
 inoperative node.

 If so, then the crmd is already supposed to be smart enough not to bother 
 waiting for those actions - perhaps the logic got broken at some point.

 If you need detailed information, please contact me.

Should be enough in the bug; I'll follow up there.



[Pacemaker] Packages for Opensuse 11.3 don't build / install

2011-02-15 Thread Holger Teutsch
Hi,
the packages from rpm-next (64-bit) for openSUSE 11.3 do not install there
(at least true for 1.1.4 and 1.1.5).

The plugin is in
./usr/lib/lcrso/pacemaker.lcrso

but should be in
./usr/lib64/lcrso/pacemaker.lcrso

I think the patch below (borrowed from the 'official' packages) cures it.
Regards,
Holger


diff -r 43a11c0daae4 pacemaker.spec
--- a/pacemaker.spec    Mon Feb 14 15:25:13 2011 +0100
+++ b/pacemaker.spec    Tue Feb 15 17:50:27 2011 +0100
@@ -1,3 +1,7 @@
+%if 0%{?suse_version}
+%define _libexecdir %{_libdir}
+%endif
+
 %global gname haclient
 %global uname hacluster
 %global pcmk_docdir %{_docdir}/%{name}
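
(A quick way to verify where an installed package actually put the
plugin, assuming an rpm-based system:

  rpm -ql pacemaker | grep lcrso
)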





[Pacemaker] Pacemaker/Corosync Professional Services Help

2011-02-15 Thread Papadopoulos, Leo
Dear Mailing List:

My company is investigating the use of Pacemaker/Corosync and would like to
create a proof of concept. We need professional services from a hands-on
developer with detailed knowledge of Pacemaker/Corosync, Linux scripting
experience, and general knowledge of the Linux OS. Please contact me if you
or someone you know is able to provide such a service.
__
Leo Papadopoulos (leo.papadopou...@ipc.com)
Chief Technology Officer
IPC Systems
777 Commerce Drive
Fairfield, CT 06825-5500
Virtual Number: +1(203) 539-0448





[Pacemaker] corosync+pacemaker support disk heartbeating?

2011-02-15 Thread jiaju liu
Hi all,
I think there is something wrong with cluster communication, resulting in
node reboots, so I want to use disk heartbeating. I use corosync-1.2.2-1.1.el5
and pacemaker-1.0.9.1-1.el5. Is there any guide that tells me how to implement
disk heartbeating with corosync and pacemaker? Thanks a lot.




Re: [Pacemaker] First confused (then enlightened ? :)

2011-02-15 Thread Dan Frincu
On Tue, Feb 15, 2011 at 1:02 PM, Carlos G Mendioroz t...@huapi.ba.ar wrote:

 Andrew Beekhof @ 15/02/2011 04:25 -0300 dixit:

  From what I understand, you want the brains of the action in pacemaker,
 so VRRP, HSRP or (U)CARP seem more trouble than a solution
 (i.e. twin head), right ?

 In other words, it seems to align better with the idea of the solution to
 have pacemaker decide and some script set do the changing.


 What you typically want to avoid is having two isolated entities
 trying to make decisions in the cluster - pulling it to pieces in the
 process.

 Right, makes a lot of sense, only one boss in the office and one place
 to define policy.
 But to integrate with protocols conceived as independent, like VRRP
 or (U)CARP, the dependency has to be implemented.


  Something like DRBD solves this by using crm_master to tell Pacemaker
 which instance it would like promoted, but not actually doing the
 promotion itself.

 I don't know if this is feasible for your application.

  In my case, it seems better to get rid of VRRP and rely on pacemaker's
 more comprehensive view.


  Nevertheless, I don't see the concerns about MAC mutation being addressed
 anywhere. And I have my suspicions about ARP caches too.


 Both would be properties of the RA itself rather than Pacemaker or
 Heartbeat.
 So if you can script MAC mutation, you can also create an RA for it
 (or add it to an existing one).


  Is there a guide to implementing RAs ?
  I've seen that the shell can list them. Are they embedded, or is it just
  showing a directory of entities found in some predefined places ?


http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html - The OCF Resource Agent Developer's Guide
http://www.linux-ha.org/wiki/Resource_Agents - Resource Agents
http://www.linux-ha.org/wiki/OCF_Resource_Agents - OCF Resource Agents
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf - OCF Resource Agents
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/index.html#id2281146 - Listing Resource Agents
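
For a first feel of the shape of an agent before diving into the guide,
here's a minimal sketch (a made-up Dummy-style RA, not a drop-in for
anything shipped with the cluster stack):

#!/bin/sh
# Minimal OCF RA sketch. Real agents source
# ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs rather than
# hard-coding the standard OCF exit codes as done here.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Track state with a flag file; OCF_RESOURCE_INSTANCE is set by the LRM.
STATE="/var/run/myra-${OCF_RESOURCE_INSTANCE:-default}.state"

case "$1" in
start)     touch "$STATE"; exit $OCF_SUCCESS;;
stop)      rm -f "$STATE"; exit $OCF_SUCCESS;;
monitor)   [ -f "$STATE" ] && exit $OCF_SUCCESS; exit $OCF_NOT_RUNNING;;
meta-data) cat <<EOF
<?xml version="1.0"?>
<resource-agent name="myra">
  <version>0.1</version>
  <actions>
    <action name="start" timeout="20"/>
    <action name="stop" timeout="20"/>
    <action name="monitor" timeout="20" interval="10"/>
    <action name="meta-data" timeout="5"/>
  </actions>
</resource-agent>
EOF
           exit $OCF_SUCCESS;;
*)         exit $OCF_ERR_UNIMPLEMENTED;;
esac

Drop something like that under /usr/lib/ocf/resource.d/<provider>/ and
make it executable, and the shell will list it -- which also answers the
question above: the shell is essentially showing the contents of those
predefined directories.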

HTH




  I'm currently thinking about a couple of ideas:
 -using mac-vlan to move an active mac from one server to another
 -using bonding to have something like a MEC, multichassis ether channel.
 (i.e. a way to not only migrate the MAC but also to signal the migration
 to the attachment switch using 802.1ad)

  Are there any statistics on how much time it takes to migrate
  an IP address with the current resource ? (IPaddr2 I guess)
  I'm looking for a subsecond delay from failure detection,
  and, I guess it's obvious, an active-standby setup.


 I've not done any measurements lately.
  Mostly it's dependent on how long the RA takes.


  Ok, now I'm getting into the RA arena, I guess.
  For speedy failover, I would need a hot standby approach. Is that a
  state known to pacemaker ?

 --
 Carlos G Mendioroz  t...@huapi.ba.ar  LW7 EQI  Argentina





-- 
Dan Frincu
CCNA, RHCE


Re: [Pacemaker] First confused (then enlightened ? :)

2011-02-15 Thread Dan Frincu
On Wed, Feb 16, 2011 at 9:32 AM, Dan Frincu df.clus...@gmail.com wrote:

 On Tue, Feb 15, 2011 at 1:02 PM, Carlos G Mendioroz t...@huapi.ba.ar wrote:

 Andrew Beekhof @ 15/02/2011 04:25 -0300 dixit:

   From what I understand, you want the brains of the action in pacemaker,
  so VRRP, HSRP or (U)CARP seem more trouble than a solution
  (i.e. twin head), right ?

  In other words, it seems to align better with the idea of the solution to
  have pacemaker decide and some script set do the changing.


 What you typically want to avoid is having two isolated entities
 trying to make decisions in the cluster - pulling it to pieces in the
 process.

 Right, makes a lot of sense, only one boss in the office and one place
 to define policy.
  But to integrate with protocols conceived as independent, like VRRP
  or (U)CARP, the dependency has to be implemented.


  Something like DRBD solves this by using crm_master to tell Pacemaker
 which instance it would like promoted, but not actually doing the
 promotion itself.

 I don't know if this is feasible for your application.

   In my case, it seems better to get rid of VRRP and rely on pacemaker's
  more comprehensive view.


   Nevertheless, I don't see the concerns about MAC mutation being addressed
  anywhere. And I have my suspicions about ARP caches too.


 Both would be properties of the RA itself rather than Pacemaker or
 Heartbeat.
 So if you can script MAC mutation, you can also create an RA for it
 (or add it to an existing one).


  Is there a guide to implementing RAs ?
  I've seen that the shell can list them. Are they embedded, or is it just
  showing a directory of entities found in some predefined places ?


 http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html - The OCF Resource Agent Developer's Guide
 http://www.linux-ha.org/wiki/Resource_Agents - Resource Agents
 http://www.linux-ha.org/wiki/OCF_Resource_Agents - OCF Resource Agents
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf - OCF Resource Agents
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/index.html#id2281146 - Listing Resource Agents


And not to forget:

http://www.linux-ha.org/doc/users-guide/users-guide.html - The Linux-HA User’s Guide
http://www.linux-ha.org/doc/man-pages/man-pages.html - Linux-HA Manual Pages


 HTH




  I'm currently thinking about a couple of ideas:
 -using mac-vlan to move an active mac from one server to another
  -using bonding to have something like a MEC, multichassis ether channel.
 (i.e. a way to not only migrate the MAC but also to signal the migration
 to the attachment switch using 802.1ad)

  Are there any statistics on how much time it takes to migrate
  an IP address with the current resource ? (IPaddr2 I guess)
  I'm looking for a subsecond delay from failure detection,
  and, I guess it's obvious, an active-standby setup.


 I've not done any measurements lately.
  Mostly it's dependent on how long the RA takes.


  Ok, now I'm getting into the RA arena, I guess.
  For speedy failover, I would need a hot standby approach. Is that a
  state known to pacemaker ?

 --
 Carlos G Mendioroz  t...@huapi.ba.ar  LW7 EQI  Argentina





 --
 Dan Frincu
 CCNA, RHCE




-- 
Dan Frincu
CCNA, RHCE