Re: [Pacemaker] [Partially SOLVED] pacemaker/dlm problems

2011-12-08 Thread Andrew Beekhof
On Fri, Dec 9, 2011 at 3:16 PM, Vladislav Bogdanov  wrote:
> 09.12.2011 03:11, Andrew Beekhof wrote:
>> On Fri, Dec 2, 2011 at 1:32 AM, Vladislav Bogdanov  
>> wrote:
>>> Hi Andrew,
>>>
>>> I investigated on my test cluster what actually happens with dlm and
>>> fencing.
>>>
>>> I added more debug messages to dlm dump, and also did a re-kick of nodes
>>> after some time.
>>>
>>> Results are that stonith history actually doesn't contain any
>>> information until pacemaker decides to fence node itself.
>>
>> ...
>>
>>> From my PoV that means that the call to
>>> crm_terminate_member_no_mainloop() does not actually schedule fencing
>>> operation.
>>
>> You're going to have to remind me... what does your copy of
>> crm_terminate_member_no_mainloop() look like?
>> This is with the non-cman editions of the controlds too right?
>
> Just latest github's version. You changed some dlm_controld.pcmk
> functionality, so it asks stonithd for fencing results instead of XML
> magic. But call to crm_terminate_member_no_mainloop() remains the same
> there. But yes, that version communicates stonithd directly too.
>
> SO, the problem here is just with crm_terminate_member_no_mainloop()
> which for some reason skips actual fencing request.

There should be some logs, either indicating that it tried, or that it failed.
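
For example, something along these lines (log locations vary by
distribution, so the paths below are only placeholders):

  grep -iE 'stonith|fence|crm_terminate' /var/log/messages
  grep -iE 'stonith|fence|crm_terminate' /var/log/cluster/corosync.log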

> Side note: shouldn't that wait_fencing_done functionality where it asks
> for stonith history be moved to crm API as well?

potentially

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-08 Thread Takatoshi MATSUO
Hi Attila

2011/12/8 Attila Megyeri :
> Hi Takatoshi,
>
> One strange thing I noticed and could probably be improved.
> When there is data inconsistency, I have the following node properties:
>
> * Node psql2:
>+ default_ping_set  : 100
>+ master-postgresql:1   : -INFINITY
>+ pgsql-data-status : DISCONNECT
>+ pgsql-status  : HS:alone
> * Node psql1:
>+ default_ping_set  : 100
>+ master-postgresql:0   : 1000
>+ master-postgresql:1   : -INFINITY
>+ pgsql-data-status : LATEST
>+ pgsql-master-baseline : 58:4B20
>+ pgsql-status  : PRI
>
> This is fine, and understandable - but I can see this only if I do a crm_mon 
> -A.
>
> My problem is, that CRM shows the following:
>
> Master/Slave Set: db-ms-psql [postgresql]
> Masters: [ psql1 ]
> Slaves: [ psql2 ]
>
> So if I monitor the system from crm_mon, HAWK or ther tools - I have no 
> indication at all that the slave is running in an inconsistent mode.
>
> I would expect the RA to stop the psql2 node in such cases, because:
> - It is running, but has non-up-to-date data, therefore noone will use it 
> (the slave IP points to the master as well, which is good)
> - In CRM status eveything looks perfect, even though it is NOT perfect and 
> admin intervention is required.
>
>
> Shouldn't the disconnected PSQL server be stopped instead?

Hmm...
I don't think stopping the PostgreSQL server is better.
The RA cannot know whether PostgreSQL is disconnected because of
data inconsistency, a network outage, a standby that is still
starting up, and so on.


How about using a dummy RA, placed in the same way as vip-slave?
---
primitive runningSlaveOK ocf:heartbeat:Dummy
.(snip)

location rsc_location-dummy runningSlaveOK \
 rule  200: pgsql-status eq "HS:sync"
---
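
A slightly fuller sketch of the same idea, in the style of the vip-slave
location rules (the resource name and scores are only placeholders, not
part of the pgsql RA):
---
primitive runningSlaveOK ocf:heartbeat:Dummy \
 op monitor interval="10s" timeout="20s"

location rsc_location-dummy runningSlaveOK \
 rule 200: pgsql-status eq "HS:sync" \
 rule -inf: not_defined pgsql-status \
 rule -inf: pgsql-status ne "HS:sync"
---
Then crm_mon showing runningSlaveOK as stopped is itself the hint that
the slave is not streaming in sync mode.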


Regards,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Partially SOLVED] pacemaker/dlm problems

2011-12-08 Thread Vladislav Bogdanov
09.12.2011 03:11, Andrew Beekhof wrote:
> On Fri, Dec 2, 2011 at 1:32 AM, Vladislav Bogdanov  
> wrote:
>> Hi Andrew,
>>
>> I investigated on my test cluster what actually happens with dlm and
>> fencing.
>>
>> I added more debug messages to dlm dump, and also did a re-kick of nodes
>> after some time.
>>
>> Results are that stonith history actually doesn't contain any
>> information until pacemaker decides to fence node itself.
> 
> ...
> 
>> From my PoV that means that the call to
>> crm_terminate_member_no_mainloop() does not actually schedule fencing
>> operation.
> 
> You're going to have to remind me... what does your copy of
> crm_terminate_member_no_mainloop() look like?
> This is with the non-cman editions of the controlds too right?

Just the latest version from github. You changed some dlm_controld.pcmk
functionality so that it asks stonithd for fencing results instead of
doing the XML magic, but the call to crm_terminate_member_no_mainloop()
remains the same there. And yes, that version talks to stonithd
directly too.

So the problem here is just with crm_terminate_member_no_mainloop(),
which for some reason skips the actual fencing request.

Side note: shouldn't that wait_fencing_done functionality, where it asks
for the stonith history, be moved to the crm API as well?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Partially SOLVED] pacemaker/dlm problems

2011-12-08 Thread Vladislav Bogdanov
09.12.2011 03:15, Andrew Beekhof wrote:
> On Thu, Nov 24, 2011 at 6:21 PM, Vladislav Bogdanov
>  wrote:
>> 24.11.2011 08:49, Andrew Beekhof wrote:
>>> On Thu, Nov 24, 2011 at 3:58 PM, Vladislav Bogdanov
>>>  wrote:
 24.11.2011 07:33, Andrew Beekhof wrote:
> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
>  wrote:
>> Hi Andrew,
>>
>> I just found another problem with dlm_controld.pcmk (with your latest
>> patch from github applied and also my fixes to actually build it - they
>> are included in a message referenced by this one).
>> One node which just requested fencing of another one stucks at printing
>> that message where you print ctime() in fence_node_time() (pacemaker.c
>> near 293) every second.
>
> So not blocked, it just keeps repeating that message?
> What date does it print?

 Blocked... kern_stop
>>>
>>> I'm confused.
>>
>> As well as me...
>>
>>> How can it do that every second?
>>
>> Only in one case:
> 
> I'm clearly not a kernel guy, but once the kernel is stopped, wouldn't
> it be doing nothing?
> How could the system re-hit the same condition if its stopped?

Sorry for being unclear.
kern_stop is a dlm state in which it forbids any changes to the lock list
in its kernel part. It is not a kernel panic; lock requests are simply
not served. This primarily happens when dlm notices cluster problems and
waits until fencing is done.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Partially SOLVED] pacemaker/dlm problems

2011-12-08 Thread Nick Khamis
It can't. Nothing will work at that point. Not even a simple ls. Reboot!

Nick.

On Thu, Dec 8, 2011 at 7:15 PM, Andrew Beekhof  wrote:
> On Thu, Nov 24, 2011 at 6:21 PM, Vladislav Bogdanov
>  wrote:
>> 24.11.2011 08:49, Andrew Beekhof wrote:
>>> On Thu, Nov 24, 2011 at 3:58 PM, Vladislav Bogdanov
>>>  wrote:
 24.11.2011 07:33, Andrew Beekhof wrote:
> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
>  wrote:
>> Hi Andrew,
>>
>> I just found another problem with dlm_controld.pcmk (with your latest
>> patch from github applied and also my fixes to actually build it - they
>> are included in a message referenced by this one).
>> One node which just requested fencing of another one stucks at printing
>> that message where you print ctime() in fence_node_time() (pacemaker.c
>> near 293) every second.
>
> So not blocked, it just keeps repeating that message?
> What date does it print?

 Blocked... kern_stop
>>>
>>> I'm confused.
>>
>> As well as me...
>>
>>> How can it do that every second?
>>
>> Only in one case:
>
> I'm clearly not a kernel guy, but once the kernel is stopped, wouldn't
> it be doing nothing?
> How could the system re-hit the same condition if its stopped?
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Partially SOLVED] pacemaker/dlm problems

2011-12-08 Thread Andrew Beekhof
On Thu, Nov 24, 2011 at 6:21 PM, Vladislav Bogdanov
 wrote:
> 24.11.2011 08:49, Andrew Beekhof wrote:
>> On Thu, Nov 24, 2011 at 3:58 PM, Vladislav Bogdanov
>>  wrote:
>>> 24.11.2011 07:33, Andrew Beekhof wrote:
 On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
  wrote:
> Hi Andrew,
>
> I just found another problem with dlm_controld.pcmk (with your latest
> patch from github applied and also my fixes to actually build it - they
> are included in a message referenced by this one).
> One node which just requested fencing of another one stucks at printing
> that message where you print ctime() in fence_node_time() (pacemaker.c
> near 293) every second.

 So not blocked, it just keeps repeating that message?
 What date does it print?
>>>
>>> Blocked... kern_stop
>>
>> I'm confused.
>
> As well as me...
>
>> How can it do that every second?
>
> Only in one case:

I'm clearly not a kernel guy, but once the kernel is stopped, wouldn't
it be doing nothing?
How could the system re-hit the same condition if it's stopped?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Partially SOLVED] pacemaker/dlm problems

2011-12-08 Thread Andrew Beekhof
On Fri, Dec 2, 2011 at 1:32 AM, Vladislav Bogdanov  wrote:
> Hi Andrew,
>
> I investigated on my test cluster what actually happens with dlm and
> fencing.
>
> I added more debug messages to dlm dump, and also did a re-kick of nodes
> after some time.
>
> Results are that stonith history actually doesn't contain any
> information until pacemaker decides to fence node itself.

...

> From my PoV that means that the call to
> crm_terminate_member_no_mainloop() does not actually schedule fencing
> operation.

You're going to have to remind me... what does your copy of
crm_terminate_member_no_mainloop() look like?
This is with the non-cman editions of the controlds too right?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CMAN - Pacemaker - Porftpd setup

2011-12-08 Thread Andrew Beekhof
On Wed, Dec 7, 2011 at 9:49 AM, Florian Haas  wrote:
> On Tue, Dec 6, 2011 at 3:47 PM, Bensch, Kobus
>  wrote:
>> 2.) I pasted the outcome here http://pastebin.com/uPcHiM4p
>
> So, you should be seeing lines akin to the following in your logs:
>
> ERROR: clone_rsc_colocation_rh: Cannot interleave clone ActiveFTPSite
> and WebIP because they do not support the same number of resources per
> node
> ERROR: clone_rsc_colocation_rh: Cannot interleave clone ActiveFTPSite
> and WebIP because they do not support the same number of resources per
> node
>
> Andrew: this configuration appears to come from CFS
> (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s06.html);
> that documentation seems like it needs to be updated. Either one would
> need to disable interleaving (which is now enabled by default, iirc),
> or set clone-node-max=”1” on the IPaddr2 clone.

I will likely give that document a major overhaul for Fedora 17.
I wonder if the IPaddr2 clone will behave correctly with
clone-node-max="1" (i.e. does the number of buckets get reduced when it is
stopped on a node)?
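
For reference, the change being discussed would look roughly like this
against the Clusters from Scratch example (a sketch only, using the
resource names from that document, not a tested fix):

  clone WebIP ClusterIP \
   meta globally-unique="true" clone-max="2" clone-node-max="1"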

>
> Interestingly, in 1.1.5, setting interleave=false on both clones in
> the given CIB does _not_ seem to fix the problem (ptest still
> complains about "cannot interleave"), only setting clone-node-max=1 on
> the IPaddr2 clone does. Am I doing something wrong, or does this look
> like a pengine bug?

Sounds like a bug.  Could someone create a bugzilla for that please?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Accessing GFS2 SAN drive, without Pacemaker?

2011-12-08 Thread Andrew Beekhof
On Fri, Dec 9, 2011 at 10:11 AM, Charles DeVoe wrote:

> We have a three node cluster that we are going to run dedicated services
> on each box.  That is one will be used for analysis, one for data
> collection, one for mysql.   We need to be able to access data on a shared
> SAN drive using iSCSI.  2 nodes are running Fedora 16 and 1 node on Fedora
> 14.  The SAN is formatted using the GFS2 file system.
>
> Can we run just cman to control the access or do we need pacemaker as well?


You can use GFS2 without a resource manager like Pacemaker but you would
need manual recovery for most failure conditions.


>   Also, if we need pacemaker does the SAN volume need to be added as a
> resource?


Not required, but there might be advantages.
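
If you did put the volume under cluster control, the usual shape is an
ocf:heartbeat:Filesystem clone, roughly along these lines (device name
and mount point are placeholders):

  primitive SanFS ocf:heartbeat:Filesystem \
   params device="/dev/sdX" directory="/mnt/san" fstype="gfs2" \
   op monitor interval="20s" timeout="40s"
  clone SanFSClone SanFS meta interleave="true"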


>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] faq / howto needed for cib troubleshooting

2011-12-08 Thread Andrew Beekhof
On Fri, Nov 25, 2011 at 8:44 AM, Attila Megyeri
 wrote:
> Hi Gents,
>
> I see from time to time that you are asking for "cibadmin -Ql" type outputs 
> to help people troubleshoot their problems.
>
> Currently I have an issue promoting a MS resource (the PSQL issue in the 
> previous mail) - and I would like to start troubleshooting the problem, but 
> did not find any howtos or documentation on this topic.
> Could you provide me any details on how to troubleshoot cib states?

Start with crm_mon -o
Then check what crm_simulate -L says.
Try adding additional -V arguments and grepping for your resource name.
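
For example (the resource name is a placeholder; crm_simulate writes its
verbose output to stderr):

  crm_mon -o
  crm_simulate -L
  crm_simulate -L -VVVV 2>&1 | grep postgresql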

> My current issue is that I have a MS resource that is started in slave/slave 
> mode, and the "promote" is never even called by the cib. I'd like to start 
> the research but have no idea how to do it.

Are you sure the promote doesn't happen?  No mention of it in the logs?

>
> I have read the pacemaker doc, as well as the Clusters from Scratch doc, but 
> there are no troubleshooting hints.
>
> Thank you in advance,
>
> Attila
>
> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: 2011. november 23. 16:53
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi Takatoshi, All,
>
> Thanks for your reply.
> I see that you have invested significant effort in the development of the RA. 
> I spent the last day trying to set up the RA, but without much success.
>
> My infrastructure is very similar to yours, except for the fact that 
> currently I am testing with a single network adapter.
>
> Replication works nicely when I start the databases manually, not using 
> corosync.
>
> When I try to start using corosync, I see that the ping resources start 
> normally, but the msPostgresql starts on both nodes in slave mode, and I see 
> "HS:alone".
>
> In the Wiki you state that if I start on a single node only, PSQL should 
> start in Master mode (PRI), but this is not the case.
>
> The recovery.conf file is created immediately, and from the logs I see no 
> attempt at all to promote the node.
> In the postgres logs I see that node1, which is supposed to be a master, 
> tries to connect to the vip-rep IP address, which is NOT brought up, because 
> it depends on the Master role...
>
> Do you have any idea?
>
>
> My environment:
> Debian Squeeze, with backported pacemaker (Version: 1.1.5) - the official 
> pacemaker in Debian is rather old and buggy
> Postgres 9.1, streaming replication, sync mode
> Node1: psql1, 10.12.1.21
> Node2: psql2, 10.12.1.22
>
> Crm config:
>
> node psql1 \
>        attributes standby="off"
> node psql2 \
>        attributes standby="off"
> primitive pingCheck ocf:pacemaker:ping \
>        params name="default_ping_set" host_list="10.12.1.1" multiplier="100" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="ignore"
> primitive postgresql ocf:heartbeat:pgsql \
>        params pgctl="/usr/lib/postgresql/9.1/bin/pg_ctl" psql="/usr/bin/psql" 
> pgdata="/var/lib/postgresql/9.1/main" 
> config="/etc/postgresql/9.1/main/postgresql.conf" 
> pgctldata="/usr/lib/postgresql/9.1/bin/pg_controldata" rep_mode="sync" 
> node_list="psql1 psql2" restore_command="cp 
> /var/lib/postgresql/9.1/main/pg_archive/%f %p" master_ip="10.12.1.28" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="7s" timeout="60s" on-fail="restart" \
>        op monitor interval="2s" role="Master" timeout="60s" on-fail="restart" 
> \
>        op promote interval="0s" timeout="60s" on-fail="restart" \
>        op demote interval="0s" timeout="60s" on-fail="block" \
>        op stop interval="0s" timeout="60s" on-fail="block" \
>        op notify interval="0s" timeout="60s"
> primitive vip-master ocf:heartbeat:IPaddr2 \
>        params ip="10.12.1.20" nic="eth0" cidr_netmask="24" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="block" \
>        meta target-role="Started"
> primitive vip-rep ocf:heartbeat:IPaddr2 \
>        params ip="10.12.1.28" nic="eth0" cidr_netmask="24" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="block" \
>        meta target-role="Started"
> primitive vip-slave ocf:heartbeat:IPaddr2 \
>        params ip="10.12.1.27" nic="eth0" cidr_netmask="24" \
>        meta resource-stickiness="1" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="block"
> group master-group vip-master vip-rep
> ms msPostgr

Re: [Pacemaker] Excessive migrate_from is run after migrate_to failed

2011-12-08 Thread Andrew Beekhof
On Thu, Dec 1, 2011 at 9:30 PM, Vladislav Bogdanov  wrote:
> Hi Andrew, all,
>
> I found that pacemaker runs migrate_from on a migration destination node
> even if preceding migrate_to command failed (github master).
>
> Is it intentional?

I think so, but I can see that it's not a good idea in all cases.

>
> hb_report?

A bug with the above description would be enough in this case

>
> Best,
> Vladislav
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Accessing GFS2 SAN drive, without Pacemaker?

2011-12-08 Thread Charles DeVoe
We have a three-node cluster and are going to run dedicated services on 
each box.  That is, one will be used for analysis, one for data collection, one 
for mysql.  We need to be able to access data on a shared SAN drive using 
iSCSI.  Two nodes are running Fedora 16 and one node Fedora 14.  The SAN is 
formatted using the GFS2 file system.

Can we run just cman to control the access or do we need pacemaker as well?  
Also, if we need pacemaker does the SAN volume need to be added as a resource?  
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] colocation issue with master-slave resources

2011-12-08 Thread Andrew Beekhof
On Tue, Nov 29, 2011 at 10:10 AM, Patrick H.  wrote:
> Upgraded to 1.1.6 and put in an ordering constraint, still no joy.

Could you file a bug and include a crm_report for this please?
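
Something like this should gather what's needed (the start time and the
output name are only examples):

  crm_report -f "2011-11-28 22:00" /tmp/stateful-colocation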

>
> # crm status
> 
> Last updated: Mon Nov 28 23:09:37 2011
> Last change: Mon Nov 28 23:08:34 2011 via cibadmin on devlvs03
>
> Stack: cman
> Current DC: devlvs03 - partition with quorum
> Version: 1.1.6-1.el6-b379478e0a66af52708f56d0302f50b6f13322bd
>
> 2 Nodes configured, 2 expected votes
> 5 Resources configured.
> 
>
> Online: [ devlvs04 devlvs03 ]
>
>  dummy    (ocf::pacemaker:Dummy):    Started devlvs03
>  Master/Slave Set: stateful1-ms [stateful1]
>     Masters: [ devlvs04 ]
>     Slaves: [ devlvs03 ]
>  Master/Slave Set: stateful2-ms [stateful2]
>     Masters: [ devlvs04 ]
>     Slaves: [ devlvs03 ]
>
>
> # crm configure show
> node devlvs03 \
>    attributes standby="off"
> node devlvs04 \
>    attributes standby="off"
> primitive dummy ocf:pacemaker:Dummy \
>    meta target-role="Started"
> primitive stateful1 ocf:pacemaker:Stateful
> primitive stateful2 ocf:pacemaker:Stateful
> ms stateful1-ms stateful1
> ms stateful2-ms stateful2
> colocation stateful1-colocation inf: stateful1-ms:Master dummy:Started
> colocation stateful2-colocation inf: stateful2-ms:Master dummy:Started
> order stateful1-start inf: dummy:start stateful1-ms:promote
> order stateful2-start inf: dummy:start stateful2-ms:promote
> property $id="cib-bootstrap-options" \
>    dc-version="1.1.6-1.el6-b379478e0a66af52708f56d0302f50b6f13322bd" \
>
>    cluster-infrastructure="cman" \
>    expected-quorum-votes="2" \
>    stonith-enabled="false" \
>    no-quorum-policy="ignore" \
>    last-lrm-refresh="1322450542"

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Make IP master

2011-12-08 Thread Andrew Beekhof
On Thu, Dec 8, 2011 at 6:34 AM, Charles DeVoe wrote:

> We are attempting to set up the cluster such that a user will be logged
> into the least busy node via ssh.  The configuration and crm_mon results
> are included here.  Is it possible to set this up such that doing an ssh to
> the cluster IP will put the user on one node or the other?  If so, what are
> we missing?  Thanks
>

Having them be logged into /a/ machine in this way isn't hard; just a basic
IP address (no cloning or master/slave) will do that.
Getting them onto the least busy machine might be tricky though; perhaps
there is a value for clusterip_hash that might give you something close.
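
A plain address for that would be just the primitive on its own - no ms
wrapper and no clusterip_hash - roughly:

  primitive ClusterIP ocf:heartbeat:IPaddr2 \
   params ip="10.1.18.100" cidr_netmask="32" \
   op monitor interval="30s"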



>
> CONFIGURATION
>
> node node1
> node node2
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
> params ip="10.1.18.100" cidr_netmask="32" clusterip_hash="sourceip" \
> op monitor interval="30s"
> primitive WebFS ocf:heartbeat:Filesystem \
> params device="/dev/sde" directory="/data/SAN-VOL4" fstype="gfs2"
> ms AlbertIp ClusterIP \
> meta master-max="2" master-node-max="2" clone-max="2"
> clone-node-max="1" notify="true"
> clone WedFSClone WebFS
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-4.fc16-89678d4947c5bd466e2f31acd58ea4e1edb854d5" \
> cluster-infrastructure="cman" \
> stonith-enabled="false" \
> last-lrm-refresh="1323285086"
>
> crm_mon Yields
> 
> Last updated: Wed Dec  7 13:48:42 2011
> Last change: Wed Dec  7 13:48:06 2011 via cibadmin on node1
> Stack: cman
> Current DC: node2 - partition with quorum
> Version: 1.1.6-4.fc16-89678d4947c5bd466e2f31acd58ea4e1edb854d5
> 2 Nodes configured, unknown expected votes
> 4 Resources configured.
> 
>
> Online: [ node1 node2 ]
>
>  Clone Set: WedFSClone [WebFS]
>  Started: [ node2 node1 ]
>  Master/Slave Set: AlbertIp [ClusterIP]
>  Slaves: [ node1 node2 ]
>
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] don't want to restart clone resource

2011-12-08 Thread Andrew Beekhof
Can you file a bug and attach a crm_report to it please?
Unfortunately there's not enough information here to figure out the
cause (although it does look like a bug)

2011/12/1 Sha Fanghao :
> Hi,
>
>
>
> I have a cluster 3 nodes (CentOS 5.2) using pacemaker-1.0.11(also 1.0.12),
> with heartbeat-3.0.3.
>
> You can see the configuration:
>
>
>
> #crm configure show:
>
> node $id="85e0ca02-7aa4-45c8-9911-4035e1e6ee15" node-2
>
> node $id="a046bd1e-6267-49e5-902d-c87b6ed1dcb9" node-0
>
> node $id="d0f0b2ab-f243-4f78-b541-314fa7d6b346" node-1
>
> primitive failover-ip ocf:heartbeat:IPaddr2 \
>
>     params ip="10.10.5.83" \
>
>     op monitor interval="5s"
>
> primitive master-app-rsc lsb:cluster-master \
>
>     op monitor interval="5s"
>
> primitive node-app-rsc lsb:cluster-node \
>
>     op monitor interval="5s"
>
> group group-dc failover-ip master-app-rsc
>
> clone clone-node-app-rsc node-app-rsc
>
> location rule-group-dc group-dc \
>
>     rule $id="rule-group-dc-rule" -inf: #is_dc eq false
>
> property $id="cib-bootstrap-options" \
>
>     start-failure-is-fatal="false" \
>
>     no-quorum-policy="ignore" \
>
>     symmetric-cluster="true" \
>
>     stonith-enabled="false" \
>
>     dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
>
>     cluster-infrastructure="Heartbeat"
>
>
>
> #crm_mon -n -1:
>
> 
>
> Last updated: Sat Oct 29 08:44:14 2011
>
> Stack: Heartbeat
>
> Current DC: node-0 (a046bd1e-6267-49e5-902d-c87b6ed1dcb9) - partition with
> quorum
>
> Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
>
> 3 Nodes configured, unknown expected votes
>
> 2 Resources configured.
>
> 
>
>
>
> Node node-0 (a046bd1e-6267-49e5-902d-c87b6ed1dcb9): online
>
>     master-app-rsc  (lsb:cluster-master) Started
>
>     failover-ip (ocf::heartbeat:IPaddr2) Started
>
>     node-app-rsc:0  (lsb:cluster-node) Started
>
> Node node-1 (d0f0b2ab-f243-4f78-b541-314fa7d6b346): online
>
>     node-app-rsc:1  (lsb:cluster-node) Started
>
> Node node-2 (85e0ca02-7aa4-45c8-9911-4035e1e6ee15): online
>
>     node-app-rsc:2  (lsb:cluster-node) Started
>
>
>
>
>
> The problem:
>
> After stopping heartbeat service on node-1, if I remove node-1 with command
> "hb_delnode node-1 && crm node delete node-1", then
>
> the clone resource(node-app-rsc:2) running on the node-2 will restart and
> change to "node-app-rsc:1".
>
> You know, the node-app-rsc is my application, and I don't want it to
> restart.
>
> How could I do, Please?
>
>
>
> Any help will be very appreciated. :)
>
>
>
>
>
> Best Regards,
>
>  Fanghao Sha
>
>
>
>
>
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] are stopped resources monitored?

2011-12-08 Thread Andrew Beekhof
On Wed, Nov 30, 2011 at 1:26 PM, James Harper
 wrote:
>> >
>> > That thread goes around in circles and completely contradicts what
> I'm
>> > seeing. What I'm seeing is that unmanaged resources are never
> monitored.
>>
>> would be strange and how do you verify this? A look at your config may
> also
>> help to shed some light on this ...
>>
>
> The relevant portions of the config are:
>
> primitive p_xen_smtp2 ocf:heartbeat:Xen \
>        params name=" smtp2" xmfile="/configs/xen/smtp2" \
>        op start interval="0" timeout="60s" \
>        op stop interval="0" timeout="300s" \
>        op migrate_from interval="0" timeout="300s" \
>        op migrate_to interval="0" timeout="300s" \
>        op monitor interval="10s" timeout="30s" \
>        meta allow-migrate="true"
>
> property $id="cib-bootstrap-options" \
>        dc-version="1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04" \
>        cluster-infrastructure="openais" \
>        expected-quorum-votes="2" \
>        stonith-enabled="false" \
>        no-quorum-policy="ignore" \
>        last-lrm-refresh="1322100376"
> rsc_defaults $id="rsc-options" \
>        resource-stickiness="200"
>
> I just tested the following (it actually contradicts some of my previous
> statements... but I'm including it anyway as it wasn't what I expected):
>
> . VM is running on node bitvs6 as a managed resource
> . I type "crm resource unmanage p_xen_smtp2"
> . crm status is "Started bitvs6 (unmanaged)"
> . I manually stop the VM outside crm
> . A few seconds later, the status is " Started bitvs6 (unmanaged)
> FAILED" with a failed action " p_xen_smtp2_monitor_1 (node=bitvs6,
> call=70, rc=7, status=complete): not running"... so okay... it did
> monitor a managed and _running_ resource, even though it resulted in an
> error

So far so good.

> . I type "crm resource cleanup p_xen_smtp2"

What for?
This has the side effect of stopping any recurring monitor action that
was running.

> . hangs for ages at "Waiting for 3 replies from the CRMd.No messages
> received in 60 seconds.." then finally says "aborting"
> . I type "crm resource stop p_xen_smtp2"
> . hangs for a bit then says " Call cib_replace failed (-41): Remote node
> did not respond"

That doesn't look good at all.
At a guess, it seems like something crashed.  If you want to file a
bug and attach a crm_report I'll take a look.

>
> Any further attempt to do anything with this resource just hangs...
> maybe the Xen RA monitor script is broken? I can only fix it by starting
> the VM manually so that the actual status matches crm's expected
> resource status.
>
> So starting again to demonstrate the problem:
> . VM is running on node bitvs6 as a managed resource
> . I type "crm resource stop p_xen_smtp2"
> . VM shuts down as expected
> . I type "crm resource unmanage p_xen_smtp2"
> . I manually start the VM outside of crm
> . crm _never_ notices that the resource is started unless I do something
> like "crm resource cleanup p_xen_smtp2" to manually cause the monitoring
> script to be run

The 1.1.x series will detect this if you specify a recurring monitor
with role=Stopped, but it's not the default behaviour because, well,
"don't do that".

>
> Now the above is all about unmanaged resources, but this VM is one I
> could rebuild easily enough so now I'm going to get tricky:
>
> . VM is running on node bitvs6 as a managed resource
> . I type "crm resource stop p_xen_smtp2"
> . VM shuts down as expected
> . I manually start the VM outside of crm
> . crm still _never_ notices that the resource is started unless I do
> something like "crm resource cleanup p_xen_smtp2" to manually cause the
> monitoring script to be run

As above.

>
> This really is unexpected behaviour... starting the resource in crm
> causes the right things to happen (notices that the resource is running)
> but I still expected that a stopped resource would be monitored...

No, not by default.
There should be only one point of control; you're creating an internal
split-brain by telling the cluster to control the resource AND doing
so yourself in parallel.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] (no subject)

2011-12-08 Thread Charles DeVoe
Oh good, the infamous system setting. Great, I always love chasing these 
things down. Thanks for the help.

--- On Wed, 12/7/11, Andrew Beekhof  wrote:

From: Andrew Beekhof 
Subject: Re: [Pacemaker] (no subject)
To: "The Pacemaker cluster resource manager" 
Date: Wednesday, December 7, 2011, 10:03 PM

Seems to work here...

[root@pcmk-4 ~]# service corosync start
Starting corosync (via systemctl):                         [  OK  ]
[root@pcmk-4 ~]# systemctl start pacemaker.service
[root@pcmk-4 ~]# ps axf
  PID TTY      STAT   TIME COMMAND
    2 ?        S      0:00 [kthreadd]
...
 3513 ?        Ssl    0:00 corosync
 3525 ?        S      0:00 /usr/sbin/pacemakerd
 3528 ?        Ss     0:00  \_ /usr/lib64/heartbeat/stonithd
 3529 ?        Ss     0:00  \_ /usr/lib64/heartbeat/cib
 3530 ?        Ss     0:00  \_ /usr/lib64/heartbeat/lrmd
 3531 ?        Ss     0:00  \_ /usr/lib64/heartbeat/attrd
 3532 ?        Ss     0:00  \_ /usr/lib64/heartbeat/pengine
 3533 ?        Ss     0:00  \_ /usr/lib64/heartbeat/crmd
[root@pcmk-4 ~]# cat /etc/fedora-release
Fedora release 16 (Verne)


On Tue, Dec 6, 2011 at 8:29 AM, Charles DeVoe  wrote:

Running Fedora 16.  When doing a systemctl start pacemaker.service I get the 
following error.

[root@node2 ]# systemctl status pacemaker.service

pacemaker.service - Pacemaker High Availability Cluster Manager
      Loaded: loaded (/lib/systemd/system/pacemaker.service; enabled)
      Active: failed since Mon, 05 Dec 2011 15:37:16 -0500; 2min 28s ago
     Process: 1947 ExecStart=/usr/sbin/pacemakerd (code=exited, 
status=200/CHDIR)

      CGroup: name=systemd:/system/pacemaker.service

However, if I change to /etc/init.d and then enter ./pacemaker it starts.  

Any idea why?

___


Pacemaker mailing list: Pacemaker@oss.clusterlabs.org

http://oss.clusterlabs.org/mailman/listinfo/pacemaker



Project Home: http://www.clusterlabs.org

Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-08 Thread Attila Megyeri
Hi Takatoshi,

One strange thing I noticed and could probably be improved.
When there is data inconsistency, I have the following node properties:

* Node psql2:
+ default_ping_set  : 100
+ master-postgresql:1   : -INFINITY
+ pgsql-data-status : DISCONNECT
+ pgsql-status  : HS:alone
* Node psql1:
+ default_ping_set  : 100
+ master-postgresql:0   : 1000
+ master-postgresql:1   : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 58:4B20
+ pgsql-status  : PRI

This is fine, and understandable - but I can see this only if I do a crm_mon -A.

My problem is, that CRM shows the following:

Master/Slave Set: db-ms-psql [postgresql]
 Masters: [ psql1 ]
 Slaves: [ psql2 ]

So if I monitor the system from crm_mon, HAWK or other tools - I have no 
indication at all that the slave is running in an inconsistent mode.

I would expect the RA to stop the psql2 node in such cases, because:
- It is running, but has non-up-to-date data, therefore no one will use it (the 
slave IP points to the master as well, which is good)
- In CRM status everything looks perfect, even though it is NOT perfect and 
admin intervention is required.


Shouldn't the disconnected PSQL server be stopped instead?

Regards,
Attila




-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 28. 11:10
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/28 Attila Megyeri :
> Hi Takatoshi,
>
> I understand your point and I agree that the correct behavior is not to start 
> replication when data consistency exists.
> The only thing I do not really understand is how it could have happened:
>
> 1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
> 2) I shut down node psql1 (by placing it into standby)
> 3) At this moment psql1's baseline became higher by 20?  What could cause 
> this? Probably the demote operation itself? There were no clients connected - 
> and there was definitively no write operation to the db (except if not from 
> system side).

Yes, PostgreSQL executes a CHECKPOINT when it is shut down normally on demote.

> On the other hand - thank you very much for your contribution, the RA works 
> very well and I really appreciate your work and help!

Not at all. Don't mention it.

Regards,
Takatoshi MATSUO


> Bests,
>
> Attil
>
> -Original Message-
> From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
> Sent: 2011. november 28. 2:10
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover -
> RA needed
>
> Hi Attila
>
> The primary cannot send all WALs to the HotStandby even when it is shut
> down normally.
> These logs validate it.
>
>> Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and
>> Checkpoint : 14:2320 Nov 27 16:03:27 psql1 pgsql[12204]:
>> INFO: psql2 master baseline : 14:2300
>
> psql1's location was  "2320" when it was demoted.
> OTOH psql2's location was "2300"  when it was promoted.
>
> It means that psql1's data was newer than psql2's one at that time.
> The gap is 20.
>
> As you said you can start psql1's PostgreSQL manually, but PostgreSQL can't 
> realize this occurrence.
> If you start HotStandby at psql1, data is replicated after 2320.
> It's inconsistency.
>
> Thanks,
> Takatoshi MATSUO
>
>
> 2011/11/28 Attila Megyeri :
>> Hi Takatoshi,
>>
>> I don't think it is inconsistency problem - for me it looks like some RA bug.
>> I think so, because postgres starts properly outside pacemaker.
>>
>> When pacemaker starts node psql1 I see only:
>>
>> postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete):
>> unknown error
>>
>> and the postgres log is empty - so I suppose that it does not even try to 
>> start it.
>>
>> What I tested was:
>> - I had a stable cluster, where psql1 was the master, psql2 was the
>> slave
>> - I put psql1 into standby mode. ("node psql1 standby") to test
>> failover
>> - After a while psql2 became the PRI, which is very good
>> - When I put psql1 back online, postgres wouldn't start anymore from 
>> pacemaker (unknown error).
>>
>>
>> I tried to start postgres manually from the shell it worked fine, even the 
>> monitor was able to see that it became in SYNC (obviously the master/slave 
>> group was showing improper state as psql was started outside pacemaker.
>>
>> I don't think data inconsistency is the case, partially because there are no 
>> clients connected, partially because psql starts properly outside pacemaker.
>>
>> Here is what is relevant from the log:
>>
>> Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a 
>> primary.
>> Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2