Re: [Pacemaker] Problems with SBD fencing

2013-08-19 Thread Angel L. Mateo

On 06/08/13 13:49, Jan Christian Kaldestad wrote:

In my case this does not work - read my original post. So I wonder if
there is a pacemaker bug (version 1.1.9-2db99f1). Killing pengine and
stonithd on the node which is supposed to "shoot" seems to resolve the
problem, though this is not a solution of course.

I also tested two separate stonith resources, one on each node. The
stonith'ing works fine with this configuration. Is there something
"wrong" about doing it this way?


To get it working for me (Ubuntu 12.04), I had to create the /etc/sysconfig/sbd file with:

SBD_DEVICE="/dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1"
SBD_OPTS="-W"

and the resource configuration is

primitive stonith_sbd stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1" \
    meta target-role="Started"

	Where /dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1 is 
my disk device.
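
	As a sanity check outside of pacemaker (a hedged sketch - assuming the
standard sbd tool and an already-initialized device), you can list the
slots and send a harmless test message to the peer:

sbd -d /dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1 list
sbd -d /dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1 message <peer-node> test

	(replace <peer-node> with the other node's name).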


--
Angel L. Mateo Martínez
Sección de Telemática
Área de Tecnologías de la Información
y las Comunicaciones Aplicadas (ATICA)
http://www.um.es/atica
Tfo: 868889150
Fax: 86337

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Problems with SBD fencing

2013-08-19 Thread Angel L. Mateo

On 06/08/13 13:49, Jan Christian Kaldestad wrote:

In my case this does not work - read my original post. So I wonder if
there is a pacemaker bug (version 1.1.9-2db99f1). Killing pengine and
stonithd on the node which is supposed to "shoot" seems to resolve the
problem, though this is not a solution of course.

I also tested two separate stonith resources, one on each node. The
stonith'ing works fine with this configuration. Is there something
"wrong" about doing it this way?


	Are you sure you have the property stonith-enabled="true" set?
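
	If in doubt, it can be checked and set from the crm shell, e.g.:

crm configure show | grep stonith-enabled
crm configure property stonith-enabled="true"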

--
Angel L. Mateo Martínez
Sección de Telemática
Área de Tecnologías de la Información
y las Comunicaciones Aplicadas (ATICA)
http://www.um.es/atica
Tfo: 868889150
Fax: 86337

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] postgresql failover

2013-08-19 Thread Takatoshi MATSUO
Hi Gregg Jaskiewicz

2013/8/16 Gregg Jaskiewicz :
> Running rsync -avzPc -e 'ssh -o UserKnownHostsFile=/dev/null'
> --delete-during 10.0.1.100:/var/lib/pgsql/9.2/data/pg_archive
> /var/lib/pgsql/9.2/data/
> on each slave fixes it - but question then is - why cannot this be done
> automatically by RA ?

I think it is overkill for the RA to run rsync itself.
In addition, it may cause the monitor operation to time out.

>
> Andrew on irc suggested I use restart_on_promote, but I have a feeling this
> can be done without restarting anything - however the RA itself would have
> to be fixed, and I can't do it myself to propose a fix, or submit a patch.

I recommend using the restart_on_promote parameter too, because the
timeline ID is incremented when promote is called.

If you use restart_on_promote="true", slaves may be able to connect to
the new master without rsync.
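
Just for illustration, a minimal sketch of a pgsql master/slave primitive
with that parameter set (paths and node names are placeholders, and the
master IP is the one from your rsync command assuming it is the master
VIP - adjust to your environment):

primitive pgsql ocf:heartbeat:pgsql \
    params pgctl="/usr/pgsql-9.2/bin/pg_ctl" psql="/usr/pgsql-9.2/bin/psql" \
        pgdata="/var/lib/pgsql/9.2/data" rep_mode="sync" \
        node_list="node1 node2" master_ip="10.0.1.100" \
        restart_on_promote="true" \
    op monitor interval="10s" role="Slave" timeout="60s" \
    op monitor interval="9s" role="Master" timeout="60s"
ms ms_pgsql pgsql \
    meta master-max="1" master-node-max="1" clone-max="2" notify="true"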

Thanks,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] cibsecret not found

2013-08-19 Thread Gao,Yan
On 08/19/13 21:20, Халезов Иван wrote:
> Hi All!
> 
> According to crm documentation
> (http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.manual_config.html#sec.ha.config.crm.setpwd)
> 
> I am trying to setup secret password parameter for my resource:
> 
> [root@server]# crm resource secret  Journal set password xx
> /bin/sh: cibsecret: command not found
> 
> I use pacemaker 1.1.9.
> If there is no cibsecret command, what is the right way to store
> passwords in the configuration?
You have to configure Pacemaker with "--with-cibsecrets" when building it.
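
A minimal sketch of the two steps (assuming a source build; the resource
and parameter names are the ones from your example):

./configure --with-cibsecrets
make && make install

# afterwards the value is kept outside the CIB:
cibsecret set Journal password xx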

Regards,
  Gao,Yan
-- 
Gao,Yan 
Software Engineer
China Server Team, SUSE.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] error: qb_ipcs_us_connection_acceptor: Could not accept client connection: Too many open files (24)

2013-08-19 Thread David Vossel




- Original Message -
> From: "Nikola Ciprich" 
> To: pacemaker@oss.clusterlabs.org
> Sent: Tuesday, August 6, 2013 5:13:02 AM
> Subject: [Pacemaker] error: qb_ipcs_us_connection_acceptor: Could not accept 
> client connection: Too many open files
> (24)
> 
> Hi,
> 
> I'd like to ask whether somebody met similar bug:
> 
> On one of the test two node clusters, node suddenly hung, and cib started
> spawning following messages:
> 
> error: qb_ipcs_us_connection_acceptor: Could not accept client connection:
> Too many open files (24)
> 
> in lsof, I see over thousand of opened /dev/shm files:

What version of libqb do you have installed? If you can, upgrade libqb.
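
For reference, a quick way to check (hedged - the package name assumes an
RPM-based system like yours):

rpm -q libqb
# count the file descriptors the cib daemon currently holds
ls /proc/$(pidof cib)/fd | wc -l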

-- Vossel

> 
> cib  5737 hacluster  DEL   REG   0,14  2615869 /dev/shm/qb-cib_rw-control-5737-25733-179
> cib  5737 hacluster  DEL   REG   0,14  2545021 /dev/shm/qb-cib_rw-control-5737-4605-178
> cib  5737 hacluster  DEL   REG   0,14  2410274 /dev/shm/qb-cib_rw-control-5737-1925-180
> cib  5737 hacluster  DEL   REG   0,14  2545640 /dev/shm/qb-cib_rw-control-5737-8828-177
> cib  5737 hacluster  DEL   REG   0,14  2495467 /dev/shm/qb-cib_rw-control-5737-2054-174
> cib  5737 hacluster  DEL   REG   0,14  2434602 /dev/shm/qb-cib_rw-control-5737-8659-176
> 
> 
> and also sockets:
> 
> cib  5737 hacluster 1003u unix 0x880eaefee000  0t0 13885836 socket
> cib  5737 hacluster 1004u unix 0x880eada76000  0t0 13849634 socket
> cib  5737 hacluster 1005u unix 0x880eb37e7400  0t0 13847814 socket
> cib  5737 hacluster 1006u unix 0x88099c120400  0t0 13866356 socket
> cib  5737 hacluster 1007u unix 0x880eb7764000  0t0 13911546 socket
> cib  5737 hacluster 1008u unix 0x880a7f579400  0t0 13847938 socket
> cib  5737 hacluster 1009u unix 0x880a7f57e000  0t0 1388 socket
> 
> OS is latest centos6 (RHEL6 clone), running x86_64 3.0.87 kernel
> 
> another important packages:
> 
> pacemaker-1.1.8-7.el6.x86_64
> cluster-glue-1.0.5-6.el6.x86_64
> clusterlib-3.0.12.1-49.el6.x86_64
> 
> Any idea on what this could be? Is this some known bug?
> 
> with best regards
> 
> nik
> 
> 
> 
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
> 
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
> 
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb

2013-08-19 Thread Vladislav Bogdanov
16.08.2013 16:04, Elmar Marschke wrote:
> Hi all,
> 
> i'm working on a two node pacemaker cluster with dual primary drbd and
> ocfs2.
> 
> Dual pri drbd and ocfs2 WITHOUT pacemaker work fine (mounting, reading,
> writing, everything...).

ocfs2 uses its own clustering stack by default.

> 
> When i try to make this work in pacemaker, there seems to be a problem
> to start the o2cb resource.
> 
> My (already simplified) configuration is:
> -
> node poc1 \
> attributes standby="off"
> node poc2 \
> attributes standby="off"
> primitive res_dlm ocf:pacemaker:controld \
> op monitor interval="120"
> primitive res_drbd ocf:linbit:drbd \
> params drbd_resource="r0" \
> op stop interval="0" timeout="100" \
> op start interval="0" timeout="240" \
> op promote interval="0" timeout="90" \
> op demote interval="0" timeout="90" \
> op notifiy interval="0" timeout="90" \
> op monitor interval="40" role="Slave" timeout="20" \
> op monitor interval="20" role="Master" timeout="20"
> primitive res_o2cb ocf:pacemaker:o2cb \
> op monitor interval="60"
> ms ms_drbd res_drbd \
> meta notify="true" master-max="2" master-node-max="1"
> target-role="Started"
> property $id="cib-bootstrap-options" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> last-lrm-refresh="1376574860"

Side note: you need to run both dlm and o2cb as clones, and group them
(either with "group" or with a pair of colocation/order statements), so
that ocfs2_controld is started only once dlm_controld is already running.
You probably tried that already, but do not forget that last part.
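
A minimal sketch with the resource names from your configuration
(constraint names and scores here are only illustrative):

group grp_dlm_o2cb res_dlm res_o2cb
clone cl_dlm_o2cb grp_dlm_o2cb meta interleave="true"
colocation col_o2cb_on_drbd_master inf: cl_dlm_o2cb ms_drbd:Master
order ord_drbd_before_o2cb inf: ms_drbd:promote cl_dlm_o2cb:start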

> 
> 
> First error message in corosync.log as far as i can identify it:
> 
> lrmd: [5547]: info: RA output: (res_dlm:probe:stderr) dlm_controld.pcmk:
> no process found
> [ other stuff ]
> lrmd: [5547]: info: RA output: (res_dlm:start:stderr) dlm_controld.pcmk:
> no process found
> [ other stuff ]
>  lrmd: [5547]: info: RA output: (res_o2cb:start:stderr)
> 2013/08/16_13:25:20 ERROR: ocfs2_controld.pcmk did not come up
> 
> (
> You can find the whole corosync logfile (starting corosync on node 1
> from beginning until after starting of resources) on:
> http://www.marschke.info/corosync_drei.log
> )
> 
> syslog shows:
> -
> ocfs2_controld.pcmk[5774]: Unable to connect to CKPT: Object does not exist

How exactly did you start the corosync process - as "corosync" or as
"openais"? The background is that the CKPT service is not loaded by
corosync by default; it is loaded only when the stack is started via the
openais init script. You may want to look at that script for details.
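
A hedged sketch of how to check and switch (init script names vary by
distribution):

ls /etc/init.d/ | egrep 'corosync|openais'
# if you have been starting plain corosync, start the stack through the
# openais init script instead, so the additional openais services
# (including CKPT) get loaded:
/etc/init.d/corosync stop
/etc/init.d/openais start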

> 
> 
> Output of crm_mon:
> --
> 
> Stack: openais
> Current DC: poc1 - partition WITHOUT quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> 
> 
> Online: [ poc1 ]
> OFFLINE: [ poc2 ]
> 
>  Master/Slave Set: ms_drbd [res_drbd]
>  Masters: [ poc1 ]
>  Stopped: [ res_drbd:1 ]
>  res_dlm(ocf::pacemaker:controld):Started poc1
> 
> Migration summary:
> * Node poc1:
>res_o2cb: migration-threshold=100 fail-count=100
> 
> Failed actions:
> res_o2cb_start_0 (node=poc1, call=6, rc=1, status=complete): unknown
> error
> 
> -
> This is the situation after a reboot of node poc1. For simplification i
> left pacemaker / corosync unstarted on the second node, and already
> removed a group and a clone resource where dlm and o2cb already had been
> in (errors were there also).
> 
> Is my configuration of the resource agents correct?
> I checked using "ra meta ...", but as far as i recognized everything is ok.
> 
> Is some piece of software missing?
> dlm-pcmk is installed, ocfs2_controld.pcmk and dlm_controld.pcmk are
> available, i even did additional links in /usr/sbin:
> root@poc1:~# which ocfs2_controld.pcmk
> /usr/sbin/ocfs2_controld.pcmk
> root@poc1:~# which dlm_controld.pcmk
> /usr/sbin/dlm_controld.pcmk
> root@poc1:~#
> 
> I already googled but couldn't find any useful. Thanks for any hints...:)
> 
> kind regards
> elmar
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb

2013-08-19 Thread Jake Smith
> -Original Message-
> From: Elmar Marschke [mailto:elmar.marsc...@schenker.at]
> Sent: Friday, August 16, 2013 10:31 PM
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
> 
> 
> On 16.08.2013 15:46, Jake Smith wrote:
> >> -Original Message-
> >> From: Elmar Marschke [mailto:elmar.marsc...@schenker.at]
> >> Sent: Friday, August 16, 2013 9:05 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: [Pacemaker] Dual primary drbd + ocfs2: problems starting
> >> o2cb
> >>
> >> Hi all,
> >>
> >> i'm working on a two node pacemaker cluster with dual primary drbd
> >> and ocfs2.
> >>
> >> Dual pri drbd and ocfs2 WITHOUT pacemaker work fine (mounting,
> >> reading, writing, everything...).
> >>
> >> When i try to make this work in pacemaker, there seems to be a
> >> problem to start the o2cb resource.
> >>
> >> My (already simplified) configuration is:
> >> -
> >> node poc1 \
> >>attributes standby="off"
> >> node poc2 \
> >>attributes standby="off"
> >> primitive res_dlm ocf:pacemaker:controld \
> >>op monitor interval="120"
> >> primitive res_drbd ocf:linbit:drbd \
> >>params drbd_resource="r0" \
> >>op stop interval="0" timeout="100" \
> >>op start interval="0" timeout="240" \
> >>op promote interval="0" timeout="90" \
> >>op demote interval="0" timeout="90" \
> >>op notifiy interval="0" timeout="90" \
> >>op monitor interval="40" role="Slave" timeout="20" \
> >>op monitor interval="20" role="Master" timeout="20"
> >> primitive res_o2cb ocf:pacemaker:o2cb \
> >>op monitor interval="60"
> >> ms ms_drbd res_drbd \
> >>meta notify="true" master-max="2" master-node-max="1" target-
> >> role="Started"
> >> property $id="cib-bootstrap-options" \
> >>no-quorum-policy="ignore" \
> >>stonith-enabled="false" \
> >>dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> >>cluster-infrastructure="openais" \
> >>expected-quorum-votes="2" \
> >>last-lrm-refresh="1376574860"
> >>
> >
> > Looks like you are missing ordering and colocation and clone (even
> > group to make it a shorter config; group = order and colocation in one
> > statement) statements.  The resources *must* start in a particular
> > order and they must run on the same node and there must be an instance
> > of each resource on each node.
> >
> > More here for DRBD 8.4:
> > http://www.drbd.org/users-guide/s-ocfs2-pacemaker.html
> > Or DRBD 8.3:
> > http://www.drbd.org/users-guide-8.3/s-ocfs2-pacemaker.html
> >
> > Basically add:
> > Group grp_dlm_o2cb res_dlm res_o2cb
> > Clone cl_dlm_o2cb grp_dlm_o2cb meta interleave=true
> > Order ord_drbd_then_dlm_o2cb res_drbd:promote cl_dlm_o2cb:start
> > Colocation col_dlm_o2cb_with_drbdmaster cl_dlm_o2cb res_drbd:Master
> >
> > HTH
> >
> > Jake
> >
> 
> Hello Jake,
> 
> thanks for your reply. I already had res_dlm and res_o2cb grouped together
> and cloned like in your advice; indeed this was my initial configuration.
> But the problem showed up, so i tried to simplify the configuration to
> reduce possible error sources.
> 
> But now it seems i found a solution; or at least a workaround: i just use
> the LSB resource agent lsb:o2cb. This one works! The resource starts
> without a problem on both nodes and as far as i can see right now
> everything is fine (tried with and without additional group and clone
> resource).
> 
> Don't know if this will bring some drawbacks in the future; but for the
> moment my problem seems to be solved.

Not sure either - usually OCF resource agents are more robust than simple
LSB scripts. I would also verify that the o2cb LSB script is fully LSB
compliant, or your cluster will have issues.

> 
> Currently it seems to me that there's a subtle problem with the
> ocf:pacemaker:o2cb resource agent; at least on my system.

Maybe, maybe not - if you take a look at the o2cb resource agent, the error
message you were getting is emitted after it has tried to start
/usr/sbin/ocfs2_controld.pcmk for 10 seconds without success... I would
time how long starting o2cb takes by hand. It might be as simple as
allowing more time for startup of the daemon.
I've not set up ocfs2 in a while, but I believe you may be able to extend
that timeout on the primitive without having to muck with the actual
resource agent.
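
A hedged sketch (the timeout value is a guess - time the manual start
first, and whether the agent itself waits longer than its internal 10
seconds depends on the agent version):

primitive res_o2cb ocf:pacemaker:o2cb \
    op start interval="0" timeout="90" \
    op monitor interval="60"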

Jake

> 
> Anyway, thanks a lot for your answer..!
> Best regards
> elmar
> 
> 
> >
> >> First error message in corosync.log as far as i can identify it:
> >> 
> >> lrmd: [5547]: info: RA output: (res_dlm:probe:stderr) dlm_controld.pcmk:
> >> no process found
> >> [ other stuff ]
> >> lrmd: [5547]: info: RA output: (res_dlm:start:stderr) dlm_controld.pcmk:
> >> no process found
> >> [ other stuff ]
> >> lrmd: [5547]: info: RA output: (res_o2cb:start:stderr)
> >> 2013/08/16_13:25:20 ERROR: ocfs2_controld.pcmk did not come up
> >>
> >> (
> >> You can find the whole 

[Pacemaker] cibsecret not found

2013-08-19 Thread Халезов Иван

Hi All!

According to crm documentation 
(http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.manual_config.html#sec.ha.config.crm.setpwd)

I am trying to setup secret password parameter for my resource:

[root@server]# crm resource secret  Journal set password xx
/bin/sh: cibsecret: command not found

I use pacemaker 1.1.9.
If there is no cibsecret command, what is the right way to store 
passwords in the configuration?


Thank you in advance
Ivan Khalezov.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] PostgreSQL failed to stop after streaming replication established

2013-08-19 Thread Mistina Michal
Dear community.

 

The scenario of the redundant environment in a rough "graphic" representation:

                         +-----+
                         | WAN |
                         +--+--+
                            |
             +--------------+--------------+
             |                             |
   +----------+----------+      +----------+----------+
   |  pgsql   |  pgsql   |      |  pgsql   |  pgsql   |
   +----------+----------+      +----------+----------+
   | drbd-pri | drbd-sec |      | drbd-pri | drbd-sec |
   +----------+----------+      +----------+----------+
   |      pacemaker      |      |      pacemaker      |
   +---------------------+      +---------------------+
   |      corosync       |      |      corosync       |
   +---------------------+      +---------------------+
   |  node1   |  node2   |      |  node1   |  node2   |
   +----------+----------+      +----------+----------+
            TC1                          TC2

 

Within each technical center everything worked fine when migrating
resources between nodes.

Then I set up streaming replication from TC1 to TC2.

Now migration from one node to another fails: the Pacemaker stop operation
for the postgres resource FAILED. PostgreSQL itself was stopped, but
postmaster.pid was left behind corrupted.

This is where I ended up:

I am unable to stop the PostgreSQL service correctly on TC1 (the streaming
replication master). After issuing /etc/init.d/postgresql-9.2 stop,
postmaster.pid remains on the filesystem and, moreover, it is corrupted. I
am unable to delete it with the rm command.

 

It looks like this:

[root@pcmk1 ~]# ll /var/lib/pgsql/9.2/data/
ls: cannot access /var/lib/pgsql/9.2/data/postmaster.pid: No such file or directory
total 56
drwx------ 7 postgres postgres    62 Jun 26 17:13 base
drwx------ 2 postgres postgres  4096 Aug 18 00:25 global
drwx------ 2 postgres postgres    17 Jun 26 09:54 pg_clog
-rw------- 1 postgres postgres  5127 Aug 17 16:24 pg_hba.conf
-rw------- 1 postgres postgres  1636 Jun 26 09:54 pg_ident.conf
drwx------ 2 postgres postgres  4096 Jul  2 00:00 pg_log
drwx------ 4 postgres postgres    34 Jun 26 09:53 pg_multixact
drwx------ 2 postgres postgres    17 Aug 18 00:23 pg_notify
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_serial
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_snapshots
drwx------ 2 postgres postgres     6 Aug 18 00:25 pg_stat_tmp
drwx------ 2 postgres postgres    17 Jun 26 09:54 pg_subtrans
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_tblspc
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_twophase
-rw------- 1 postgres postgres     4 Jun 26 09:53 PG_VERSION
drwx------ 3 postgres postgres  4096 Aug 18 00:25 pg_xlog
-rw------- 1 postgres postgres 19884 Aug 17 22:54 postgresql.conf
-rw------- 1 postgres postgres    71 Aug 18 00:23 postmaster.opts
?????????? ? ?        ?            ?            ? postmaster.pid
-rw-r--r-- 1 postgres postgres   491 Aug 17 16:33 recovery.done

 

I don't know whether the resource agent did something wrong while
Pacemaker tried to stop postgres, or whether PostgreSQL itself is the
component that failed to stop correctly. What do you think? Has somebody
experienced a problem like this?
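
(A hedged diagnostic sketch for this kind of state, assuming the
xfs-on-LVM layout from the configuration below; run it only with the
filesystem unmounted on that node:)

# look for filesystem/DRBD errors around the failed stop
dmesg | egrep -i 'xfs|drbd' | tail -n 50
# read-only filesystem check, device path as in the configuration below
umount /var/lib/pgsql/9.2/data
xfs_repair -n /dev/vg_local-lv_pgsql/lv_pgsql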

 

I am using:

-  pacemaker-1.1.7-6
-  corosync-1.4.1-7
-  resource-agents-3.9.2-12
-  drbd-8.4.3-2

 

CONFIGURATION

[root@pcmk2 9.2]# crm configure show
node pcmk1 \
    attributes standby="off"
node pcmk2 \
    attributes standby="off"
primitive drbd_pg ocf:linbit:drbd \
    params drbd_resource="postgres" \
    op monitor interval="15" role="Master" \
    op monitor interval="16" role="Slave" \
    op start interval="0" timeout="240" \
    op stop interval="0" timeout="120"
primitive pg_fs ocf:heartbeat:Filesystem \
    params device="/dev/vg_local-lv_pgsql/lv_pgsql" directory="/var/lib/pgsql/9.2/data" options="noatime,nodiratime" fstype="xfs" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="120"
primitive pg_lsb lsb:postgresql-9.2 \
    op monitor interval="30" timeout="60" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60"
primitive pg_lvm ocf:heartbeat:LVM \
    params volgrpname="vg_local-lv_pgsql" \
    op start interval="0" timeout="30" \
    op stop interval="0" timeout="30"
primitive pg_vip ocf:heartbeat:IPaddr2 \
    params ip="x.x.x.x" iflabel="pcmkvip" \
    op monitor interval="5"
group PGServer pg_lvm pg_fs pg_lsb pg_vip \
    meta target-role="Started"
ms ms_drbd_pg drbd_pg \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location master-prefer-node1 pg_vip 50: pcmk1
colocation col_pg_drbd inf: PGServer ms_drbd

Re: [Pacemaker] Failed stop of stonith resource

2013-08-19 Thread Vladislav Bogdanov
19.08.2013 07:36, Andrew Beekhof wrote:
> 
> On 14/08/2013, at 7:58 AM, Vladislav Bogdanov  wrote:
> 
>> 14.08.2013 00:51, Vladislav Bogdanov wrote:
>>
>> ...
>>
>>>
>>> Sure, reason of the failure of the fence_ipmilan requires investigations
>>> too, but that is not important for the above issue I think.
>>
>> That seems to be stonith-ng failure:
> 
> Did you create a crm_report for this?
> There's just not enough context to say anything based on these logs alone.

Sent privately.

> 
>> Aug 13 20:56:39 mgmt01 stonith-ng[10206]:   notice: log_cib_diff:
>> cib_process_diff: Local-only Change: 0.714.10
>> Aug 13 20:56:39 mgmt01 stonith-ng[10206]:   notice: cib_process_diff: --
>> 
>> Aug 13 20:56:39 mgmt01 stonith-ng[10206]:   notice: cib_process_diff: ++
>>> crm-debug-origin="do_state_transition" in_ccm="true" expected="member"/>
>> Aug 13 20:56:39 mgmt01 stonith-ng[10206]:   notice: cib_process_diff:
>> Diff 0.714.10 -> 0.714.10 from local not applied to 0.714.10: + and -
>> versions in the diff did not change
>> Aug 13 20:56:39 mgmt01 stonith-ng[10206]:   notice: update_cib_cache_cb:
>> [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Aug 13 20:56:53 mgmt01 kernel: dlm: got connection from 10
>> Aug 13 20:57:21 mgmt01 cib[10205]:  warning: cib_notify_send_one:
>> Notification of client crmd/1a302bfe-5a71-4555-abbb-c030fcb6416d failed
>> Aug 13 20:57:26 mgmt01 stonith-ng[10206]:   notice: update_cib_cache_cb:
>> [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Aug 13 20:57:27 mgmt01 stonith-ng[10206]:   notice: update_cib_cache_cb:
>> [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>>
>> ...[many lines with the same message]...
>>
>> Aug 13 20:58:56 mgmt01 stonith-ng[10206]:   notice: update_cib_cache_cb:
>> [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Aug 13 20:58:57 mgmt01 lrmd[10207]:  warning: crm_ipc_send: Request 24
>> to stonith-ng (0x19832b0) failed: Resource temporarily unavailable (-11)
>> Aug 13 20:58:57 mgmt01 lrmd[10207]:error: stonith_send_command:
>> Couldn't perform st_device_register operation (timeout=0s): -11:
>> Connection timed out (110)
>> Aug 13 20:58:57 mgmt01 stonith-ng[10206]:   notice: update_cib_cache_cb:
>> [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Aug 13 20:58:58 mgmt01 crmd[10210]:error: process_lrm_event: LRM
>> operation stonith-ipmi-v03-b_start_0 (call=568, status=4,
>> cib-update=128, confirmed=true) Error
>> Aug 13 20:58:58 mgmt01 stonith-ng[10206]:   notice: update_cib_cache_cb:
>> [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Aug 13 20:58:58 mgmt01 attrd[10208]:   notice: attrd_cs_dispatch: Update
>> relayed from v03-a
>> Aug 13 20:58:58 mgmt01 attrd[10208]:   notice: attrd_triggeAug 13
>> 21:00:39 mgmt01 kernel: imklog 5.8.10, log source = /proc/kmsg started.
>>
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org