Re: [Pacemaker] 2-node cluster doesn't move resources away from a failed node

2012-07-09 Thread David Guyot
Thank you for your help!

I found the problem; it came from a bug in my STONITH agent, which
caused it to become a zombie. I corrected this bug and the cluster now
fails over as expected.

Kind regards.
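
For anyone hitting the same symptom: the zombie came from the forked "wait until the
peer answers pings again" watcher described further down. A minimal sketch of the
pattern that avoids it, assuming a shell external/* plugin; the SOAP call and the
init script names are placeholders, not the actual agent:

    # hypothetical excerpt of the plugin's "reset" action ($1 = action, $2 = target host)
    case "$1" in
      reset)
        peer="$2"
        provider_soap_reset "$peer"            # placeholder for the provider API call
        while ping -c1 -W2 "$peer" >/dev/null 2>&1; do sleep 5; done   # wait until it goes down
        # detach the watcher completely (new session, no inherited stdio) so it is
        # re-parented to init and reaped there instead of lingering as a zombie
        setsid sh -c '
            while ! ping -c1 -W2 "'"$peer"'" >/dev/null 2>&1; do sleep 5; done
            /etc/init.d/openvpn restart        # assumed init script names
            /etc/init.d/corosync start
        ' >/dev/null 2>&1 </dev/null &
        exit 0
        ;;
    esac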

On 08/07/2012 00:12, Andreas Kurz wrote:
> On 07/05/2012 04:12 PM, David Guyot wrote:
>> Hello, everybody.
>>
>> As the title suggests, I'm configuring a 2-node cluster, but I've got a
>> strange issue here: when I put a node in standby mode, using "crm node
>> standby", its resources are correctly moved to the second node, and they stay
>> there even when the first node comes back on-line, which I assume is the
>> preferred behavior (preferred by the designers of such systems) to avoid
>> running resources on a potentially unstable node. Nevertheless, when I
>> simulate failure of the node running the resources with "/etc/init.d/corosync
>> stop", the other node correctly fences the failed node by electrically
>> resetting it, but that doesn't mean it mounts the resources on itself;
>> rather, it waits for the failed node to come back on-line and then
>> re-negotiates resource placement, which inevitably leads to the failed node
>> restarting the resources. I suppose this is a consequence of the resource
>> stickiness still recorded by the intact node: because this node still assumes
>> that the resources are running on the failed node, it assumes that they
>> prefer to stay on the first node, even though it has failed.
>>
>> When the first node, Vindemiatrix, has shut down Corosync, the second,
>> Malastare, reports this:
>>
>> root@Malastare:/home/david# crm_mon --one-shot -VrA
>> 
>> Last updated: Thu Jul  5 15:27:01 2012
>> Last change: Thu Jul  5 15:26:37 2012 via cibadmin on Malastare
>> Stack: openais
>> Current DC: Malastare - partition WITHOUT quorum
>> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
>> 2 Nodes configured, 2 expected votes
>> 17 Resources configured.
>> 
>>
>> Node Vindemiatrix: UNCLEAN (offline)
> Pacemaker thinks fencing was not successful and will not recover
> resources until STONITH has succeeded ... or the node returns and it is
> possible to probe resource states
>
>> Online: [ Malastare ]
>>
>> Full list of resources:
>>
>>  soapi-fencing-malastare (stonith:external/ovh): Started Vindemiatrix
>>  soapi-fencing-vindemiatrix (stonith:external/ovh): Started Malastare
>>  Master/Slave Set: ms_drbd_svn [drbd_svn]
>>  Masters: [ Vindemiatrix ]
>>  Slaves: [ Malastare ]
>>  Master/Slave Set: ms_drbd_pgsql [drbd_pgsql]
>>  Masters: [ Vindemiatrix ]
>>  Slaves: [ Malastare ]
>>  Master/Slave Set: ms_drbd_backupvi [drbd_backupvi]
>>  Masters: [ Vindemiatrix ]
>>  Slaves: [ Malastare ]
>>  Master/Slave Set: ms_drbd_www [drbd_www]
>>  Masters: [ Vindemiatrix ]
>>  Slaves: [ Malastare ]
>>  fs_www (ocf::heartbeat:Filesystem): Started Vindemiatrix
>>  fs_pgsql (ocf::heartbeat:Filesystem): Started Vindemiatrix
>>  fs_svn (ocf::heartbeat:Filesystem): Started Vindemiatrix
>>  fs_backupvi (ocf::heartbeat:Filesystem): Started Vindemiatrix
>>  VirtualIP (ocf::heartbeat:IPaddr2): Started Vindemiatrix
>>  OVHvIP (ocf::pacemaker:OVHvIP): Started Vindemiatrix
>>  ProFTPd (ocf::heartbeat:proftpd): Started Vindemiatrix
>>
>> Node Attributes:
>> * Node Malastare:
>> + master-drbd_backupvi:0  : 1
>> + master-drbd_pgsql:0 : 1
>> + master-drbd_svn:0   : 1
>> + master-drbd_www:0   : 1
>>
>> As you can see, the node failure is detected. This state leads to the
>> attached log file.
>>
>> Note that both ocf::pacemaker:OVHvIP and stonith:external/ovh are custom
>> resources which use my server provider's SOAP API to provide the intended
>> services. The STONITH agent does nothing but return exit status 0 when the
>> start, stop, on or off actions are requested, returns the two node names when
>> the hostlist or gethosts actions are requested and, when the reset action is
>> requested, effectively resets the faulting node using the provider API. As
>> this API doesn't provide a reliable means of knowing the exact moment of the
>> reset, the STONITH agent pings the faulting node every 5 seconds until the
>> ping fails, then forks a process which pings the faulting node every 5
>> seconds until it answers again. Then, because the external VPN has not yet
>> been installed by the provider and I'm forced to emulate it with OpenVPN
>> (which seems unable to re-establish a connection lost minutes ago, leading to
>> a split-brain situation), the STONITH agent restarts OpenVPN to re-establish
>> the connection, then restarts Corosync and Pacemaker.
>>
>> Aside from the VPN issue, whose performance and stability problems I'm fully
>> aware of, I thought that Pacemaker would start the resources on the remaining
>> node as soon as the STONITH agent returned exit status 0, but it doesn't.
>> Instead, it seems that the STONITH reset action waits too

[Pacemaker] could live migration without pause VM?

2012-07-09 Thread hcyy
Hello, everybody.

I use NFS to do live migration. After I input "crm resource migrate vm12 pcmk-2",
it takes almost 10 s of preparation. During those 10 s the VM is still running and
can ping other VMs. But if I run "mkdir pcmk-6" in the VM during those 10 s, it
says: "mkdir: cannot create directory `pcmk-6': Read-only file system".

Can anybody tell me whether this situation is due to a wrong configuration on my
side, or whether libvirt cannot mkdir during migration? Thanks!

primitive vm12 ocf:heartbeat:VirtualDomain \
    params config="/share/vm12.xml" migration_transport="ssh" hypervisor="qemu:///system" \
    meta allow-migrate="true" is-managed="true" target-role="Started" \
    op migrate_from interval="0" timeout="240s" \
    op migrate_to interval="0" timeout="240s" \
    op start interval="0" timeout="120s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="10" timeout="200s" on-fail="restart" depth="0" \
    utilization memory="5120"

primitive Mount_nfs ocf:heartbeat:Filesystem \
    params device="10.50.4.13:/export/yanyang" directory="/share" fstype="nfs" options="rw,hard,intr" \
    op monitor interval="120s" timeout="90s" \
    op start interval="0" timeout="120s" \
    op stop interval="0" timeout="120s" \
    meta target-role="Started"
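
The post shows no constraint between Mount_nfs and vm12. For live migration, both
nodes need access to /share (the domain XML and the disk images), so a common
pattern is to clone the Filesystem resource and tie the VM to the clone. A rough
sketch under that assumption (the name cl_nfs and the constraint IDs are made up,
and this is not taken from the original configuration):

    clone cl_nfs Mount_nfs \
        meta interleave="true" target-role="Started"
    order nfs_before_vm12 inf: cl_nfs vm12
    colocation vm12_with_nfs inf: vm12 cl_nfs

As for the "Read-only file system" error, that message comes from inside the guest;
a guest usually remounts its root read-only after an I/O error on its virtual disk,
so the guest's own dmesg around the migration window is worth checking.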


Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-09 Thread Nikola Ciprich
Hello Andreas,

Yes, you're right; I should have sent those in the initial post. Sorry about that.
I've created a very simple test configuration with which I'm able to reproduce the
problem. There's no stonith etc., since it's just two virtual machines for the test.

crm conf:

primitive drbd-sas0 ocf:linbit:drbd \
  params drbd_resource="drbd-sas0" \
  operations $id="drbd-sas0-operations" \
  op start interval="0" timeout="240s" \
  op stop interval="0" timeout="200s" \
  op promote interval="0" timeout="200s" \
  op demote interval="0" timeout="200s" \
  op monitor interval="179s" role="Master" timeout="150s" \
  op monitor interval="180s" role="Slave" timeout="150s"

primitive lvm ocf:lbox:lvm.ocf \
  op start interval="0" timeout="180" \
  op stop interval="0" timeout="180"

ms ms-drbd-sas0 drbd-sas0 \
   meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" 
notify="true" globally-unique="false" interleave="true" target-role="Started"

clone cl-lvm lvm \
  meta globally-unique="false" ordered="false" interleave="true" 
notify="false" target-role="Started" \
  params lvm-clone-max="2" lvm-clone-node-max="1"

colocation col-lvm-drbd-sas0 inf: cl-lvm ms-drbd-sas0:Master

order ord-drbd-sas0-lvm inf: ms-drbd-sas0:promote cl-lvm:start

property $id="cib-bootstrap-options" \
 dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
 cluster-infrastructure="openais" \
 expected-quorum-votes="2" \
 no-quorum-policy="ignore" \
 stonith-enabled="false"

The lvm resource starts the vgshared volume group on top of drbd (the LVM filters
are set to use /dev/drbd* devices only).
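
For reference, a filter of that kind in /etc/lvm/lvm.conf usually looks roughly like
the line below; the exact reject pattern is an assumption and depends on the local
device layout, the point being that the backing SAS devices are rejected so LVM only
ever sees the PV through /dev/drbd*:

    filter = [ "a|^/dev/drbd.*|", "r|.*|" ]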

drbd configuration:

global {
   usage-count no;
}

common {
   protocol C;

handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; 
/usr/lib/drbd/notify-emergency-reboot.sh; ";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; 
/usr/lib/drbd/notify-emergency-reboot.sh; ";
local-io-error "/usr/lib/drbd/notify-io-error.sh; 
/usr/lib/drbd/notify-emergency-shutdown.sh; ";

#pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; 
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot 
-f";
#pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; 
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot 
-f";
#local-io-error "/usr/lib/drbd/notify-io-error.sh; 
/usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt 
-f";
# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
# split-brain "/usr/lib/drbd/notify-split-brain.sh root";
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target 
"/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target 
/usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
}

net {
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri call-pri-lost-after-sb;
#rr-conflict disconnect;
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 0;
ping-timeout 50;
}

syncer {
rate 100M;
al-extents 3833;
#   al-extents 257;
#   verify-alg sha1;
}

disk {
on-io-error   detach;
no-disk-barrier;
no-disk-flushes;
no-md-flushes;
}

startup {
# wfc-timeout  0;
degr-wfc-timeout 120;# 2 minutes.
# become-primary-on both;

}
}

Note that the pri-on-incon-degr etc. handlers are intentionally commented out so I
can see what's going on; otherwise the machine always got an immediate reboot.
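
One thing this configuration leaves out is DRBD-level fencing: with allow-two-primaries
and stonith-enabled="false" there is nothing below Pacemaker to stop the two sides from
diverging if a node gets promoted before the replication link is established. The
handler scripts for this are already present above (commented out); for the 8.3 series
the companion stanza usually looks roughly like this (a sketch, not the poster's
configuration; resource-and-stonith additionally requires working STONITH):

    disk {
        fencing resource-only;    # or resource-and-stonith once STONITH is configured
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }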

any idea?

thanks a lot in advance

nik


On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
> > hello,
> > 
> > I'm trying to solve quite mysterious problem here..
> > I've got new cluster with bunch of SAS disks for testing purposes.
> > I've configured DRBDs (in primary/primary configuration)
> > 
> > when I start drbd using drbdadm, it gets up nicely (both nodes
> > are Primary, connected).
> > however when I start it using corosync, I always get split-brain, although
> > there are no data written, no network disconnection, anything..
> 
> your full drbd and Pacemaker configuration please ... some snippets from
> something are very seldom helpful ...
> 
> Regards,
> Andreas
> 
> -- 
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
> > 
> > here's drbd resource config:
> 

[Pacemaker] OCFS2, Corosync and Pacemaker on CentOS 6.3/RHEL 6.x

2012-07-09 Thread Errol Neal
I'm curious to understand whether there is any kind of support for
Pacemaker/Corosync clusters on RHEL 6.x (and comparable clones) with OCFS2 as
the cluster FS.
Aside from missing a couple of rpms (e.g. ocfs2-tools, to name one), Pacemaker
just seems to be packaged differently. For example, dlm_controld.pcmk is nowhere
to be found, but dlm_controld does ship with cman.
Any thoughts? 
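
A quick way to check which package, if any, ships the daemon on a given machine
(plain package-query commands, nothing cluster-specific):

    yum provides "*/dlm_controld"
    rpm -q cman pacemaker corosync ocfs2-tools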



Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-09 Thread Andreas Kurz
On 07/09/2012 12:58 PM, Nikola Ciprich wrote:
> Hello Andreas,
> 
> yes, You're right. I should have sent those in the initial post. Sorry about 
> that.
> I've created very simple test configuration on which I'm able to simulate the 
> problem.
> there's no stonith etc, since it's just two virtual machines for the test.
> 
> crm conf:
> 
> primitive drbd-sas0 ocf:linbit:drbd \
> params drbd_resource="drbd-sas0" \
> operations $id="drbd-sas0-operations" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="200s" \
> op promote interval="0" timeout="200s" \
> op demote interval="0" timeout="200s" \
> op monitor interval="179s" role="Master" timeout="150s" \
> op monitor interval="180s" role="Slave" timeout="150s"
> 
> primitive lvm ocf:lbox:lvm.ocf \

Why not use the RA that comes with the resource-agents package?

> op start interval="0" timeout="180" \
> op stop interval="0" timeout="180"
> 
> ms ms-drbd-sas0 drbd-sas0 \
>meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" 
> notify="true" globally-unique="false" interleave="true" target-role="Started"
> 
> clone cl-lvm lvm \
>   meta globally-unique="false" ordered="false" interleave="true" 
> notify="false" target-role="Started" \
>   params lvm-clone-max="2" lvm-clone-node-max="1"
> 
> colocation col-lvm-drbd-sas0 inf: cl-lvm ms-drbd-sas0:Master
> 
> order ord-drbd-sas0-lvm inf: ms-drbd-sas0:promote cl-lvm:start
> 
> property $id="cib-bootstrap-options" \
>dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
>cluster-infrastructure="openais" \
>expected-quorum-votes="2" \
>no-quorum-policy="ignore" \
>stonith-enabled="false"
> 
> lvm resource starts vgshared volume group on top of drbd (LVM filters are set 
> to
> use /dev/drbd* devices only)
> 
> drbd configuration:
> 
> global {
>usage-count no;
> }
> 
> common {
>protocol C;
> 
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; ";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; ";
> local-io-error "/usr/lib/drbd/notify-io-error.sh; 
> /usr/lib/drbd/notify-emergency-shutdown.sh; ";
> 
> #pri-on-incon-degr 
> "/usr/lib/drbd/notify-pri-on-incon-degr.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; 
> reboot -f";
> #pri-lost-after-sb 
> "/usr/lib/drbd/notify-pri-lost-after-sb.sh; 
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; 
> reboot -f";
> #local-io-error "/usr/lib/drbd/notify-io-error.sh; 
> /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; 
> halt -f";
> # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
> # before-resync-target 
> "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
> # after-resync-target 
> /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
> }
> 
> net {
> allow-two-primaries;
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri call-pri-lost-after-sb;
> #rr-conflict disconnect;
> max-buffers 8000;
> max-epoch-size 8000;
> sndbuf-size 0;
> ping-timeout 50;
> }
> 
> syncer {
> rate 100M;
> al-extents 3833;
> #   al-extents 257;
> #   verify-alg sha1;
> }
> 
> disk {
> on-io-error   detach;
> no-disk-barrier;
> no-disk-flushes;
> no-md-flushes;
> }
> 
> startup {
> # wfc-timeout  0;
> degr-wfc-timeout 120;# 2 minutes.
> # become-primary-on both;

this "become-primary-on" was never activated?

> 
> }
> }
> 
> note that pri-on-incon-degr etc handlers are intentionally commented out so I 
> can
> see what's going on.. otherwise machine always got immediate reboot..
> 
> any idea?

Is the drbd init script deactivated on system boot? Cluster logs should
give more insights 
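
On an el6 machine (which the dc-version string above suggests), checking and, if
needed, disabling the init script is just the usual chkconfig dance, for example:

    chkconfig --list drbd
    chkconfig drbd off      # leave starting DRBD entirely to Pacemaker
    service drbd status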

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> thanks a lot in advance
> 
> nik
> 
> 
> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>> hello,
>>>
>>> I'm trying to solve quite mysterious problem here..
>>> I've got new cluster with bunch of SAS disks for testing p

Re: [Pacemaker] Pacemaker cannot start the failed master as a new slave?

2012-07-09 Thread Andreas Kurz
On 07/09/2012 06:11 AM, quanta wrote:
> Related thread:
> http://oss.clusterlabs.org/pipermail/pacemaker/2011-December/012499.html
> 
> I'm going to setup failover for MySQL replication (1 master and 1 slave)
> follow this guide:
> https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst

and you also use the latest mysql RA from resource-agents github?

> 
> Here's the output of `crm configure show`:
> 
> node serving-6192 \
> attributes p_mysql_mysql_master_IP="192.168.6.192"
> node svr184R-638.localdomain \
> attributes p_mysql_mysql_master_IP="192.168.6.38"
> primitive p_mysql ocf:percona:mysql \
> params config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid"
> socket="/var/lib/mysql/mysql.sock" replication_user="repl"
> replication_passwd="x" test_user="test_user" test_passwd="x" \
> op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \
> op monitor interval="2s" role="Slave" timeout="30s"
> OCF_CHECK_LEVEL="1" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s"
> primitive writer_vip ocf:heartbeat:IPaddr2 \
> params ip="192.168.6.8" cidr_netmask="32" \
> op monitor interval="10s" \
> meta is-managed="true"
> ms ms_MySQL p_mysql \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true" globally-unique="false"
> target-role="Master" is-managed="true"
> colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master
> order ms_MySQL_promote_before_vip inf: ms_MySQL:promote writer_vip:start
> property $id="cib-bootstrap-options" \
> dc-version="1.0.12-unknown" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> last-lrm-refresh="1341801689"
> property $id="mysql_replication" \
> p_mysql_REPL_INFO="192.168.6.192|mysql-bin.06|338"
> 
> `crm_mon`:
> 
> Last updated: Mon Jul  9 10:30:01 2012
> Stack: openais
> Current DC: serving-6192 - partition with quorum
> Version: 1.0.12-unknown
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
> 
> Online: [ serving-6192 svr184R-638.localdomain ]
> 
>  Master/Slave Set: ms_MySQL
>  Masters: [ serving-6192 ]
>  Slaves: [ svr184R-638.localdomain ]
> writer_vip (ocf::heartbeat:IPaddr2): Started serving-6192
> 
> Editing `/etc/my.cnf` on serving-6192 to introduce a syntax error and test
> failover works fine:
> - svr184R-638.localdomain is promoted to become the master
> - writer_vip switches to svr184R-638.localdomain
> 
> Last updated: Mon Jul  9 10:35:57 2012
> Stack: openais
> Current DC: serving-6192 - partition with quorum
> Version: 1.0.12-unknown
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
> 
> Online: [ serving-6192 svr184R-638.localdomain ]
> 
>  Master/Slave Set: ms_MySQL
>  Masters: [ svr184R-638.localdomain ]
>  Stopped: [ p_mysql:0 ]
> writer_vip (ocf::heartbeat:IPaddr2): Started svr184R-638.localdomain
> 
> Failed actions:
> p_mysql:0_monitor_5000 (node=serving-6192, call=15, rc=7,
> status=complete): not running
> p_mysql:0_demote_0 (node=serving-6192, call=22, rc=7,
> status=complete): not running
> p_mysql:0_start_0 (node=serving-6192, call=26, rc=-2, status=Timed
> Out): unknown exec error
> 
> After removing the wrong syntax from `/etc/my.cnf` on serving-6192 and
> restarting corosync, what I would like to see is serving-6192 started as a
> new slave, but it doesn't happen:
> 
> Failed actions:
> p_mysql:0_start_0 (node=serving-6192, call=4, rc=1,
> status=complete): unknown error
> 
> Here's a snippet of the logs that I find suspicious:
> 
> Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: rsc:p_mysql:0:4: start
> Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: RA output:
> (p_mysql:0:start:stderr) Error performing operation: The
> object/attribute does not exist
> 
> Jul 09 10:46:32 serving-6192 crm_attribute: [7420]: info: Invoked:
> /usr/sbin/crm_attribute -N serving-6192 -l reboot --name readable -v 0

Not enough logs ... at least for me ... to give more hints.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> The strange thing is that I can start it manually:
> 
> export OCF_ROOT=/usr/lib/ocf
> export OCF_RESKEY_config="/etc/my.cnf"
> export OCF_RESKEY_pid="/var/run/mysqld/mysqld.pid"
> export OCF_RESKEY_socket="/var/lib/mysql/mysql.sock"
> export OCF_RESKEY_replication_user="repl"
> export OCF_RESKEY_replication_passwd="x"
> export OCF_RESKEY_max_slave_lag="60"
> export OCF_RESKEY_evict_outdated_slaves="false"
> export OCF_RESKEY_test_user="test_user"
> export OCF_RESKEY_test_passwd="x"
> 
> `sh -x /usr/lib/ocf/resource.d/percona/mysql start`: http://fpaste.org/RVGh/
> 
> Did I do something wrong?
> 
> 
> 
> 

Re: [Pacemaker] Cannot start the failed MySQL master as a new slave?

2012-07-09 Thread quanta


On 07/10/2012 05:08 AM, Andreas Kurz wrote:
> and you also use the latest mysql RA from resource-agents github? 
Yes.
> Not enough logs ... at least for me ... to give more hints. Regards,
> Andreas

Sorry, here for you: http://fpaste.org/AyOZ/
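
For a more complete picture than hand-picked snippets, hb_report (called crm_report
in newer Pacemaker versions) can bundle the logs, the CIB and resource agent output
from both nodes into one archive. A sketch, with the start time taken from the
crm_mon output quoted earlier in the thread and the destination name made up; check
the exact time format against the hb_report man page:

    hb_report -f "2012-07-09 10:30" /tmp/mysql-failover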



Re: [Pacemaker] drbd under pacemaker - always get split brain

2012-07-09 Thread Nikola Ciprich
Hello Andreas,
> Why not use the RA that comes with the resource-agents package?
Well, I've historically used my own scripts and hadn't even noticed when the LVM
resource agent appeared. I've switched to it now; thanks for the hint.
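
For completeness, a sketch of what the primitive typically looks like with the stock
agent, assuming the vgshared volume group mentioned earlier (the timeout values are
just placeholders), cloned via the existing cl-lvm definition as before:

    primitive lvm ocf:heartbeat:LVM \
        params volgrpname="vgshared" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30" \
        op monitor interval="60" timeout="30"
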
> this "become-primary-on" was never activated?
nope.


> Is the drbd init script deactivated on system boot? Cluster logs should
> give more insights 
Yes, it's deactivated. I tried resyncing drbd by hand, deleted the logs,
rebooted both nodes, checked that drbd wasn't started, and then started corosync.
The result is here:
http://nelide.cz/nik/logs.tar.gz

thanks for Your time.
n.


> 
> Regards,
> Andreas
> 
> -- 
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
> > 
> > thanks a lot in advance
> > 
> > nik
> > 
> > 
> > On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
> >> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
> >>> hello,
> >>>
> >>> I'm trying to solve quite mysterious problem here..
> >>> I've got new cluster with bunch of SAS disks for testing purposes.
> >>> I've configured DRBDs (in primary/primary configuration)
> >>>
> >>> when I start drbd using drbdadm, it gets up nicely (both nodes
> >>> are Primary, connected).
> >>> however when I start it using corosync, I always get split-brain, although
> >>> there are no data written, no network disconnection, anything..
> >>
> >> your full drbd and Pacemaker configuration please ... some snippets from
> >> something are very seldom helpful ...
> >>
> >> Regards,
> >> Andreas
> >>
> >> -- 
> >> Need help with Pacemaker?
> >> http://www.hastexo.com/now
> >>
> >>>
> >>> here's drbd resource config:
> >>> primitive drbd-sas0 ocf:linbit:drbd \
> >>> params drbd_resource="drbd-sas0" \
> >>> operations $id="drbd-sas0-operations" \
> >>> op start interval="0" timeout="240s" \
> >>> op stop interval="0" timeout="200s" \
> >>> op promote interval="0" timeout="200s" \
> >>> op demote interval="0" timeout="200s" \
> >>> op monitor interval="179s" role="Master" timeout="150s" \
> >>> op monitor interval="180s" role="Slave" timeout="150s"
> >>>
> >>> ms ms-drbd-sas0 drbd-sas0 \
> >>>meta clone-max="2" clone-node-max="1" master-max="2" 
> >>> master-node-max="1" notify="true" globally-unique="false" 
> >>> interleave="true" target-role="Started"
> >>>
> >>>
> >>> here's the dmesg output when pacemaker tries to promote drbd, causing the 
> >>> splitbrain:
> >>> [  157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
> >>> [  157.646539] block drbd2: disk( Diskless -> Attaching ) 
> >>> [  157.650364] block drbd2: Found 1 transactions (1 active extents) in 
> >>> activity log.
> >>> [  157.650560] block drbd2: Method to ensure write ordering: drain
> >>> [  157.650688] block drbd2: drbd_bm_resize called with capacity == 
> >>> 584667688
> >>> [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 
> >>> pages=2231
> >>> [  157.653760] block drbd2: size = 279 GB (292333844 KB)
> >>> [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
> >>> [  157.673722] block drbd2: recounting of set bits took additional 2 
> >>> jiffies
> >>> [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk 
> >>> bit-map.
> >>> [  157.673972] block drbd2: disk( Attaching -> UpToDate ) 
> >>> [  157.674100] block drbd2: attached to UUIDs 
> >>> 0150944D23F16BAE::8C175205284E3262:8C165205284E3263
> >>> [  157.685539] block drbd2: conn( StandAlone -> Unconnected ) 
> >>> [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker 
> >>> [6893])
> >>> [  157.685928] block drbd2: receiver (re)started
> >>> [  157.686071] block drbd2: conn( Unconnected -> WFConnection ) 
> >>> [  158.960577] block drbd2: role( Secondary -> Primary ) 
> >>> [  158.960815] block drbd2: new current UUID 
> >>> 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
> >>> [  162.686990] block drbd2: Handshake successful: Agreed network protocol 
> >>> version 96
> >>> [  162.687183] block drbd2: conn( WFConnection -> WFReportParams ) 
> >>> [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver 
> >>> [6927])
> >>> [  162.687741] block drbd2: data-integrity-alg: 
> >>> [  162.687930] block drbd2: drbd_sync_handshake:
> >>> [  162.688057] block drbd2: self 
> >>> 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 
> >>> bits:0 flags:0
> >>> [  162.688244] block drbd2: peer 
> >>> 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 
> >>> bits:0 flags:0
> >>> [  162.688428] block drbd2: uuid_compare()=100 by rule 90
> >>> [  162.688544] block drbd2: helper command: /sbin/drbdadm 
> >>> initial-split-brain minor-2
> >>> [  162.691332] block drbd2: helper command: /sbin/drbdadm 
> >>> initial-split-brain minor-2 exit code 0 (0x0)
> >>>
> >>> to me it seems to be that it's promoting it too early, and I also wonder 
> >>> why there is the 
> >>> "new current UUID" stuff?
> >>>
> >>> I'm using centos6,