Re: [Pacemaker] Business Standard India - Errors noticed in pacemaker

2011-07-26 Thread Andrew Beekhof
Thought about googling the error?

On Mon, Jul 25, 2011 at 9:17 PM, Gururaj B Patil gururaj.pa...@bsmail.in wrote:

 To support staff,

 We at Business Standard Ltd. use Pacemaker as the clustering application for
 one of our websites. Two servers are in clustering mode.

 One of the servers is a web server and the other is a MySQL DB server.
 Pacemaker handles MySQL clustering at the block level.

 We have noticed the same notice and error messages recurring in the server's
 messages file. They are as below.


 ---
 Messages like below appear every 15 minutes

 sidrbd0 pengine: [3143]: notice: get_failcount: Failcount for sipdu1 on
 sidrbd0 has expired (limit was 20s)

 sidrbd0 pengine: [3143]: ERROR: create_notification_boundaries: Creating
 boundaries for mysql-ms-drbd


 ---

 I have registered for pacemaker mailing list also.

 Regards,
 Gururaj Patil
 Systems Department
 Business Standard Ltd.
 H3/4, Paragon center,
 P.B.Marg,
 Worli
 Mumbai - 400013
 India

 Ph.+91-22-24971924
 --



 *Disclaimer:* This communication/message is for the named addressees
 only. This transmission may contain information that is privileged,
 confidential, proprietary or legally privileged, and /or exempt from
 disclosure under applicable law. If you are not the intended recipient,
 please immediately notify the sender and destroy the material in its
 entirety, whether in electronic or hard copy format. You are hereby notified
 that any disclosure, copying, distribution, or use of the information
 contained herein (including any reliance thereon) is STRICTLY PROHIBITED.
 You must not, directly or indirectly, use, disclose, distribute, print or
 copy any part of this message.

 *WARNING :*This electronic mail and any attachments are believed to be
 free of any virus or other defect, the recipient must ensure that it is
 virus free and no responsibility is accepted by Business Standard Limited
 and /or its employees as applicable for any loss or damage arising in any
 way from its use.


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




Re: [Pacemaker] Cluster type is: corosync

2011-07-26 Thread Andrew Beekhof
On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill
k.prosku...@corp.mail.ru wrote:
 25.07.2011 10:10, Andrew Beekhof wrote:

 Which packages are you using?

 I built them from the official source in your repository.

Ok. And did you add the pacemaker configuration options to corosync's
config file?

 pacemaker-1.1.5
 corosync-1.4.0
 cluster-glue-1.0.6
 openais-1.1.2

 All nodes have same rpms.

 On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill
 k.prosku...@corp.mail.ru  wrote:

 Hello again!

 I hope I'm not flooding the list too much, but I have another problem.

 I installed the same RPMs of corosync, openais, pacemaker and cluster_glue
 on all nodes. I checked it twice.

 But when I start some of them, they can't connect to the cluster and stay
 offline. In the logs I can see that they see the other nodes and that
 connectivity is OK. But I found a difference:

 Online nodes in cluster have:
 [root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
 Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info:
 get_cluster_type: Cluster type is: 'openais'.
 Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info: get_cluster_type:
 Cluster type is: 'openais'.
 Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type:
 Cluster type is: 'openais'.
 Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type:
 Cluster type is: 'openais'.

 Offline have:
 [root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
 Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info:
 get_cluster_type: Cluster type is: 'corosync'.
 Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type:
 Cluster type is: 'corosync'.
 Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type:
 Cluster type is: 'corosync'.
 Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type:
 Cluster type is: 'corosync'.

 What's wrong and how can I fix it?

 --
 Best regards,
 Proskurin Kirill



Re: [Pacemaker] Cluster type is: corosync

2011-07-26 Thread Proskurin Kirill

On 07/26/2011 11:00 AM, Andrew Beekhof wrote:
 On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill
 k.prosku...@corp.example.com wrote:
  25.07.2011 10:10, Andrew Beekhof wrote:
   Which packages are you using?

  I built them from the official source in your repository.

 Ok. And did you add the pacemaker configuration options to corosync's
 config file?

I attach our corosync.conf. It is the same on all nodes except for the IP
address. Pacemaker is blank now - no configuration at all.

Online nodes:
[root@mysender1 ~]# crm configure show
node mysender1.example.com
node mysender2.example.com
node mysender3.example.com
node mysender4.example.com
node mysender5.example.com
node mysender6.example.com
node mysender7.example.com
property $id=cib-bootstrap-options \
dc-version=1.1.5-3-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
cluster-infrastructure=openais \
expected-quorum-votes=6


Offline nodes (cluster type is 'corosync'):
[root@mysender2 ~]# crm configure show
[root@mysender2 ~]#





pacemaker-1.1.5
corosync-1.4.0
cluster-glue-1.0.6
openais-1.1.2

All nodes have same rpms.


--
Best regards,
Proskurin Kirill
totem {
    version: 2
    token: 2500
    token_retransmits_before_loss_const: 10
    join: 100
    consensus: 3000
    vsftype: none
    max_messages: 20
    send_join: 45
    secauth: off
    fail_recv_const: 5000

    interface {
        ringnumber: 0
        bindnetaddr: 10.6.1.155
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 31
    }
}

logging {
    fileline: off
    to_syslog: no
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync.log
    debug: off
    timestamp: on
}

amf {
    mode: disabled
}
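
Note that the file above contains no service section for Pacemaker, which is what the earlier question about "the pacemaker configuration options" refers to. With the plugin-based corosync 1.x stack, Pacemaker is normally loaded through a stanza like the sketch below, following the pattern in the Clusters from Scratch guide. Whether its absence explains the differing 'Cluster type is' values on these nodes is an assumption, not something confirmed in this thread.

```conf
service {
    # Load the Pacemaker cluster resource manager as a corosync plugin
    name: pacemaker
    # ver: 0 lets the plugin start the Pacemaker daemons itself
    ver: 0
}
```

The stanza would need to be present on every node, and corosync restarted, before the nodes could be expected to report the same cluster type.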


[Pacemaker] Business Standard India - Errors noticed in pacemaker

2011-07-26 Thread Gururaj B Patil

Dear Mr. Andrew Beekhof,

Yes, I did try googling but could not find proper information. In a few
forums I noticed a message like the one below.

 That's development logging, which was accidentally bumped to a higher
 log level.

We are searching again, but meanwhile can the Pacemaker team help with this?

Regards,
Gururaj Patil



From:   pacemaker-requ...@oss.clusterlabs.org
To: pacemaker@oss.clusterlabs.org
Date:   07/26/2011 12:47 PM
Subject:Pacemaker Digest, Vol 44, Issue 50




Today's Topics:

   1. Re: Business Standard India - Errors noticed in pacemaker
      (Andrew Beekhof)
   2. Re: Cluster type is: corosync (Andrew Beekhof)
   3. Please teach it about handling of the unmanaged resource in
  environment setting placement-strategy. (Yuusuke IIDA)


--

Message: 1
Date: Tue, 26 Jul 2011 16:02:30 +1000
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
Cc: M A Faruqui m.faru...@bsmail.in,   Prafulla H Patil
 prafulla.pa...@bsmail.in, Bandana Roy
bandana@bsmail.in
Subject: Re: [Pacemaker] Business Standard India - Errors noticed in
 pacemaker
Message-ID:

caedlwg0exctnrjq8kduv8jdc3hztvsovb7gdnyv4mvppox_...@mail.gmail.com
Content-Type: text/plain; charset=iso-8859-1


--

Message: 2
Date: Tue, 26 Jul 2011 17:00:56 +1000
From: Andrew Beekhof and...@beekhof.net
To: Pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Cluster type is: corosync

Re: [Pacemaker] Problem with colocation

2011-07-26 Thread Taneli Leppä

Hello,

On 25.7.11 13:28, Yingliang Yang wrote:

  <constraints>
  <rsc_colocation id="Sphinx_with_IP" rsc="Sphinx" score-attribute="INF"
  with-rsc="Sphinx_IP"/>
  </constraints>
 There is a problem in your config.
 The score-attribute should be score and its value should be INFINITY.


Thanks, you were correct. The Cluster from Scratch manual uses inf
shorthand all the time, so I thought it would work.

Should this kind of error pass the schema check anyways?
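
For reference, with the correction described above applied, the constraint would look roughly like the sketch below. The ids and resource names are the ones from the quoted config; the only change is replacing score-attribute with score and INF with INFINITY, as the reply advises.

```xml
<constraints>
  <!-- Keep Sphinx on the same node as Sphinx_IP; INFINITY makes it mandatory -->
  <rsc_colocation id="Sphinx_with_IP" rsc="Sphinx" with-rsc="Sphinx_IP" score="INFINITY"/>
</constraints>
```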

--
  Taneli Leppä   | CISSP, RHCE, ZCE, CMDEV
  Crasman Co Ltd | tan...@crasman.fi



[Pacemaker] Clone resource each instance start\Stop

2011-07-26 Thread manish . gupta
Hi,

I have configured a multi-state (clone) resource, a floating IP (IP).
It is running on all the configured nodes.

I am trying to stop it using the crm_resource command:

crm_resource -r IP:0 -p target-role -v stopped

I am getting this error:

Error performing operation : The object/attribute does not exist.

Can anybody please help me? How can I stop a single instance using
a command?

If I manually bring down a single instance on one node and then clean up
that instance, it comes up again, i.e. it is started again:

 ifconfig eth0:1 down
 crm_resource -C -r IP:0 -H NodeName

 This works properly.

 Cluster stack:
 corosync-1.2
 pacemaker-1.10




[Pacemaker] The version about pacemaker

2011-07-26 Thread Yingliang Yang
Hi, Andrew

I'd like to know whether the latest version (c86cb93c5a57) of the Pacemaker
code on the site is stable. If not, how about Pacemaker 1.1.5 (01e86afaaa6d)?

BTW, I'd like to know when version 1.1.6 will be released. Will it be in
the near future?

Best Regards,
Yingliang Yang


Re: [Pacemaker] ping RA question

2011-07-26 Thread Yingliang Yang
2011/7/22 Dan Urist wrote:
 I am in the process of trying to write an fping RA, based on the
 pacemaker ping RA. My impetus for this is that I would like the RA to
 return success as soon as any ping succeeds; the behavior of linux's
 system ping as used in the standard ping RA is to run COUNT pings
 within the given deadline and only after COUNT or the deadline return
 success if any of the pings succeeded-- very inefficient.

 My question is this: the ping RA sets default values for
 OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval, and it
 tests that OCF_RESKEY_CRM_meta_interval is an integer greater than 0.
 These variables aren't used anywhere else within the RA, but these are
 the same values in the actions section of the metadata for the monitor
 timeout and interval. I can't find any documentation that these
 variables serve as defaults for the monitor action in either the OCF
 agent developer guide or the pacemaker docs, but this seems to be the
 intent. Is this what they're there for?

I think so.
The values in the actions section of the metadata for the monitor timeout
and interval are used as the monitor operation's default values.
OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval come from
the monitor operation's actual values.



Best Regards,
Yingliang Yang



Re: [Pacemaker] ping RA question

2011-07-26 Thread Dan Urist
On Tue, 26 Jul 2011 18:41:25 +0800
Yingliang Yang zjut...@gmail.com wrote:

 2011/7/22 Dan Urist wrote:
  I am in the process of trying to write an fping RA, based on the
  pacemaker ping RA. My impetus for this is that I would like the RA
  to return success as soon as any ping succeeds; the behavior of
  linux's system ping as used in the standard ping RA is to run COUNT
  pings within the given deadline and only after COUNT or the
  deadline return success if any of the pings succeeded-- very
  inefficient.
 
  My question is this: the ping RA sets default values for
  OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval, and it
  tests that OCF_RESKEY_CRM_meta_interval is an integer greater than
  0. These variables aren't used anywhere else within the RA, but
  these are the same values in the actions section of the metadata
  for the monitor timeout and interval. I can't find any
  documentation that these variables serve as defaults for the
  monitor action in either the OCF agent developer guide or the
  pacemaker docs, but this seems to be the intent. Is this what
  they're there for?
 
 I think so.
 The values in the actions section of the metadata for the monitor
 timeout and interval are used as the monitor operation's default values.

Not to be pedantic, but that's not what the OCF RA developer's guide
says, from http://www.linux-ha.org/doc/dev-guides/_metadata.html:

  Every action should list its own timeout value. This is a hint to the
  user what minimal timeout should be configured for the action. This is
  meant to cater for the fact that some resources are quick to start and
  stop (IP addresses or filesystems, for example), some may take several
  minutes to do so (such as databases).

  In addition, recurring actions (such as monitor) should also specify a
  recommended minimum interval, which is the time between two
  consecutive invocations of the same action. Like timeout, this value
  does not constitute a default — it is merely a hint for the user
  which action interval to configure, at minimum.


 OCF_RESKEY_CRM_meta_timeout and OCF_RESKEY_CRM_meta_interval come
 from the monitor operation's actual values.

Are you sure these variables are used for the monitor action? There's
no documentation for these that I can find, either in the OCF RA
developer's guide or here:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-options.html

I've grepped through all the resource agents
under /usr/lib/ocf/resource.d (this is on a Debian Lenny system); the
only thing I can see meta_timeout used for is calculating a reasonable
shutdown timeout, and the only thing I see meta_interval used for is to
detect a probe.
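
To make the distinction concrete, here is a minimal sketch of an actions section as it might appear in an agent's metadata. It is a hypothetical excerpt: the action list and all numeric values are illustrative assumptions, not copied from the ping RA. Per the developer's guide passage quoted above, these attributes are hints to the user rather than defaults.

```xml
<!-- Hypothetical excerpt from a resource agent's meta-data output -->
<actions>
  <action name="start"     timeout="60" />
  <action name="stop"      timeout="20" />
  <!-- Hints only: suggested minimum timeout and interval, in seconds -->
  <action name="monitor"   timeout="60" interval="10" />
  <action name="meta-data" timeout="5" />
</actions>
```

As far as I understand it, what the cluster actually schedules is the operation defined in the configuration (for example, an op monitor with its own interval and timeout), and those configured values are what reach the agent, in milliseconds, as OCF_RESKEY_CRM_meta_interval and OCF_RESKEY_CRM_meta_timeout.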

-- 
Dan Urist
dur...@ucar.edu
303-497-2459



Re: [Pacemaker] Cluster with DRBD : split brain

2011-07-26 Thread Lars Ellenberg
On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
 On 07/20/2011 11:24 AM, Hugo Deprez wrote:
  Hello Andrew,
  
  in fact DRBD was in standalone mode but the cluster was working :
  
  Here is the syslog of the drbd's split brain :
  
  Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake
  successful: Agreed network protocol version 91
  Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn(
  WFConnection -> WFReportParams )
  Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting
  asender thread (from drbd0_receiver [23281])
  Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0:
  data-integrity-alg: not-used
  Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0:
  drbd_sync_handshake:
  Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self
  BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F
  bits:75338 flags:0
  Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer
  8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F
  bits:769 flags:0
  Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0:
  uuid_compare()=100 by rule 90
  Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain
  detected, dropping connection!
  Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper
  command: /sbin/drbdadm split-brain minor-0
  Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta
  connection shut down by peer.
  Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn(
  WFReportParams -> NetworkFailure )
  Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender
  terminated
  Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating
  asender thread
  Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper
  command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
  Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn(
  NetworkFailure -> Disconnecting )
  Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error
  receiving ReportState, l: 4!
  Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection
  closed
  Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn(
  Disconnecting -> StandAlone )
  Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver
  terminated
  Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating
  receiver thread
 
 This was a DRBD split-brain, not a pacemaker split. I think that might
 have been the source of confusion.
 
 The split brain occurs when both DRBD nodes lose contact with one
 another and then proceed as StandAlone/Primary/UpToDate. To avoid this,
 configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh'
 in drbd.conf;
 
 ===
 disk {
     fencing resource-and-stonith;
 }

 handlers {
     outdate-peer "/path/to/crm-fence-peer.sh";
 }
 ===

Thanks, that is basically right.
Let me fill in some details, though:

 This will tell DRBD to block (resource) and fence (stonith). DRBD will

drbd fencing options are fencing resource-only,
and fencing resource-and-stonith. 

resource-only does *not* block IO while the fencing handler runs.

resource-and-stonith does block IO.

 not resume IO until either the fence script exits with a success, or
 until an admin types 'drbdadm resume-io res'.


 The CRM script simply calls pacemaker and asks it to fence the other
 node.

No.  It tries to place a constraint forcing the Master role off of any
node but the one with the good data.

 When a node has actually failed, then the lost node is fenced. If
 both nodes are up but disconnected, as you had, then only the fastest
 node will succeed in calling the fence, and the slower node will be
 fenced before it can call a fence.

"Fenced" here may mean being restricted from being/becoming Master by that
fencing constraint or, if pacemaker decides to do so, actually being shot by
some node-level fencing agent (stonith).

All that resource-level fencing by placing a constraint
obviously only works as long as the cluster communication is still up.
If not only the drbd replication link had issues, but the cluster
communication was down as well, it becomes a bit more complex.
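
As a rough sketch of the constraint-based setup described above, a DRBD 8.3-style resource section might wire the handlers as follows. The resource name r0 and the /usr/lib/drbd/ script paths are assumptions and vary by distribution. To my knowledge, fence-peer is the name the DRBD 8.3 documentation uses for the handler written as outdate-peer in the quoted example, and the after-resync-target handler is what removes the constraint again once the peer has resynchronized.

```conf
resource r0 {
    disk {
        # resource-only: run the fence-peer handler, but do not freeze IO
        # resource-and-stonith: freeze IO until the handler returns
        fencing resource-only;
    }
    handlers {
        # Places a constraint keeping the Master role off the outdated peer
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # Removes that constraint once resync has finished
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # device, disk, address and meta-disk settings omitted
}
```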

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] Cluster with DRBD : split brain

2011-07-26 Thread Digimer
On 07/26/2011 11:43 AM, Lars Ellenberg wrote:
 Thanks, that is basically right.
 Let me fill in some details, though:
 
 This will tell DRBD to block (resource) and fence (stonith). DRBD will
 
 drbd fencing options are fencing resource-only,
 and fencing resource-and-stonith. 
 
 resource-only does *not* block IO while the fencing handler runs.
 
 resource-and-stonith does block IO.

Ahhh, that's why I was confused. I thought the 'resource' meant the same
thing in both cases, but had only read the 'resource-and-stonith' section.

 not resume IO until either the fence script exits with a success, or
 until an admin types 'drbdadm resume-io res'.
 
 
 The CRM script simply calls pacemaker and asks it to fence the other
 node.
 
 No.  It tries to place a constraint forcing the Master role off of any
 node but the one with the good data.

Ok, I thought it was akin to the 'obliterate-peer.sh' script, which
calls 'fence_node'... I made an assumption, which was not correct.

 When a node has actually failed, then the lost node is fenced. If
 both nodes are up but disconnected, as you had, then only the fastest
 node will succeed in calling the fence, and the slower node will be
 fenced before it can call a fence.
 
 "Fenced" here may mean being restricted from being/becoming Master by that
 fencing constraint or, if pacemaker decides to do so, actually being shot by
 some node-level fencing agent (stonith).
 
 All that resource-level fencing by placing a constraint
 obviously only works as long as the cluster communication is still up.
 If not only the drbd replication link had issues, but the cluster
 communication was down as well, it becomes a bit more complex.

Thanks for the clarity. Today I learned. :)

-- 
Digimer
E-Mail:  digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin:   http://nodeassassin.org
At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?


Re: [Pacemaker] The version about pacemaker

2011-07-26 Thread Andrew Beekhof
On Tue, Jul 26, 2011 at 7:15 PM, Yingliang Yang zjut...@gmail.com wrote:
 Hi, Andrew

 I'd like to know whether the latest version (c86cb93c5a57) of the Pacemaker
 code on the site is stable.

Yes. It is.

 If not, how about Pacemaker 1.1.5 (01e86afaaa6d)?

 BTW, I'd like to know when version 1.1.6 will be released. Will it be in
 the near future?

I hope so :-(
Lately I've been required to work on some other things but I should be
handing over responsibilities for those tasks and be back on pacemaker
full time very soon.


 Best Regards,
 Yingliang Yang







Re: [Pacemaker] Resources are not restarted on definition change after f59d7460bdde (devel)

2011-07-26 Thread Andrew Beekhof
On Fri, Jul 1, 2011 at 4:59 PM, Andrew Beekhof and...@beekhof.net wrote:
 Hmm.  Interesting. I will investigate.

This is an unfortunate side-effect of my history compression patch.

Since we only store the last successful and last failed operation, we
don't have the md5 of the start operation around to check when a
resource's definition is changed.

Solutions appear to be either:
a) give up the space savings and revert the history compression patch
b) always restart a resource if a non-matching md5 is detected - even
if the operation was a recurring monitor

I'd favor b) along with dropping the per-operation parameters.
The only valid use-case I've heard for those is setting OCF_LEVEL or
depth or whatever it was called - and I think we're in basic agreement
that we need a better solution for that anyway.
Perhaps promoting it to be an attribute of the op tag (along with timeout etc).
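
Option b) can be illustrated with a small sketch. This is not Pacemaker's
digest code: the canonical form used here (sorted key=value lines) and the
function names are assumptions made purely for illustration. The point is only
that a restart is triggered by any digest mismatch, regardless of which
operation recorded the digest.

```python
import hashlib


def params_digest(params):
    """Return an md5 hex digest of a parameter dict.

    The canonical form (sorted "key=value" lines) is an assumption for this
    sketch; Pacemaker computes its digests differently.
    """
    canonical = "\n".join(f"{key}={value}" for key, value in sorted(params.items()))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def needs_restart(recorded_digest, current_params):
    """Option b): restart on any digest mismatch, whatever operation recorded it."""
    return recorded_digest != params_digest(current_params)


# Hypothetical parameters for a filesystem resource.
old = {"device": "/dev/vg0/install", "directory": "/srv/install"}
new = {"device": "/dev/vg0/install", "directory": "/srv/install2"}

recorded = params_digest(old)          # stored with the last successful operation
print(needs_restart(recorded, old))    # definition unchanged
print(needs_restart(recorded, new))    # definition changed
```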


 On Tue, Jun 28, 2011 at 3:46 AM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi all,

 I'm pretty sure I have bisected the commit that breaks restart of (node-local)
 resources after a definition change.

 Nodes that have f59d7460bdde applied (v03-a and v03-b in my case) do not
 restart such resources, while the node without this commit (mgmt01) does.

 Here is a snippet from the DC (grrr, thunderbird does not like long lines):
 
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:0_start_0 on mgmt01 changed: recorded
 a2a2341cf3c157a1b44dd9ed7068e2dd vs. 31e7242629b49443f536c22192debb15
 (all:3.0.5) 0:0;150:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:0_monitor_36 on mgmt01 changed:
 recorded 346bad4576870d644109c1e6233002aa vs.
 d9c16f21c130ae8da55d8eac0b6c6cdc (all:3.0.5)
 0:0;153:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:0_monitor_24 on mgmt01 changed:
 recorded fbdf86bce136d60e21c1ef1fad451c0d vs.
 11cd729f3313767ad7383c42495e612b (all:3.0.5)
 0:0;152:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:0_monitor_12 on mgmt01 changed:
 recorded 34e9fed5be3737e563b47b0c3e353db1 vs.
 54b02cd722053809bd0b1a3619adfd3b (all:3.0.5)
 0:0;151:2:0:62c60b6a-17e8-4dbf-8291-a01e7ea06b6a
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:1_monitor_36 on v03-a changed:
 recorded 346bad4576870d644109c1e6233002aa vs.
 d9c16f21c130ae8da55d8eac0b6c6cdc (all:3.0.5)
 0:0;177:2:0:9b3096b4-6add-4612-937c-f7013b18fd15
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:1_monitor_24 on v03-a changed:
 recorded fbdf86bce136d60e21c1ef1fad451c0d vs.
 11cd729f3313767ad7383c42495e612b (all:3.0.5)
 0:0;176:2:0:9b3096b4-6add-4612-937c-f7013b18fd15
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:1_monitor_12 on v03-a changed:
 recorded 34e9fed5be3737e563b47b0c3e353db1 vs.
 54b02cd722053809bd0b1a3619adfd3b (all:3.0.5)
 0:0;175:2:0:9b3096b4-6add-4612-937c-f7013b18fd15
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:2_monitor_36 on v03-b changed:
 recorded 346bad4576870d644109c1e6233002aa vs.
 d9c16f21c130ae8da55d8eac0b6c6cdc (all:3.0.5)
 0:0;182:3:0:76ced8fb-1f7b-4a40-898c-a134b816b791
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:2_monitor_24 on v03-b changed:
 recorded fbdf86bce136d60e21c1ef1fad451c0d vs.
 11cd729f3313767ad7383c42495e612b (all:3.0.5)
 0:0;181:3:0:76ced8fb-1f7b-4a40-898c-a134b816b791
 Jun 27 17:35:58 mgmt01 pengine: [31176]: WARN: check_action_definition:
 Parameters to libvirt-install-fs:2_monitor_12 on v03-b changed:
 recorded 34e9fed5be3737e563b47b0c3e353db1 vs.
 54b02cd722053809bd0b1a3619adfd3b (all:3.0.5)
 0:0;180:3:0:76ced8fb-1f7b-4a40-898c-a134b816b791
 =

 The resource is then restarted on mgmt01 but not on the other nodes.
 The first line of the log snippet (for the libvirt-install-fs:0_start_0
 operation) has no counterpart for the start ops of resources on the other nodes.

 The only difference between the pacemaker builds is that commit.

 I hope this information helps to fix this (if not already done).

 Best,
 Vladislav


Re: [Pacemaker] Resource Group Questions - Start/Stop Order

2011-07-26 Thread Andrew Beekhof
On Thu, Jul 21, 2011 at 2:36 AM, Bobbie Lind bl...@sms-fed.com wrote:
 Hi group,

 I am running a 6-node system, 4 nodes of which mount the LUNs for my Lustre file
 system.  I currently have 29 LUNs per server set up in 4 resource groups.  I
 understand the default startup/shutdown order of the resources, but I was
 wondering if there is a way to override that and have all the resources in
 the group start up or shut down at the same time.  Ideally, what I am looking
 for is for all the resources in the group OSS1group to start up and shut down at
 the same time, since none of them are dependent on each other; they just
 belong on the same server.

I'd suggest just not using a group in this case.
If all you want is colocation, use a colocation set.
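
As a rough sketch of what a colocation set looks like in the CIB, the snippet
below builds an rsc_colocation constraint holding one resource_set. The
constraint and resource ids are hypothetical, and the element names should be
checked against the schema of the Pacemaker version in use. Colocation on its
own implies no ordering, so nothing here forces the members to start one after
another.

```python
import xml.etree.ElementTree as ET


def colocation_set(constraint_id, resource_ids, score="INFINITY"):
    """Build an rsc_colocation element holding a single resource_set."""
    constraint = ET.Element("rsc_colocation", id=constraint_id, score=score)
    rset = ET.SubElement(constraint, "resource_set", id=f"{constraint_id}-set")
    for rid in resource_ids:
        ET.SubElement(rset, "resource_ref", id=rid)
    return constraint


# Hypothetical resource ids standing in for the LUN mounts of one server.
element = colocation_set("coloc-oss1", ["oss1-lun01", "oss1-lun02", "oss1-lun03"])
print(ET.tostring(element, encoding="unicode"))
```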

 I found this thread here
 http://www.gossamer-threads.com/lists/linuxha/pacemaker/60893 which talks
 about non-ordered groups and I think that is what I need but I am at a loss
 as to how to find the parameters/attributes of the group to set it up.

 Is it possible to override the default action of the resource group's
 startup/shutdown order?  Can someone point me to some documentation where I
 can find the available parameters that can be set for groups?

 I have attached my configuration in case it's needed and I am running
 Pacemaker 1.0.11

 Bobbie Lind
 Systems Engineer
 Solutions Made Simple, Inc (SMSi)
 703-296-3087 (Cell)
 bl...@sms-fed.com




Re: [Pacemaker] Upgrading from 1.0 to 1.1

2011-07-26 Thread Andrew Beekhof
On Tue, Jul 19, 2011 at 5:40 PM, Proskurin Kirill
k.prosku...@corp.mail.ru wrote:
 On 07/19/2011 03:22 AM, Andrew Beekhof wrote:

 On Fri, Jul 15, 2011 at 10:33 PM, Proskurin Kirill
 k.prosku...@corp.mail.ru  wrote:

 Hello all.

 I found that I am using corosync with pacemaker ver: 0 with pacemaker 1.1.5
 installed, i.e. without starting pacemakerd.

 Sounds wrong. :-)
 So I tried to upgrade.
 I shut down one node and changed 0 to 1 in service.d/pcmk.
 I started corosync and then started pacemakerd via the init script.

 But this node stays online and on the cluster's DC I see:
 cib: [18392]: WARN: cib_peer_callback: Discarding cib_sync_one message
 (255)
 from mysender10.example.com: not in our membership

 That's odd.  The only thing you changed was ver: 0 to ver: 1?

 Yes, only this. To make it clearer: I have 4 nodes with ver 0, tried to
 add one with ver 1, and got this.

 Well, I shut down all nodes, changed them all to 1 and started them up, and all was OK.
 Not a really good way to upgrade, but I don't have time.

Do you still have the logs for the failure case?
I'd really like to see them.
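
The thread does not establish why the mixed setup failed, but checking that
every node carries the same ver value before starting is cheap. A minimal
sketch, assuming the service.d/pcmk files have been collected as text and
contain a "ver: N" line; the node names and file contents below are
hypothetical.

```python
import re


def pcmk_ver(config_text):
    """Return the integer after 'ver:' in a service.d/pcmk style file, or None."""
    match = re.search(r"^\s*ver:\s*(\d+)\s*$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def mixed_versions(configs_by_node):
    """Map each node to its ver value and report whether the values differ."""
    versions = {node: pcmk_ver(text) for node, text in configs_by_node.items()}
    return versions, len(set(versions.values())) > 1


# Hypothetical file contents collected from two nodes.
sample = {
    "node-a": "service {\n    name: pacemaker\n    ver: 0\n}\n",
    "node-b": "service {\n    name: pacemaker\n    ver: 1\n}\n",
}
versions, mixed = mixed_versions(sample)
print(versions, "mixed" if mixed else "consistent")
```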



Re: [Pacemaker] unexpected Error in Log files

2011-07-26 Thread Andrew Beekhof
On Tue, Jul 19, 2011 at 4:00 PM, rakesh rakirocker4...@gmail.com wrote:
 Hi


 I configured a cluster consisting of four nodes

 and started Heartbeat/pacemaker on all four.

 After some time the 4th node went down unexpectedly, and I found the
 following error messages while going through the log files (ha-debug and
 messages.log).
 Can you please help me with this?

 Please find the messages from the log file below.



 Jun 10 12:55:46 node4 ccm: 2011 Jun 10 12:55:46 PDT -0700 ccm: Cannot append to
 /var/log/ha-debug: File too large
 Jun 10 12:55:46 node4 stonithd: 2011 Jun 10 12:55:46 PDT -0700 stonithd: Cannot
 append to /var/log/ha-debug: File too large
 Jun 10 12:55:46 node4 last message repeated 6 times
 Jun 10 12:55:46 node4 cib: 2011 Jun 10 12:55:46 PDT -0700 cib: Cannot append to
 /var/log/ha-debug: File too large
 Jun 10 12:55:46 node4 cib: 2011 Jun 10 12:55:46 PDT -0700 cib: Cannot append to
 /var/log/ha-debug: File too large

You might want to do something about that.
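
The "File too large" messages suggest that /var/log/ha-debug has hit a file
size limit, so rotating or truncating it (for example with a tool such as
logrotate) is the likely remedy. As a minimal sketch for spotting such files
early, the helper below lists logs at or above a threshold; the 1 GiB limit is
an arbitrary assumption, not a value taken from the thread.

```python
import os


def oversized_logs(paths, limit_bytes):
    """Return (path, size) pairs for existing files at or above limit_bytes."""
    found = []
    for path in paths:
        try:
            size = os.path.getsize(path)
        except OSError:
            continue  # missing or unreadable file: nothing to report
        if size >= limit_bytes:
            found.append((path, size))
    return found


# The path comes from the log above; the 1 GiB threshold is an arbitrary choice.
for path, size in oversized_logs(["/var/log/ha-debug"], 1 << 30):
    print(f"{path}: {size} bytes, consider rotating it")
```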

 Jun 10 12:55:46 node4 ccm: 2011 Jun 10 12:55:46 [17142]: WARN:
 cib_peer_callback: Discarding cib_apply_diff message (181) from node1: not in
 our membership

All this says is that node1 left the cluster.
There is no way for us to know why based on this one line.
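
To see which peers such warnings refer to across a whole log, the lines can be
filtered with a regular expression. This is only a sketch: the pattern is
derived from the single message format quoted above, and wrapped log lines
(as in this mail) would need to be joined before matching.

```python
import re

# Pattern based on the message format quoted in this thread.
PATTERN = re.compile(
    r"Discarding (?P<op>\S+) message \((?P<id>\d+)\) from (?P<node>\S+): "
    r"not in our membership"
)


def discarded_senders(lines):
    """Return (node, operation) pairs for every matching log line."""
    return [(m["node"], m["op"]) for m in map(PATTERN.search, lines) if m]


line = (
    "Jun 10 12:55:46 node4 ccm: 2011 Jun 10 12:55:46 [17142]: WARN: "
    "cib_peer_callback: Discarding cib_apply_diff message (181) from node1: "
    "not in our membership"
)
print(discarded_senders([line]))
```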







Re: [Pacemaker] Reload action and stop/start sequence questions

2011-07-26 Thread Andrew Beekhof
On Mon, Jul 11, 2011 at 5:45 PM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Hi all,

 Would somebody (Andrew?) please shed some light on how exactly
 redefinition of a resource is supposed to be handled?

 Below is my (rather perfectionist) view of this; please correct me
 if/where I'm wrong:
 * If RA supports 'reload' action then it is called on resource
 definition change (instead of stop/start).

Only if the attribute changed was NOT marked as unique in the metadata.

 * If 'reload' action fails then usual start/stop sequence is executed.
 This would give a chance to RA to refuse to reload if some key
 properties change, while allowing it to tune some secondary resource
 parameters. Of course, RA should leave resource in a usable state, so
 failure of reload action should indicate RA's denial to do a reload. How
 to differentiate that from real reload failures?

Either way the resource needs to be restarted.  So there is no need
for differentiation.
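
The two answers so far amount to a simple decision rule, sketched below. The
function and parameter names are made up for illustration and this is not
Pacemaker's implementation: a reload is only considered when the agent
supports it and none of the changed parameters is marked unique, and a failed
reload ends in a restart either way.

```python
def plan_change(changed_params, unique_params, supports_reload):
    """Pick 'reload' or 'restart' for a definition change (None if unchanged)."""
    changed = set(changed_params)
    if not changed:
        return None
    if supports_reload and not (changed & set(unique_params)):
        return "reload"
    return "restart"


def apply_change(action, reload_fn, restart_fn):
    """Run the planned action; a failed reload falls back to a restart."""
    if action == "reload" and reload_fn():
        return "reloaded"
    if action in ("reload", "restart"):
        restart_fn()
        return "restarted"
    return "unchanged"


# Hypothetical agent with one unique parameter ("config") and reload support.
action = plan_change(["loglevel"], unique_params=["config"], supports_reload=True)
print(action, apply_change(action, reload_fn=lambda: False, restart_fn=lambda: None))
```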

 Is there some special
 exit code for that?
 * Dependent resources should not be stopped/started for 'reload' action.
 Of course they are restarted if reload fails and stop/start is executed
 then. (I see that they are restarted now for reload of a resource they
 depend on, is it a bug?)

More like a limitation.  Which is a roundabout way of saying a really
hard-to-fix bug.
You're welcome to create a BZ for it though, maybe one day I'll figure
out how to resolve it.

 * (wish) Resources should be migrated out of node (if they support live
 migration) for stop/start sequence of resource they depend on.

Migration can only occur if a resource is at the bottom (excluding any
clones) of the resource stack.
In order to migrate, any colocation dependencies need to be running at
_both_ the old and the new locations.

This can only be true for resources that depend on clones.

 * (wish) Redefinition of clones should be handled in a way which allows
 dependent live-migratable resources to survive (if reload action for
 clone instance either is not supported or fails).

This doesn't make sense.
If the definition of one clone changes, then they all change and there
is nowhere for dependent resources to migrate to.

 That is: dependent
 resources which support live migration are first migrated off
 one node if possible, and are stopped if the migration fails. Then the clone
 instance is restarted on that node. Then the same procedure is applied to the
 next cluster node, so resources may return to the first node.

 If above (at least first three points) is right, then is it possible to
 get a set of previous instance parameters the same way new configuration
 is passed (env vars), or RA should save that information itself in advance?

 Best,
 Vladislav



Re: [Pacemaker] Business Standard India - Errors noticed in pacemaker

2011-07-26 Thread Andrew Beekhof

 Message: 2
 Date: Tue, 26 Jul 2011 17:00:56 +1000
 From: Andrew Beekhof and...@beekhof.net
 To: Pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Cluster type is: corosync
 Message-ID:
 CAEDLWG1s=ahrdxdwdo0r04410j0+ygj7vfaz_yf_0fmdpin...@mail.gmail.com
 Content-Type: text/plain; charset=UTF-8

 On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill
 k.prosku...@corp.mail.ru wrote:
  25.07.2011 10:10, Andrew Beekhof wrote:
 
  Which packages are you using?
 
  It is your official source from repository I build.

 Ok. And did you add the pacemaker configuration options to corosync's
 config file?

  pacemaker-1.1.5
  corosync-1.4.0
  cluster-glue-1.0.6
  openais-1.1.2
 
  All nodes have same rpms.
 
  On Fri, Jul 22, 2011 at 7:47 PM, Proskurin Kirill
  k.prosku...@corp.mail.ru wrote:
 
  Hello again!
 
  Hope I'm not flooding too much here but I have another problem.
 
  I installed the same RPMs of corosync, openais, pacemaker and cluster_glue on
  all nodes. I checked it twice.
 
  But when I start some of them, they can't connect to the cluster and stay
  offline. In the logs I can see that they see the other nodes and that
  connectivity is OK. But I found the difference:
 
  Online nodes in the cluster have:
  [root@mysender39 ~]# grep 'Cluster type is' /var/log/corosync.log
  Jul 22 20:38:58 mysender39.mail.ru stonith-ng: [3499]: info:
  get_cluster_type: Cluster type is: 'openais'.
  Jul 22 20:38:58 mysender39.mail.ru attrd: [3502]: info: get_cluster_type:
  Cluster type is: 'openais'.
  Jul 22 20:38:58 mysender39.mail.ru cib: [3500]: info: get_cluster_type:
  Cluster type is: 'openais'.
  Jul 22 20:38:59 mysender39.mail.ru crmd: [3504]: info: get_cluster_type:
  Cluster type is: 'openais'.
 
  Offline nodes have:
  [root@mysender2 ~]# grep 'Cluster type is' /var/log/corosync.log
  Jul 22 13:39:17 mysender2.mail.ru stonith-ng: [9028]: info:
  get_cluster_type: Cluster type is: 'corosync'.
  Jul 22 13:39:17 mysender2.mail.ru attrd: [9031]: info: get_cluster_type:
  Cluster type is: 'corosync'.
  Jul 22 13:39:17 mysender2.mail.ru cib: [9029]: info: get_cluster_type:
  Cluster type is: 'corosync'.
  Jul 22 13:39:18 mysender2.mail.ru crmd: [9033]: info: get_cluster_type:
  Cluster type is: 'corosync'.
 
  What's wrong and how can I fix it?
 
  --
  Best regards,
  Proskurin Kirill
 



 --

 Message: 3
 Date: Tue, 26 Jul 2011 16:12:25 +0900
 From: Yuusuke IIDA iiday...@intellilink.co.jp
 To: pacemaker@oss pacemaker@oss.clusterlabs.org
 Cc: tanaka...@intellilink.co.jp
 Subject: [Pacemaker] Please teach it about handling of the unmanaged
 resource in environment setting placement-strategy.
 Message-ID: 4e2e68d9.6010...@intellilink.co.jp
 Content-Type: text/plain; charset=iso-2022-jp

 Hi, Yan
 Hi, Andrew

 I used the placement-strategy feature and found some worrying behavior.

 There is a node, act3, on which a resource that has become unmanaged is
 running.

 Then a resource that had been started on node act1 failed and was moved.

 I expected this failed resource to move to node sby1, but that did not
 happen.

 Is it the specified behavior that a resource with other capacity moves to the
 node whose capacity is already filled by the unmanaged resource?

 I would like this to be revised so that placement is decided taking the
 capacity of the unmanaged resource into account.

 I attach crm_report when a problem happened.

 Best Regards,
 Yuusuke
 --
 
 METRO SYSTEMS CO., LTD

 Yuusuke Iida
 Mail: iiday...@intellilink.co.jp
 
 -- next part --
 A non-text attachment was scrubbed...
 Name: pcmk-Tue-26-Jul-2011.tar.bz2
 Type: application/octet-stream
 Size: 164949 bytes
 Desc: not available
 URL: 
 http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20110726/66852f5b/attachment.obj
 

 --

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


 End of Pacemaker Digest, Vol 44, Issue 50
 *



Re: [Pacemaker] Initial quorum

2011-07-26 Thread Andrew Beekhof
On Thu, Jul 21, 2011 at 4:13 PM, pskrap psk...@hotmail.com wrote:
 Devin Reade gdr@... writes:


 --On Wednesday, July 20, 2011 09:19:33 AM + pskrap pskrap@...
 wrote:

  I have a cluster where some of the resources cannot run on the same node.
  All resources must be running to provide a functioning service. This
  means that a certain number of nodes needs to be up before it makes
  sense for the cluster to start any resources.

 Without knowing anything about your application, I would tend to question
 this statement.  Is it true that you must not start *any* resources before
 you have enough nodes, or is it sufficient to say that the application
 is not considered up until all resources are started?  It may not
 make sense to run any, but does it do any harm?

 If you *can* start at least some resources before all nodes are available,
 then I would expect that you could get by with defining colocation
 constraints to ensure that some resources don't run on the same nodes,
 perhaps augmenting things with some order constraints if necessary.

 If your applications die or do other horrible stuff when only some subset
 are running then I'd have a talk with your application developers
 as it sounds like a larger robustness problem.

 Devin


 No, there are no crash issues etc when all resources are not running. The
 application is just not usable until all resources are started.

 As for the harm, the resources which have constraints preventing them from
 running will fail,

Are you talking about constraints in the pacemaker config or some other kind?

 but I guess they will recover as more nodes are added. The
 harm is mostly in the fact that starting nodes one by one will cause the
 resources to be unevenly distributed over the nodes since everything will start
 on the nodes in the order they are installed. I know I can give a preferred
 node to a resource and allow it to relocate when it becomes available. However,
 this application provides a real-time service so I only want resources to
 relocate when it is absolutely necessary. Therefore I have given the resources
 a preferred node, but do not allow them to relocate when it becomes available.

 So I guess the overall harm is limited even though it exists. I was just
 looking for a cleaner startup for the system. Since you did not mention any way
 to do what my question was about, I assume it is currently not possible to do
 what I asked for. I do think such an option would be useful though. Logically
 it does not make sense for the cluster to be starting resources for an
 application before the cluster has enough nodes for the application to be able
 to run.





Re: [Pacemaker] Business Standard India - Errors related to pacemaker in server message file

2011-07-26 Thread Gururaj B Patil

 Today's Topics:

   1. Re: Business Standard India - Errors noticed in pacemaker
  (Andrew Beekhof)
   2. Re: Cluster type is: corosync (Andrew Beekhof)
   3. Please teach it about handling of the unmanaged resource in
  environment setting placement-strategy. (Yuusuke IIDA)


 --

 Message: 1
 Date: Tue, 26 Jul 2011 16:02:30 +1000
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Cc: M A Faruqui m.faru...@bsmail.in, Prafulla H Patil
 prafulla.pa...@bsmail.in, Bandana Roy bandana@bsmail.in
 Subject: Re: [Pacemaker] Business Standard India - Errors noticed in
 pacemaker
 Message-ID:
 caedlwg0exctnrjq8kduv8jdc3hztvsovb7gdnyv4mvppox_...@mail.gmail.com
 Content-Type: text/plain; charset=iso-8859-1

 Thought about googling the error?

 On Mon, Jul 25, 2011 at 9:17 PM, Gururaj B Patil gururaj.pa...@bsmail.in
 wrote:

  To support staff,
 
  We in Business Standard Ltd. use pacemaker as the clustering application for
  one of our websites. Two servers are in clustering mode.
 
  One of the servers is a web server and the other is a MySQL DB server.
  Pacemaker handles MySQL clustering at block level.
 
  We have noticed the same type of notice and warning in the server's message
  file. The errors are as below.
 
  ---
  Messages like below appear every 15 minutes
 
  sidrbd0 pengine: [3143]: notice: get_failcount: Failcount for sipdu1 on
  sidrbd0 has expired (limit was 20s)
 
  sidrbd0 pengine: [3143]: ERROR: create_notification_boundaries: Creating
  boundaries for mysql-ms-drbd
 
  ---

 
  I have registered for pacemaker mailing list also.
 
  Regards,
  Gururaj Patil
  Systems Department
  Business Standard Ltd.
  H3/4, Paragon center,
  P.B.Marg,
  Worli
  Mumbai - 400013
  India
 
  Ph.+91-22-24971924
  --
 
 
 

 --

 Message: 2
 Date: Tue, 26 Jul 2011 17:00:56 +1000
 From: Andrew Beekhof and...@beekhof.net
 To: Pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Cluster type is: corosync
 Message-ID:
 CAEDLWG1s=ahrdxdwdo0r04410j0+ygj7vfaz_yf_0fmdpin...@mail.gmail.com
 Content-Type: text/plain; charset=UTF-8

 On Mon, Jul 25, 2011 at 7:18 PM, Proskurin Kirill
 k.prosku...@corp.mail.ru wrote:
  25.07.2011 10:10, Andrew Beekhof wrote:
 
  Which packages are you using?
 
  It is your official source from repository I build.

 Ok. And did you add the pacemaker configuration options to corosync's
 config file?

  pacemaker-1.1.5
  corosync-1.4.0
  cluster-glue-1.0.6
  openais-1.1.2
 
  All nodes have same rpms.
 
  On Fri