Re: [Pacemaker] Cluster crash

2012-03-29 Thread Andrew Beekhof
On Tue, Mar 27, 2012 at 9:36 PM, Hugo Deprez hugo.dep...@gmail.com wrote:
 Hello,

 I do have no-quorum-policy=ignore

What about fencing?

 Any idea how to reproduce this ?
 Regards,

 On 23 February 2012 23:43, Andrew Beekhof and...@beekhof.net wrote:

 On Thu, Feb 23, 2012 at 9:17 PM, Hugo Deprez hugo.dep...@gmail.com
 wrote:
  I don't think so, as I have other similar clusters on the same network and
  didn't have any issues.
  The only thing I could detect was that the virtual machine was
  unresponsive.
  But I think the VM crash was not like a power shutdown; it was more like it
  became very slow and then totally crashed.
 
  Even if the drbd-nagios resource times out, it should fail over to the
  other node, no?

 Not if no-quorum-policy wasn't set to ignore, or if we couldn't
 confirm the service is safely stopped on the original location.
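
 For a two-node setup that usually means something like the following (a
 minimal sketch with placeholder device parameters, not taken from your
 configuration):

     crm configure property no-quorum-policy="ignore"
     crm configure property stonith-enabled="true"
     # plus a working fencing device, e.g. (hypothetical IPMI details):
     crm configure primitive st-node1 stonith:external/ipmi \
             params hostname="node1" ipaddr="192.0.2.1" userid="admin" passwd="secret" \
             op monitor interval="60s"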


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] VirtualDomain Shutdown Timeout

2012-03-29 Thread Andrew Beekhof
On Sun, Mar 25, 2012 at 6:27 AM, Andrew Martin amar...@xes-inc.com wrote:
 Hello,

 I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and
 Heartbeat 3.0.5 on Ubuntu 10.04 Server using DRBD as the storage device (so
 there is no shared storage, no live-migration):
 primitive p_vm ocf:heartbeat:VirtualDomain \
         params config=/vmstore/config/vm.xml \
         meta allow-migrate=false \
         op start interval=0 timeout=180s \
         op stop interval=0 timeout=120s \
         op monitor interval=10 timeout=30

 I would expect the following events to happen on failover on the from node
 (the migration source) if the VM hangs while shutting down:
 1. VirtualDomain issues virsh shutdown vm to gracefully shutdown the VM
 2. pacemaker waits 120 seconds for the timeout specified in the op stop
 timeout
 3. VirtualDomain waits a bit less than 120 seconds to see if it will
 gracefully shutdown. Once it gets to almost 120 seconds, it issues virsh
 destroy vm to hard stop the VM.
 4. pacemaker wakes up from the 120 second timeout and sees that the VM has
 stopped and proceeds with the failover

 However, I observed that VirtualDomain seems to be using the timeout from
 the op start line, 180 seconds, yet pacemaker uses the 120 second timeout.
 Thus, the VM is still running after the pacemaker timeout is reached and so
 the node is STONITHed. Here is the relevant section of code from
 /usr/lib/ocf/resource.d/heartbeat/VirtualDomain:
 VirtualDomain_Stop() {
     local i
     local status
     local shutdown_timeout
     local out ex

     VirtualDomain_Status
     status=$?

     case $status in
         $OCF_SUCCESS)
             if ! ocf_is_true $OCF_RESKEY_force_stop; then
                 # Issue a graceful shutdown request
                 ocf_log info "Issuing graceful shutdown request for domain ${DOMAIN_NAME}."
                 virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME}
                 # The "shutdown_timeout" we use here is the operation
                 # timeout specified in the CIB, minus 5 seconds
                 shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 ))
                 # Loop on status until we reach $shutdown_timeout
                 while [ $NOW -lt $shutdown_timeout ]; do

 Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the
 op stop ... line?

It should, however there was a bug in 1.1.6 where this wasn't the case.
The relevant patch is:
  https://github.com/beekhof/pacemaker/commit/fcfe6fe

Or you could try 1.1.7
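
For reference, with the primitive above the stop operation is supposed to hand
the agent a value derived from the 120s stop timeout (a sketch of the
arithmetic only):

    # op stop timeout=120s  ->  OCF_RESKEY_CRM_meta_timeout=120000 (milliseconds)
    # so the deadline the RA computes is roughly:
    #   shutdown_timeout = NOW + 120000/1000 - 5 = NOW + 115 seconds
    # i.e. "virsh destroy" should fire about 5 seconds before pacemaker's own
    # stop timeout expires.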


 How can I optimize my pacemaker configuration so that the VM will attempt to
 gracefully shutdown and then at worst case destroy the VM before the
 pacemaker timeout is reached? Moreover, is there anything I can do inside of
 the VM (another Ubuntu 10.04 install) to optimize/speed up the shutdown
 process?

 Thanks,

 Andrew



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] unable to join cluster

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 22, 2012 at 3:07 PM, Hisashi Osanai
osanai.hisa...@jp.fujitsu.com wrote:

 Hello,

 I have a three-node cluster using pacemaker/corosync. When I reboot one node,
 the node is unable to join the cluster. I see that kind of split brain 10-20%
 of the time (recall ratio) when I shut down a node.

 What do you think of this problem?

It depends on whether corosync sees all three nodes (in which case it's a
pacemaker problem); if not, it's a corosync problem.
There are newer versions of both, perhaps try an upgrade?
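
To tell which it is, check on each node whether corosync itself sees all the
members, e.g. (corosync 1.x; treat the objdb key name as an assumption from
memory):

    corosync-cfgtool -s                                  # ring status
    corosync-objctl runtime.totem.pg.mrp.srp.members     # current membership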


 My questions are:
 - Is this a known problem?
 - Is there any workaround to avoid it?
 - How can I solve this problem?

 [testserver001]
 
 Last updated: Sat Mar 10 14:18:49 2012
 Stack: openais
 Current DC: NONE
 3 Nodes configured, 3 expected votes
 4 Resources configured.
 

 OFFLINE: [ testserver001 testserver002 testserver003 ]


 Migration summary:

 [testserver002]
 
 Last updated: Sat Mar 10 14:15:17 2012
 Stack: openais
 Current DC: testserver002 - partition with quorum
 Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
 3 Nodes configured, 3 expected votes
 4 Resources configured.
 

 Online: [ testserver002 testserver003 ]
 OFFLINE: [ testserver001 ]

  Resource Group: testgroup
     testrsc     (lsb:testmgr):   Started testserver002
 stonith-testserver002        (stonith:external/ipmi):        Started
 testserver003
 stonith-testserver003        (stonith:external/ipmi):        Started
 testserver002
 stonith-testserver001        (stonith:external/ipmi):        Started
 testserver003

 Migration summary:
 * Node testserver003:
 * Node testserver002:

 [testserver003]
 
 Last updated: Sat Mar 10 14:19:07 2012
 Stack: openais
 Current DC: testserver002 - partition with quorum
 Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
 3 Nodes configured, 3 expected votes
 4 Resources configured.
 

 Online: [ testserver002 testserver003 ]
 OFFLINE: [ testserver001 ]

  Resource Group: testgroup
     testrsc     (lsb:testmgr):   Started testserver002
 stonith-testserver002        (stonith:external/ipmi):        Started
 testserver003
 stonith-testserver003        (stonith:external/ipmi):        Started
 testserver002
 stonith-testserver001        (stonith:external/ipmi):        Started
 testserver003

 Migration summary:
 * Node testserver003:
 * Node testserver002:

 - Checked information
  + https://bugzilla.redhat.com/show_bug.cgi?id=525589
    It looks like the packages I used already support this.
  + http://comments.gmane.org/gmane.linux.highavailability.user/36101
    I checked the entries in /etc/hosts but didn't find any wrong entry.
    ===
    127.0.0.1 testserver001 localhost
    ::1             localhost6.localdomain6 localhost6
    ===

 - Looking at this with tcpdump:
  OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends
 MESSAGE_TYPE_MCAST.
           I took the capture in a VMware environment.

    + MESSAGE_TYPE_ORF_TOKEN
      No.     Time                       Source                Destination
 Protocol Length Info
          119 2012-03-19 22:00:15.250310 172.27.4.1            172.27.4.2
 UDP      112    Source port: 23489  Destination port: 23490

      Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
      Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst:
 Vmware_8e:74:92 (00:0c:29:8e:74:92)
      Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst:
 172.27.4.2 (172.27.4.2)
      User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490
 (23490)
      Data (70 bytes)

        00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00
 ...
      0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b
 
      (snip)

    + MESSAGE_TYPE_MCAST
      No.     Time                       Source                Destination
 Protocol Length Info
         5141 2012-03-19 22:01:19.198346 172.27.4.2            226.94.16.16
 UDP      1486   Source port: 23489  Destination port: 23490

      Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured
 (11888 bits)
      Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst:
 IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
      Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst:
 226.94.16.16 (226.94.16.16)
      User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490
 (23490)
      Data (1444 bytes)

        01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b
 ...
      0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b
 
      (snip)

  NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
           see the message in pacemaker.log.

    + MESSAGE_TYPE_ORF_TOKEN
      No.     Time                       Source                Destination
 Protocol Length Info
         39605 2012-03-10 14:18:13.826778 172.27.4.2            172.27.4.3
 UDP      112    Source port: 23489  Destination port: 23490

      Frame 39605: 112 bytes 

Re: [Pacemaker] [Problem] The cluster fails in the stop of the node.

2012-03-29 Thread Andrew Beekhof
This appears to be resolved with 1.1.7, perhaps look for a patch to backport?

On Tue, Mar 27, 2012 at 4:46 PM,  renayama19661...@ybb.ne.jp wrote:
 Hi All,

 When we put a group resource inside a Master/Slave resource, we found a
 problem where a node could not stop.

 This problem occurs in Pacemaker1.0.11.

 We confirmed a problem in the following procedure.

 Step1) Start all nodes.

 
 Last updated: Tue Mar 27 14:35:16 2012
 Stack: Heartbeat
 Current DC: test2 (b645c456-af78-429e-a40a-279ed063b97d) - partition WITHOUT 
 quorum
 Version: 1.0.12-unknown
 2 Nodes configured, unknown expected votes
 4 Resources configured.
 

 Online: [ test1 test2 ]

  Master/Slave Set: msGroup01
     Masters: [ test1 ]
     Slaves: [ test2 ]
  Resource Group: testGroup
     prmDummy1  (ocf::pacemaker:Dummy): Started test1
     prmDummy2  (ocf::pacemaker:Dummy): Started test1
  Resource Group: grpStonith1
     prmStonithN1       (stonith:external/ssh): Started test2
  Resource Group: grpStonith2
     prmStonithN2       (stonith:external/ssh): Started test1

 Migration summary:
 * Node test2:
 * Node test1:

 Step2) Stop Slave node.

 [root@test2 ~]# service heartbeat stop
 Stopping High-Availability services: Done.

 Step3) Stop the Master node. However, the Master node loops and does not
 stop.

 (snip)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: run_graph: Transition 3 
 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=23, 
 Source=/var/lib/pengine/pe-input-3.bz2): Terminated
 Mar 27 14:38:06 test1 crmd: [21443]: ERROR: te_graph_trigger: Transition 
 failed: terminated
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Graph 3 (30 actions 
 in 30 synapses): batch-limit=30 jobs, network-delay=6ms
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 0 is pending 
 (priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:     [Action 12]: 
 Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 14]: 
 Completed (id: testMsGroup01:0_demote_0, type: pseduo, priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 32]: 
 Pending (id: msGroup01_stop_0, type: pseduo, priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 1 is pending 
 (priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:     [Action 13]: 
 Pending (id: testMsGroup01:0_stopped_0, type: pseduo, priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 8]: 
 Pending (id: prmStateful1:0_stop_0, loc: test1, priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 9]: 
 Pending (id: prmStateful2:0_stop_0, loc: test1, priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 12]: 
 Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
 Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 2 was 
 confirmed (priority: 0)
 (snip)

 I attach data of hb_report.

 Best Regards,
 Hideo Yamauchi.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Question about Pacemaker mysql master/slave replication and DRBD replication

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 15, 2012 at 11:09 PM, coma coma@gmail.com wrote:
 Hello,

 I'm a new Pacemaker user and I'm trying to understand exactly what it can and
 can't do in the case of MySQL replication or DRBD replication.

 I have two MySQl servers, for the moment with a simple Master/Slave
 replication, my need is to implement a high availability system with
 automated IP and DB fail-over / fail-back (including replication).

 I would like to have one node designated as master and, in case of failure,
 automatically promote the slave node to master, and then do the reverse
 operation automatically when the initial master becomes available again.

 I have compared several solutions and according to my needs (i have two
 servers only and i don't want / don't need use solutions like MySQL Cluster
 which requires 4 servers), Pacemaker seems the best solution in my case.

 I have to choose between Pacemaker with MySQL replication or Pacemaker with
 DRBD replication, but it's difficult to find clear explanations and
 documentation about the two solutions, so I have some questions about them.
 If anyone can enlighten me, thanks in advance!


 In the Pacemaker + MySQL replication case:

 I know pacemaker is able to do IP failover, and it seems DB failover too(?),
 but what types of failure can pacemaker detect?
 I know it is able to detect a network failure (node unreachable), but can it
 detect a MySQL service failure and promote the slave to master?

Yes.  We call the mysql resource agent and react to any failures it detects.

 Example: the master node is reachable, but the database is not (mysql service
 stopped, access failing due to too many connections, disk full, database
 readable but write errors...)?

 Can pacemaker (natively) do the reverse operation automatically, when the
 initial master node or MySQL DB becomes available again?

Yes. This is what the http://www.mysqlperformanceblog.com blog is talking about.
Pacemaker doesn't actually understand what it is managing, the details
are hidden by the RA scripts.
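
For illustration, a replicated mysql resource usually looks something like the
sketch below (parameter names are from the ocf:heartbeat:mysql metadata as I
recall them, and all values are placeholders):

    crm configure primitive p_mysql ocf:heartbeat:mysql \
            params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" \
                   replication_user="repl" replication_passwd="secret" \
            op monitor interval="20s" role="Slave" \
            op monitor interval="10s" role="Master"
    crm configure ms ms_mysql p_mysql \
            meta master-max="1" clone-max="2" notify="true"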

 In this case, can it manage the replication? And if not, can I use my own
 shell script to do it?

 Also, I was browsing the mailing list archives and saw the Percona
 Replication Manager solution
 (http://www.mysqlperformanceblog.com/2011/11/29/percona-replication-manager-a-solution-for-mysql-high-availability-with-replication-using-pacemaker/);
 has somebody already implemented it in a production environment? Is it
 reliable?

I've not used it myself but I hear good things.



 In the Pacemaker +   DRBD replication case:

 I understand that pacemaker and drbd work very well together and drbd is a
 good solution for mysql replication.
 In case of master (active node) failure, pacemaker+DRBD automatically promote
 the slave (passive node) to master, and I have read that drbd can handle the
 replication back by itself when the master node is available again?
 Can you enlighten me a little more about that?

 I also read that it is recommended to have a dedicated network for drbd
 replication, but can I do without one? I don't write a lot to my databases,
 and read much more, so replication will not be a big load.

 The big problem I have with DRBD is that I work on RHEL5 and I have read
 that I will have to recompile DRBD after each kernel update (there are not a
 lot of updates, but still some);
 is it possible to avoid this? (CentOS DRBD packages maybe?)

 Has somebody already had problems with RHEL updates and DRBD/Pacemaker?

RHEL5 is getting on a bit, its version of glib is too old to even
build pacemaker there anymore.
What about using RHEL6 which even ships with pacemaker?


 Thank you in advance for any response, and sorry if my english is not very
 good.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [PATCH] pingd checks pidfile on start

2012-03-29 Thread Andrew Beekhof
Any chance you could redo this as a github pull request? :-D

On Wed, Mar 14, 2012 at 6:49 PM, Takatoshi MATSUO matsuo@gmail.com wrote:
 Hi

 I use pacemaker 1.0.11 and the pingd RA.
 Occasionally, pingd's first monitor fails after start.

 It seems the main cause is that the pingd daemon returns 0 before creating
 its pidfile, and the RA doesn't check for the pidfile on start.

 test script
 -
 while true; do
    killall pingd; sleep 3
    rm -f /tmp/pingd.pid; sleep 1
    /usr/lib64/heartbeat/pingd -D -p /tmp/pingd.pid -a ping_status -d 0 -m 100 -h 192.168.0.1
    echo $?
    ls /tmp/pingd.pid; sleep .1
    ls /tmp/pingd.pid
 done
 -

 result
 -
 0
 /tmp/pingd.pid
 /tmp/pingd.pid
 0
 ls: cannot access /tmp/pingd.pid:  No such file or directory   - NG
 /tmp/pingd.pid
 0
 /tmp/pingd.pid
 /tmp/pingd.pid
 0
 /tmp/pingd.pid
 /tmp/pingd.pid
 0
 /tmp/pingd.pid
 /tmp/pingd.pid
 0
 ls: cannot access /tmp/pingd.pid: No such file or directory   - NG
 /tmp/pingd.pid
 --

 Please consider the attached patch for pacemaker-1.0.
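
 The idea is roughly the following kind of start-side wait (just a sketch, not
 the attached patch itself; variable names are illustrative):

     # after launching the daemon, wait for the pidfile instead of trusting
     # the exit code alone
     for i in 1 2 3 4 5 6 7 8 9 10; do
         [ -f "$OCF_RESKEY_pidfile" ] && break
         sleep 1
     done
     [ -f "$OCF_RESKEY_pidfile" ] || return $OCF_ERR_GENERIC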

 Regards,
 Takatoshi MATSUO


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] problem: stonith executes stonith command remote on the dead host and not local

2012-03-29 Thread Andrew Beekhof
On Tue, Mar 6, 2012 at 5:12 AM, Thomas Boernert t...@tbits.net wrote:
 Hi again,

 Should I open a bug report about this issue?

Which distro is this?  Any chance to try 1.1.7?


 Thanks

 Thomas

 Thomas Börnert wrote on 02.03.2012 at 12:06:


 Hi List,

 my problem is that stonith executes the fencing command on the remote dead
 host and not on the local machine :-(. This ends with a timeout.

 some facts:
 - 2-node cluster with 2 Dell servers
 - each server has its own DRAC card
 - pacemaker 1.1.6
 - heartbeat 3.0.4
 - corosync 1.4.1

 node1 should fence node2 if node2 is dead, and
 node2 should fence node1 if node1 is dead.

 It works fine when run manually with the stonith script fence_drac5.

 my config
 -- snip --
 node node1 \
         attributes standby=off
 node node2 \
         attributes standby=off
 primitive httpd ocf:heartbeat:apache \
         params configfile=/etc/httpd/conf/httpd.conf port=80 \
         op start interval=0 timeout=60s \
         op monitor interval=5s timeout=20s \
         op stop interval=0 timeout=60s
 primitive node1-stonith stonith:fence_drac5 \
         params ipaddr=192.168.1.101 login=root passwd=1234 action=reboot secure=true cmd_prompt=admin1- power_wait=300 pcmk_host_list=node1
 primitive node2-stonith stonith:fence_drac5 \
         params ipaddr=192.168.1.102 login=root passwd=1234 action=reboot secure=true cmd_prompt=admin1- power_wait=300 pcmk_host_list=node2
 primitive nodeIP ocf:heartbeat:IPaddr2 \
         op monitor interval=60 timeout=20 \
         params ip=192.168.1.10 cidr_netmask=24 nic=eth0:0 broadcast=192.168.1.255
 primitive nodeIParp ocf:heartbeat:SendArp \
         params ip=192.168.1.10 nic=eth0:0
 group WebServices nodeIP nodeIParp httpd
 location node1-stonith-log node1-stonith -inf: node1
 location node2-stonith-log node2-stonith -inf: node2
 property $id=cib-bootstrap-options \
         dc-version=1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 \
         cluster-infrastructure=openais \
         expected-quorum-votes=2 \
         stonith-enabled=true \
         no-quorum-policy=ignore \
         last-lrm-refresh=1330685786
 -- snip --

 [root@node2 ~]# stonith_admin -l node1
  node1-stonith
 1 devices found

 It seems ok.

 Now I try:

 [root@node2 ~]# stonith_admin -V -F node1
 stonith_admin[5685]: 2012/03/02_13:00:44 debug: main: Create
 stonith_admin[5685]: 2012/03/02_13:00:44 debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_command
 stonith_admin[5685]: 2012/03/02_13:00:44 debug: get_stonith_token: Obtained registration token: 6258828b-4b19-472f-9256-8da36fe87962
 stonith_admin[5685]: 2012/03/02_13:00:44 debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_callback
 stonith_admin[5685]: 2012/03/02_13:00:44 debug: get_stonith_token: Obtained registration token: 6266ebb8-2112-4378-a00c-3eaff47c9a9d
 stonith_admin[5685]: 2012/03/02_13:00:44 debug: stonith_api_signon: Connection to STONITH successful
 stonith_admin[5685]: 2012/03/02_13:00:44 debug: main: Connect: 0
 Command failed: Operation timed out
 stonith_admin[5685]: 2012/03/02_13:00:56 debug: stonith_api_signoff: Signing out of the STONITH Service
 stonith_admin[5685]: 2012/03/02_13:00:56 debug: main: Disconnect: -8
 stonith_admin[5685]: 2012/03/02_13:00:56 debug: main: Destroy

 the log on node2 shows:

 --- snip ---
 Mar  2 13:00:58 node2 crmd: [2665]: info: te_fence_node: Executing reboot fencing operation (21) on node1 (timeout=6)
 Mar  2 13:00:58 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 3325df94-8d59-4c00-a37e-be31e79f7503
 Mar  2 13:00:58 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0
 --- snip ---

 Why remotely, on the dead host?

 Thanks

 Thomas

 the complete log

 --- snip ---
 Mar  2 13:00:44 node2 stonith_admin: [5685]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
 Mar  2 13:00:44 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation off for node1: 7d8beca4-1853-44fd-9bb2-4015b080c37b
 Mar  2 13:00:44 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0
 Mar  2 13:00:46 node2 stonith-ng: [2660]: ERROR: remote_op_query_timeout: Query 561e89af-6f5a-45cb-adc2-45389940f1db for node1 timed out
 Mar  2 13:00:46 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action reboot (561e89af-6f5a-45cb-adc2-45389940f1db) for node1 timed out
 Mar  2 13:00:46 node2 stonith-ng: [2660]: info: remote_op_done: Notifing clients of 561e89af-6f5a-45cb-adc2-45389940f1db (reboot of 

Re: [Pacemaker] how to using rules to control resource option?

2012-03-29 Thread Andrew Beekhof
Try reading Clusters from Scratch.

2012/3/26 sinchb sin...@163.com:
 How do I write the command when I want to use some rules to control a
 resource option (like resource-stickiness)? How do I add a rule and a score
 to the instance attribute?

 I am looking forward to your reply!




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Migration of lower resource causes dependent resources to restart

2012-03-29 Thread Vladislav Bogdanov
Hi Andrew, all,

Pacemaker restarts resources when a resource they depend on (ordering
only, no colocation) is migrated.

I mean that when I do crm resource migrate lustre, I get

LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left)
LogActions: Restart mgs#011(Started lustre01-left)

I only have one ordering constraint for these two resources:

order mgs-after-lustre inf: lustre:start mgs:start

This reminds me of the old behaviour with reload (dependent resources were
restarted when the lower resource was reloaded).

Shouldn't this be changed? Migration usually means that the service is not
interrupted...
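
(If the restart itself were the only concern, an advisory version of that
constraint (score 0 instead of inf) should avoid forcing mgs to restart, at
the cost of not enforcing the order during recovery; just a sketch for
comparison:)

order mgs-after-lustre 0: lustre:start mgs:start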

Best,
Vladislav

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Migration of lower resource causes dependent resources to restart

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 Pacemaker restarts resources when resource they depend on (ordering
 only, no colocation) is migrated.

 I mean that when I do crm resource migrate lustre, I get

 LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left)
 LogActions: Restart mgs#011(Started lustre01-left)

 I only have one ordering constraint for these two resources:

 order mgs-after-lustre inf: lustre:start mgs:start

 This reminds me what have been with reload in a past (dependent resource
 restart when lower resource is reloaded).

 Shouldn't this be changed? Migration usually means that service is not
 interrupted...

Is that strictly true?  Always?
My understanding was that although A thinks the migration happens
instantaneously, it is in fact more likely to be pause+migrate+resume,
and anyone trying to talk to A during that time is going to be
disappointed.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Resource Agent ethmonitor

2012-03-29 Thread Fiorenza Meini

On 21/03/2012 09:06, Florian Haas wrote:

On Tue, Mar 20, 2012 at 4:18 PM, Fiorenza Meini fme...@esseweb.eu wrote:

Hi there,
has anybody configured successfully the RA specified in the object of the
message?

I got this error: if_eth0_monitor_0 (node=fw1, call=2297, rc=-2,
status=Timed Out): unknown exec error


Your ethmonitor RA missed its 50-second timeout on the probe (that is,
the initial monitor operation). You should be seeing "Monitoring of
if_eth0 failed, X retries left" warnings in your logs. Grepping your
syslog for ethmonitor will probably turn up some useful results.
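
Something like the following, assuming syslog ends up in /var/log/syslog on
your distribution:

    grep -i ethmonitor /var/log/syslog | tail -n 50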

Cheers,
Florian



Thank you, I solved the problem.

Regards

--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Migration of lower resource causes dependent resources to restart

2012-03-29 Thread Vladislav Bogdanov
29.03.2012 09:35, Andrew Beekhof wrote:
 On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 Pacemaker restarts resources when resource they depend on (ordering
 only, no colocation) is migrated.

 I mean that when I do crm resource migrate lustre, I get

 LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left)
 LogActions: Restart mgs#011(Started lustre01-left)

 I only have one ordering constraint for these two resources:

 order mgs-after-lustre inf: lustre:start mgs:start

 This reminds me what have been with reload in a past (dependent resource
 restart when lower resource is reloaded).

 Shouldn't this be changed? Migration usually means that service is not
 interrupted...
 
 Is that strictly true?  Always?

This probably depends on implementation.
With qemu live migration - yes.
With pacemaker:Dummy (with meta allow-migrate=true) probably yes too...

 My understanding was although A thinks the migration happens
 instantaneously, it is in fact more likely to be pause+migrate+resume
 and during that time anyone trying to talk to A during that time is
 going to be disappointed.


 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Migration of lower resource causes dependent resources to restart

2012-03-29 Thread Vladislav Bogdanov
29.03.2012 09:43, Vladislav Bogdanov wrote:
 29.03.2012 09:35, Andrew Beekhof wrote:
 On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 Pacemaker restarts resources when resource they depend on (ordering
 only, no colocation) is migrated.

 I mean that when I do crm resource migrate lustre, I get

 LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left)
 LogActions: Restart mgs#011(Started lustre01-left)

 I only have one ordering constraint for these two resources:

 order mgs-after-lustre inf: lustre:start mgs:start

 This reminds me what have been with reload in a past (dependent resource
 restart when lower resource is reloaded).

 Shouldn't this be changed? Migration usually means that service is not
 interrupted...

 Is that strictly true?  Always?
 
 This probably depends on implementation.
 With qemu live migration - yes.
 With pacemaker:Dummy (with meta allow-migrate=true) probably yes too...
 

And if the RA just manages some external entity, then yes too
(although this case is probably not very common ;) ).

 My understanding was although A thinks the migration happens
 instantaneously, it is in fact more likely to be pause+migrate+resume
 and during that time anyone trying to talk to A during that time is
 going to be disappointed.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Migration of lower resource causes dependent resources to restart

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 5:43 PM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 29.03.2012 09:35, Andrew Beekhof wrote:
 On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 Pacemaker restarts resources when resource they depend on (ordering
 only, no colocation) is migrated.

 I mean that when I do crm resource migrate lustre, I get

 LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left)
 LogActions: Restart mgs#011(Started lustre01-left)

 I only have one ordering constraint for these two resources:

 order mgs-after-lustre inf: lustre:start mgs:start

 This reminds me what have been with reload in a past (dependent resource
 restart when lower resource is reloaded).

 Shouldn't this be changed? Migration usually means that service is not
 interrupted...

 Is that strictly true?  Always?

 This probably depends on implementation.
 With qemu live migration - yes.

So there will be no point at which, for example, pinging the VM's ip
address fails?

 With pacemaker:Dummy (with meta allow-migrate=true) probably yes too...

 My understanding was although A thinks the migration happens
 instantaneously, it is in fact more likely to be pause+migrate+resume
 and during that time anyone trying to talk to A during that time is
 going to be disappointed.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Migration of lower resource causes dependent resources to restart

2012-03-29 Thread Vladislav Bogdanov
29.03.2012 10:07, Andrew Beekhof wrote:
 On Thu, Mar 29, 2012 at 5:43 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 29.03.2012 09:35, Andrew Beekhof wrote:
 On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 Pacemaker restarts resources when resource they depend on (ordering
 only, no colocation) is migrated.

 I mean that when I do crm resource migrate lustre, I get

 LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left)
 LogActions: Restart mgs#011(Started lustre01-left)

 I only have one ordering constraint for these two resources:

 order mgs-after-lustre inf: lustre:start mgs:start

 This reminds me what have been with reload in a past (dependent resource
 restart when lower resource is reloaded).

 Shouldn't this be changed? Migration usually means that service is not
 interrupted...

 Is that strictly true?  Always?

 This probably depends on implementation.
 With qemu live migration - yes.
 
 So there will be no point at which, for example, pinging the VM's ip
 address fails?

Even existing connections are preserved.
Small delays during the last migration phase are still possible, but they
are minor (around 100-200 milliseconds while the context is switched
and the IP is announced from another node). And packets are not lost, just
delayed a bit.

I have corosync/pacemaker udpu clusters in VMs, and even corosync is
happy when the VM it runs on is migrating to another node (with some token
tuning).
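
(To check what token timeout is actually in effect on corosync 1.x, something
like the following can be used; the exact key names vary between versions, so
treat this as an assumption:)

    corosync-objctl -a | grep -i token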

 
 With pacemaker:Dummy (with meta allow-migrate=true) probably yes too...

 My understanding was although A thinks the migration happens
 instantaneously, it is in fact more likely to be pause+migrate+resume
 and during that time anyone trying to talk to A during that time is
 going to be disappointed.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] CIB not saved

2012-03-29 Thread Fiorenza Meini

Hi there,
a strange thing happened to my two-node cluster: I rebooted both machines 
at the same time, and when the OS came up again, no resources were configured 
anymore, as if it were a fresh installation. Why?
It was explained to me that the configuration of resources managed by 
pacemaker should be in a file called cib.xml, but I cannot find it on the 
system. Do I have to specify any particular option in the configuration file?


Thanks and regards
--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Issue with ordering

2012-03-29 Thread Vladislav Bogdanov
Hi Andrew, all,

I'm continuing my experiments with lustre on stacked drbd, and I see the
following problem:

I have one drbd resource (ms-drbd-testfs-mdt) stacked on top of
another (ms-drbd-testfs-mdt-left), with the following constraints
between them:

colocation drbd-testfs-mdt-with-drbd-testfs-mdt-left inf:
ms-drbd-testfs-mdt ms-drbd-testfs-mdt-left:Master
order drbd-testfs-mdt-after-drbd-testfs-mdt-left inf:
ms-drbd-testfs-mdt-left:promote ms-drbd-testfs-mdt:start

Then I have filesystem mounted on top of ms-drbd-testfs-mdt
(testfs-mdt resource).

colocation testfs-mdt-with-drbd-testfs-mdt inf: testfs-mdt
ms-drbd-testfs-mdt:Master
order testfs-mdt-after-drbd-testfs-mdt inf:
ms-drbd-testfs-mdt:promote testfs-mdt:start

When I trigger event which causes many resources to stop (including
these three), LogActions output look like:

LogActions: Stopdrbd-local#011(lustre01-left)
LogActions: Stopdrbd-stacked#011(Started lustre02-left)
LogActions: Stopdrbd-testfs-local#011(Started lustre03-left)
LogActions: Stopdrbd-testfs-stacked#011(Started lustre04-left)
LogActions: Stoplustre#011(Started lustre04-left)
LogActions: Stopmgs#011(Started lustre01-left)
LogActions: Stoptestfs#011(Started lustre03-left)
LogActions: Stoptestfs-mdt#011(Started lustre01-left)
LogActions: Stoptestfs-ost#011(Started lustre01-left)
LogActions: Stoptestfs-ost0001#011(Started lustre02-left)
LogActions: Stoptestfs-ost0002#011(Started lustre03-left)
LogActions: Stoptestfs-ost0003#011(Started lustre04-left)
LogActions: Stopdrbd-mgs:0#011(Master lustre01-left)
LogActions: Stopdrbd-mgs:1#011(Slave lustre02-left)
LogActions: Stopdrbd-testfs-mdt:0#011(Master lustre01-left)
LogActions: Stopdrbd-testfs-mdt-left:0#011(Master lustre01-left)
LogActions: Stopdrbd-testfs-mdt-left:1#011(Slave lustre02-left)
LogActions: Stopdrbd-testfs-ost:0#011(Master lustre01-left)
LogActions: Stopdrbd-testfs-ost-left:0#011(Master lustre01-left)
LogActions: Stopdrbd-testfs-ost-left:1#011(Slave lustre02-left)
LogActions: Stopdrbd-testfs-ost0001:0#011(Master lustre02-left)
LogActions: Stopdrbd-testfs-ost0001-left:0#011(Master lustre02-left)
LogActions: Stopdrbd-testfs-ost0001-left:1#011(Slave lustre01-left)
LogActions: Stopdrbd-testfs-ost0002:0#011(Master lustre03-left)
LogActions: Stopdrbd-testfs-ost0002-left:0#011(Master lustre03-left)
LogActions: Stopdrbd-testfs-ost0002-left:1#011(Slave lustre04-left)
LogActions: Stopdrbd-testfs-ost0003:0#011(Master lustre04-left)
LogActions: Stopdrbd-testfs-ost0003-left:0#011(Master lustre04-left)
LogActions: Stopdrbd-testfs-ost0003-left:1#011(Slave lustre03-left)

For some reason demote is not run on either mdt drbd resource (should
it be?), so the drbd RA prints a warning about that.

What I see then is that ms-drbd-testfs-mdt-left is stopped
before ms-drbd-testfs-mdt.

Moreover, the testfs-mdt filesystem resource is not stopped before
drbd-testfs-mdt is stopped.

I have advisory ordering constraints between the mdt and ost filesystem
resources, so all osts are stopped before the mdt. Thus the mdt stop is
delayed a bit. Maybe this influences what happens.

I'm pretty sure I have correct constraints for at least these three
resources, so it looks like a bug, because the mandatory ordering is not
preserved.

I can produce report for this.

Best,
Vladislav

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CIB not saved

2012-03-29 Thread Rasto Levrinc
On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini fme...@esseweb.eu wrote:
 Hi there,
 a strange thing happened to my two node cluster: I rebooted both machine at
 the same time, when s.o. went up again, no resources were configured
 anymore: as it was a fresh installation. Why ?
 It was explained to me that the configuration of resources managed by
 pacemaker should be in a file called cib.xml, but cannot find it in the
 system. Have I to specify any particular option in the configuration file?

Normally you shouldn't worry about it. cib.xml is stored in
/var/lib/heartbeat/crm/ or similar and the directory should have have
hacluster:haclient permissions. What distro is it and how did you install
it?
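
A quick check/fix sketch (the exact path depends on the distro and on whether
you run heartbeat or corosync):

    ls -ld /var/lib/heartbeat/crm
    chown -R hacluster:haclient /var/lib/heartbeat/crm
    chmod 750 /var/lib/heartbeat/crm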

Rasto

-- 
Dipl.-Ing. Rastislav Levrinc
rasto.levr...@gmail.com
Linux Cluster Management Console
http://lcmc.sf.net/

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Florian Haas
On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 I'm continuing experiments with lustre on stacked drbd, and see
 following problem:

At the risk of going off topic, can you explain *why* you want to do
this? If you need a distributed, replicated filesystem with
asynchronous replication capability (the latter presumably for DR),
why not use a Distributed-Replicated GlusterFS volume with
geo-replication?

Note that I know next to nothing about your actual detailed
requirements, so GlusterFS may well be non-ideal for you and my
suggestion may thus be moot, but it would be nice if you could explain
why you're doing this.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Vladislav Bogdanov
Hi Florian,

29.03.2012 11:54, Florian Haas wrote:
 On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 I'm continuing experiments with lustre on stacked drbd, and see
 following problem:
 
 At the risk of going off topic, can you explain *why* you want to do
 this? If you need a distributed, replicated filesystem with
 asynchronous replication capability (the latter presumably for DR),
 why not use a Distributed-Replicated GlusterFS volume with
 geo-replication?

I need a fast POSIX fs that scales to tens of petabytes, with support for
fallocate() and friends to prevent fragmentation.

I generally agree with Linus about FUSE and userspace filesystems in
general, so that is not an option.

Using any API except what the VFS provides via syscalls+glibc is not an
option either, because I need access to files from various scripting
languages including shell and directly from a web server written in C.
Having bindings for them all is a real overkill. And it all is in
userspace again.

So I generally have choice of CEPH, Lustre, GPFS and PVFS.

CEPH is still very alpha, so I can't rely on it, although I keep my eye
on it.

GPFS is not an option because it is not free and produced by IBM (can't
say which of these two is more important ;) )

Can't remember why exactly PVFS is a no-go, their site is down right
now. Probably userspace server implementation (although some examples
like nfs server discredit idea of in-kernel servers, I still believe
this is a way to go).

Lustre is widely deployed, predictable and stable. It fully runs in
kernel space. Although Oracle did its best to bury Lustre development,
it is actively developed by whamcloud and company. They have builds for
EL6, so I'm pretty happy with this. Lustre doesn't have any replication
built-in so I need to add it on a lower layer (no rsync, no rsync, no
rsync ;) ). DRBD suits my needs for a simple HA.

But I also need datacenter-level HA, that's why I evaluate stacked DRBD
and tickets with booth.

So, frankly speaking, I decided to go with Lustre not because it is so
cool (it has many-many niceties), but because all others I know do not
suit my needs at all due to various reasons.

Hope this clarifies my point,

Best,
Vladislav

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CIB not saved

2012-03-29 Thread Fiorenza Meini

On 29/03/2012 10:12, Rasto Levrinc wrote:

On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini fme...@esseweb.eu wrote:

Hi there,
a strange thing happened to my two node cluster: I rebooted both machine at
the same time, when s.o. went up again, no resources were configured
anymore: as it was a fresh installation. Why ?
It was explained to me that the configuration of resources managed by
pacemaker should be in a file called cib.xml, but cannot find it in the
system. Have I to specify any particular option in the configuration file?


Normally you shouldn't worry about it. cib.xml is stored in
/var/lib/heartbeat/crm/ or similar and the directory should have have
hacluster:haclient permissions. What distro is it and how did you install
it?

Rasto



Thanks, it was a permissions problem.

Regards
--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Pacemaker + Oracle

2012-03-29 Thread Ruwan Fernando
Hi,
I'm working with a Pacemaker active/passive cluster and need to use Oracle as
a resource in pacemaker. My resource command is
crm configureprimitive Oracle ocf:heartbeat:oracle params sid=OracleDB op
monitor inetrval=120s
but it did not work for me.

Can someone help out on this matter?

Regards,
Ruwan
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker + Oracle

2012-03-29 Thread emmanuel segura
cat /etc/oratab

And maybe you can post your log :-)
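
For reference, the command would normally look roughly like this (a sketch;
note that "interval" is misspelled as "inetrval" above, and "configure" and
"primitive" must be separate words):

    crm configure primitive Oracle ocf:heartbeat:oracle \
            params sid="OracleDB" \
            op monitor interval="120s"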

On 29 March 2012 at 13:53, Ruwan Fernando ruwanm...@gmail.com wrote:

 Hi,
 I'm working with Pacemaker Active Passive Cluster and need to use oracle
 as a resource to the pacemaker. my resource script is
 crm configureprimitive Oracle ocf:heartbeat:oracle params sid=OracleDB op
 monitor inetrval=120s
 but it is not worked for me.

 Can someone help out on this matter?

 Regards,
 Ruwan




-- 
this is my life and I live it for as long as God wills
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Florian Haas
On Thu, Mar 29, 2012 at 11:40 AM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Hi Florian,

 29.03.2012 11:54, Florian Haas wrote:
 On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 I'm continuing experiments with lustre on stacked drbd, and see
 following problem:

 At the risk of going off topic, can you explain *why* you want to do
 this? If you need a distributed, replicated filesystem with
 asynchronous replication capability (the latter presumably for DR),
 why not use a Distributed-Replicated GlusterFS volume with
 geo-replication?

 I need fast POSIX fs scalable to tens of petabytes with support for
 fallocate() and friends to prevent fragmentation.

 I generally agree with Linus about FUSE and userspace filesystems in
 general, so that is not an option.

I generally agree with Linus and just about everyone else that
filesystems shouldn't require invasive core kernel patches. But I
digress. :)

 Using any API except what VFS provides via syscalls+glibc is not an
 option too because I need access to files from various scripted
 languages including shell and directly from a web server written in C.
 Having bindings for them all is a real overkill. And it all is in
 userspace again.

 So I generally have choice of CEPH, Lustre, GPFS and PVFS.

 CEPH is still very alpha, so I can't rely on it, although I keep my eye
 on it.

 GPFS is not an option because it is not free and produced by IBM (can't
 say which of these two is more important ;) )

 Can't remember why exactly PVFS is a no-go, their site is down right
 now. Probably userspace server implementation (although some examples
 like nfs server discredit idea of in-kernel servers, I still believe
 this is a way to go).

Ceph is 100% userspace server side, jftr. :) And it has no async
replication capability at this point, which you seem to be after.

 Lustre is widely deployed, predictable and stable. It fully runs in
 kernel space. Although Oracle did its best to bury Lustre development,
 it is actively developed by whamcloud and company. They have builds for
 EL6, so I'm pretty happy with this. Lustre doesn't have any replication
 built-in so I need to add it on a lower layer (no rsync, no rsync, no
 rsync ;) ). DRBD suits my needs for a simple HA.

 But I also need datacenter-level HA, that's why I evaluate stacked DRBD
 and tickets with booth.

 So, frankly speaking, I decided to go with Lustre not because it is so
 cool (it has many-many niceties), but because all others I know do not
 suit my needs at all due to various reasons.

 Hope this clarifies my point,

It does. Doesn't necessarily mean I agree, but the point you're making is fine.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] VirtualDomain Shutdown Timeout

2012-03-29 Thread Andrew Martin
Hi Andrew, 


Thanks, that sounds good. I am using the Ubuntu HA ppa, so I will wait for a 
1.1.7 package to become available. 


Andrew 

- Original Message -

From: Andrew Beekhof and...@beekhof.net 
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org 
Sent: Thursday, March 29, 2012 1:08:21 AM 
Subject: Re: [Pacemaker] VirtualDomain Shutdown Timeout 

On Sun, Mar 25, 2012 at 6:27 AM, Andrew Martin amar...@xes-inc.com wrote: 
 Hello, 
 
 I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and 
 Heartbeat 3.0.5 on Ubuntu 10.04 Server using DRBD as the storage device (so 
 there is no shared storage, no live-migration): 
 primitive p_vm ocf:heartbeat:VirtualDomain \ 
 params config=/vmstore/config/vm.xml \ 
 meta allow-migrate=false \ 
 op start interval=0 timeout=180s \ 
 op stop interval=0 timeout=120s \ 
 op monitor interval=10 timeout=30 
 
 I would expect the following events to happen on failover on the from node 
 (the migration source) if the VM hangs while shutting down: 
 1. VirtualDomain issues virsh shutdown vm to gracefully shutdown the VM 
 2. pacemaker waits 120 seconds for the timeout specified in the op stop 
 timeout 
 3. VirtualDomain waits a bit less than 120 seconds to see if it will 
 gracefully shutdown. Once it gets to almost 120 seconds, it issues virsh 
 destroy vm to hard stop the VM. 
 4. pacemaker wakes up from the 120 second timeout and sees that the VM has 
 stopped and proceeds with the failover 
 
 However, I observed that VirtualDomain seems to be using the timeout from 
 the op start line, 180 seconds, yet pacemaker uses the 120 second timeout. 
 Thus, the VM is still running after the pacemaker timeout is reached and so 
 the node is STONITHed. Here is the relevant section of code from 
 /usr/lib/ocf/resource.d/heartbeat/VirtualDomain: 
 VirtualDomain_Stop() { 
 local i 
 local status 
 local shutdown_timeout 
 local out ex 
 
 VirtualDomain_Status 
 status=$? 
 
 case $status in 
 $OCF_SUCCESS) 
 if ! ocf_is_true $OCF_RESKEY_force_stop; then 
 # Issue a graceful shutdown request 
 ocf_log info Issuing graceful shutdown request for domain 
 ${DOMAIN_NAME}. 
 virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME} 
 # The shutdown_timeout we use here is the operation 
 # timeout specified in the CIB, minus 5 seconds 
 shutdown_timeout=$(( $NOW + 
 ($OCF_RESKEY_CRM_meta_timeout/1000) -5 )) 
 # Loop on status until we reach $shutdown_timeout 
 while [ $NOW -lt $shutdown_timeout ]; do 
 
 Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the 
 op stop ... line? 

It should, however there was a bug in 1.1.6 where this wasn't the case. 
The relevant patch is: 
https://github.com/beekhof/pacemaker/commit/fcfe6fe 

Or you could try 1.1.7 

 
 How can I optimize my pacemaker configuration so that the VM will attempt to 
 gracefully shutdown and then at worst case destroy the VM before the 
 pacemaker timeout is reached? Moreover, is there anything I can do inside of 
 the VM (another Ubuntu 10.04 install) to optimize/speed up the shutdown 
 process? 
 
 Thanks, 
 
 Andrew 
 
 
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] OCF_RESKEY_CRM_meta_{ordered,notify,interleave}

2012-03-29 Thread Andrew Beekhof
On Fri, Mar 30, 2012 at 1:47 AM, Florian Haas flor...@hastexo.com wrote:
 Lars (lmb), or Andrew -- maybe one of you remembers what this was all about.

 In this commit, Lars enabled the
 OCF_RESKEY_CRM_meta_{ordered,notify,interleave} attributes to be
 injected into the environment of RAs:
 https://github.com/ClusterLabs/pacemaker/commit/b0ba01f61086f073be69db3e6beb0914642f79d9

 Then that change was almost immediately backed out:
 https://github.com/ClusterLabs/pacemaker/commit/b33d3bf5376ab59baa435086c803b9fdaf6de504

Because it was felt that RAs shouldn't need to know.
Those options change pacemaker's behaviour, not the RAs'.

But subsequently, in lf#2391, you convinced us to add notify, since it
allowed the drbd agent to error out if notifications were not turned on.
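
(For readers following along, the kind of guard this makes possible looks
roughly like the sketch below; this is not the actual drbd agent code, and it
assumes the usual ocf-shellfuncs helpers have been sourced:)

  # Refuse to run without clone/master notifications enabled.
  if ! ocf_is_true "${OCF_RESKEY_CRM_meta_notify}"; then
      ocf_log err "This resource must be configured with notify=true."
      exit $OCF_ERR_CONFIGURED
  fi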


 And since then, at some point evidently only interleave and notify
 made it back in. Any specific reason for omitting ordered? I happen to
 have a pretty good use case for an ordered-clone RA, and it would be
 handy to be able to test whether clone ordering has been enabled.

I'd need more information.  The RA shouldn't need to care, I would have
thought. The ordering happens in the PE/crmd; the RA should just do
what it's told.

 All insights are much appreciated.

 Cheers,
 Florian

 --
 Need help with High Availability?
 http://www.hastexo.com/now

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CIB not saved

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 8:45 PM, Fiorenza Meini fme...@esseweb.eu wrote:
 Il 29/03/2012 10:12, Rasto Levrinc ha scritto:

 On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meinifme...@esseweb.eu  wrote:

 Hi there,
 a strange thing happened to my two-node cluster: I rebooted both machines
 at the same time, and when the OS came up again, no resources were configured
 any more, as if it were a fresh installation. Why?
 It was explained to me that the configuration of resources managed by
 pacemaker should be in a file called cib.xml, but I cannot find it in the
 system. Do I have to specify any particular option in the configuration
 file?


 Normally you shouldn't worry about it. cib.xml is stored in
 /var/lib/heartbeat/crm/ or similar, and the directory should have
 hacluster:haclient permissions. What distro is it and how did you install
 it?

 Rasto


 Thanks, it was a permissions problem.

Normally we log an error at startup if we can't write there... did
this not happen?
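
(For anyone hitting the same symptom, a quick way to check and repair the
ownership; this assumes a heartbeat-based install using the path mentioned
above, so adjust the directory for your distro and stack:)

  # Inspect the CIB directory and its contents.
  ls -ld /var/lib/heartbeat/crm
  ls -l /var/lib/heartbeat/crm/
  # Restore the expected ownership and directory mode.
  chown -R hacluster:haclient /var/lib/heartbeat/crm
  chmod 750 /var/lib/heartbeat/crm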

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 7:07 PM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Hi Andrew, all,

 I'm continuing experiments with lustre on stacked drbd, and see the
 following problem:

 I have one drbd resource (ms-drbd-testfs-mdt) stacked on top of
 another (ms-drbd-testfs-mdt-left), and have the following constraints
 between them:

 colocation drbd-testfs-mdt-with-drbd-testfs-mdt-left inf:
 ms-drbd-testfs-mdt ms-drbd-testfs-mdt-left:Master
 order drbd-testfs-mdt-after-drbd-testfs-mdt-left inf:
 ms-drbd-testfs-mdt-left:promote ms-drbd-testfs-mdt:start

 Then I have a filesystem mounted on top of ms-drbd-testfs-mdt
 (the testfs-mdt resource).

 colocation testfs-mdt-with-drbd-testfs-mdt inf: testfs-mdt
 ms-drbd-testfs-mdt:Master
 order testfs-mdt-after-drbd-testfs-mdt inf:
 ms-drbd-testfs-mdt:promote testfs-mdt:start

 When I trigger an event which causes many resources to stop (including
 these three), the LogActions output looks like:

 LogActions: Stop    drbd-local#011(lustre01-left)
 LogActions: Stop    drbd-stacked#011(Started lustre02-left)
 LogActions: Stop    drbd-testfs-local#011(Started lustre03-left)
 LogActions: Stop    drbd-testfs-stacked#011(Started lustre04-left)
 LogActions: Stop    lustre#011(Started lustre04-left)
 LogActions: Stop    mgs#011(Started lustre01-left)
 LogActions: Stop    testfs#011(Started lustre03-left)
 LogActions: Stop    testfs-mdt#011(Started lustre01-left)
 LogActions: Stop    testfs-ost#011(Started lustre01-left)
 LogActions: Stop    testfs-ost0001#011(Started lustre02-left)
 LogActions: Stop    testfs-ost0002#011(Started lustre03-left)
 LogActions: Stop    testfs-ost0003#011(Started lustre04-left)
 LogActions: Stop    drbd-mgs:0#011(Master lustre01-left)
 LogActions: Stop    drbd-mgs:1#011(Slave lustre02-left)
 LogActions: Stop    drbd-testfs-mdt:0#011(Master lustre01-left)
 LogActions: Stop    drbd-testfs-mdt-left:0#011(Master lustre01-left)
 LogActions: Stop    drbd-testfs-mdt-left:1#011(Slave lustre02-left)
 LogActions: Stop    drbd-testfs-ost:0#011(Master lustre01-left)
 LogActions: Stop    drbd-testfs-ost-left:0#011(Master lustre01-left)
 LogActions: Stop    drbd-testfs-ost-left:1#011(Slave lustre02-left)
 LogActions: Stop    drbd-testfs-ost0001:0#011(Master lustre02-left)
 LogActions: Stop    drbd-testfs-ost0001-left:0#011(Master lustre02-left)
 LogActions: Stop    drbd-testfs-ost0001-left:1#011(Slave lustre01-left)
 LogActions: Stop    drbd-testfs-ost0002:0#011(Master lustre03-left)
 LogActions: Stop    drbd-testfs-ost0002-left:0#011(Master lustre03-left)
 LogActions: Stop    drbd-testfs-ost0002-left:1#011(Slave lustre04-left)
 LogActions: Stop    drbd-testfs-ost0003:0#011(Master lustre04-left)
 LogActions: Stop    drbd-testfs-ost0003-left:0#011(Master lustre04-left)
 LogActions: Stop    drbd-testfs-ost0003-left:1#011(Slave lustre03-left)

 For some reason demote is not run on both mdt drbd resources (should it
 be?), so the drbd RA prints a warning about that.

So it's not just a logging error, the demote really isn't scheduled?
That would be bad; can you file a bug please?
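
(One way to confirm what the PE actually scheduled is to replay the saved
pengine input behind the LogActions above; the file name and path below are
only examples, taken from wherever your pengine logs point. crm_simulate ships
with 1.1 builds, and if yours does not read .bz2 files directly, decompress
the file first:)

  # Replay a saved transition and print the actions the PE would schedule.
  crm_simulate -S -x /var/lib/pengine/pe-input-123.bz2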


 What I see then is that there is an attempt to stop ms-drbd-testfs-mdt-left
 before ms-drbd-testfs-mdt.

 Moreover, the testfs-mdt filesystem resource is not stopped before
 drbd-testfs-mdt is stopped.

 I have advisory ordering constraints between the mdt and ost filesystem
 resources, so all osts are stopped before the mdt. Thus the mdt stop is
 delayed a bit. Maybe this influences what happens.

 I'm pretty sure I have the correct constraints for at least these three
 resources, so it looks like a bug, because the mandatory ordering is not
 preserved.

 I can produce report for this.

 Best,
 Vladislav

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Patch]Patch for crmd-transition-delay processing.

2012-03-29 Thread renayama19661014
Hi Andrew,

Thank you for comment.

 The patch makes sense, could you resend as a github pull request? :-D

All right!! I will send it once it is ready.
Please wait.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2012/3/29, Andrew Beekhof and...@beekhof.net wrote:

 The patch makes sense, could you resend as a github pull request? :-D
 
 On Thu, Mar 22, 2012 at 8:18 PM,  renayama19661...@ybb.ne.jp wrote:
  Hi All,
 
  Sorry
 
  My patch was wrong.
  I send a right patch.
 
  Best Regards,
  Hideo Yamauchi.
 
  --- On Thu, 2012/3/22, renayama19661...@ybb.ne.jp 
  renayama19661...@ybb.ne.jp wrote:
 
  Hi All,
 
  The crmd-transition-delay option exists to wait for attribute updates that
  arrive late.

  However, crmd cannot wait for the attribute properly, because the timer is
  not reset when a delayed attribute update arrives after the timer has
  already been started.

  As a result, the resource may not be placed correctly.

  I wrote a patch for Pacemaker 1.0.12.

  This patch blocks tengine processing while a crmd-transition-delay timer is
  set, so that tengine acts on the pengine's instructions only after the
  crmd-transition-delay timer has definitely expired.

  With this patch, the start of the resource may be delayed.
  However, it ensures that the resources are placed correctly according to
  the constraints.

   * I think a similar correction is necessary for the development version
  of Pacemaker.
 
  Best Regards,
  Hideo Yamauchi.
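
(For context, crmd-transition-delay is set as an ordinary cluster property; a
minimal example using the crm shell, with the value chosen arbitrarily:)

  crm configure property crmd-transition-delay="2s"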
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Problem] The cluster fails in the stop of the node.

2012-03-29 Thread renayama19661014
Hi Andrew,

 This appears to be resolved with 1.1.7, perhaps look for a patch to backport?

I will check the behaviour of Pacemaker 1.1.7,
and I will discuss the backport with Mr Mori.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2012/3/29, Andrew Beekhof and...@beekhof.net wrote:

 This appears to be resolved with 1.1.7, perhaps look for a patch to backport?
 
 On Tue, Mar 27, 2012 at 4:46 PM,  renayama19661...@ybb.ne.jp wrote:
  Hi All,
 
  When we put a group resource inside a Master/Slave resource, we found a
  problem where a node could not stop.
 
  This problem occurs in Pacemaker1.0.11.
 
  We confirmed a problem in the following procedure.
 
  Step1) Start all nodes.
 
  
  Last updated: Tue Mar 27 14:35:16 2012
  Stack: Heartbeat
  Current DC: test2 (b645c456-af78-429e-a40a-279ed063b97d) - partition 
  WITHOUT quorum
  Version: 1.0.12-unknown
  2 Nodes configured, unknown expected votes
  4 Resources configured.
  
 
  Online: [ test1 test2 ]
 
   Master/Slave Set: msGroup01
      Masters: [ test1 ]
      Slaves: [ test2 ]
   Resource Group: testGroup
      prmDummy1  (ocf::pacemaker:Dummy): Started test1
      prmDummy2  (ocf::pacemaker:Dummy): Started test1
   Resource Group: grpStonith1
      prmStonithN1       (stonith:external/ssh): Started test2
   Resource Group: grpStonith2
      prmStonithN2       (stonith:external/ssh): Started test1
 
  Migration summary:
  * Node test2:
  * Node test1:
 
  Step2) Stop Slave node.
 
  [root@test2 ~]# service heartbeat stop
  Stopping High-Availability services: Done.
 
  Step3) Stop the Master node. However, the Master node loops and does not
  stop.
 
  (snip)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: run_graph: Transition 3 
  (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=23, 
  Source=/var/lib/pengine/pe-input-3.bz2): Terminated
  Mar 27 14:38:06 test1 crmd: [21443]: ERROR: te_graph_trigger: Transition 
  failed: terminated
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Graph 3 (30 actions 
  in 30 synapses): batch-limit=30 jobs, network-delay=6ms
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 0 is 
  pending (priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:     [Action 12]: 
  Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 14]: 
  Completed (id: testMsGroup01:0_demote_0, type: pseduo, priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 32]: 
  Pending (id: msGroup01_stop_0, type: pseduo, priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 1 is 
  pending (priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:     [Action 13]: 
  Pending (id: testMsGroup01:0_stopped_0, type: pseduo, priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 8]: 
  Pending (id: prmStateful1:0_stop_0, loc: test1, priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 9]: 
  Pending (id: prmStateful2:0_stop_0, loc: test1, priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 12]: 
  Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
  Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 2 was 
  confirmed (priority: 0)
  (snip)
 
  I attach the data from hb_report.
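
(To help anyone reproducing this, a rough sketch of the kind of configuration
described above, in crm shell syntax; the resource names come from the
crm_mon and log output, but the RA choice and meta attributes are
assumptions:)

  primitive prmStateful1 ocf:pacemaker:Stateful
  primitive prmStateful2 ocf:pacemaker:Stateful
  group testMsGroup01 prmStateful1 prmStateful2
  ms msGroup01 testMsGroup01 \
      meta master-max="1" master-node-max="1" clone-max="2" notify="true"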
 
  Best Regards,
  Hideo Yamauchi.
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Nodes not rejoining cluster

2012-03-29 Thread Gregg Stock
I had a circuit breaker go out and take two of the 5 nodes in my cluster 
down. Now that they're back up and running, they are not rejoining the 
cluster.


Here is what I get from crm_mon -1

Nodes 1, 2 and 3 (itchy, scratchy and walter) show the following:

Last updated: Thu Mar 29 19:04:05 2012
Last change: Thu Mar 29 19:04:03 2012 via cibadmin on walter
Stack: openais
Current DC: walter - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
5 Nodes configured, 5 expected votes
9 Resources configured.


Online: [ itchy scratchy walter butthead timmy ]


On butthead I get


Last updated: Thu Mar 29 19:04:24 2012
Last change: Thu Mar 29 18:42:09 2012 via cibadmin on itchy
Stack: openais
Current DC: NONE
5 Nodes configured, 5 expected votes
9 Resources configured.


OFFLINE: [ itchy scratchy walter butthead timmy ]


On Timmy, I get


Last updated: Thu Mar 29 19:04:20 2012
Last change:
Current DC: NONE
0 Nodes configured, unknown expected votes
0 Resources configured.



I don't have anything important running yet, so I can do a full clean-up 
of everything if needed.


I also get some weird behavior with timmy. I brought this node up with 
the host name timmy.example.com and then changed the host name to timmy, 
but when the cluster is offline, timmy.example.com shows up as offline. I 
enter crm node delete timmy.example.com and it goes away, until timmy 
goes offline again.


Thanks,
Gregg Stock


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Nodes not rejoining cluster

2012-03-29 Thread Andrew Beekhof
Gotta have logs.  From all 3 nodes mentioned.
Only then can we determine if the problem is at the corosync or
pacemaker layer - which is the prerequisite for figuring out what to
do next :)
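
(A sketch of one way to gather those into a single archive; crm_report ships
with recent Pacemaker builds, and the time window, node list and destination
below are only examples:)

  # Run on one node; collects logs, CIB and status from the listed nodes.
  crm_report -f "2012-03-29 18:00" -t "2012-03-29 21:00" \
      -n "walter butthead timmy" /tmp/rejoin-report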

On Fri, Mar 30, 2012 at 1:30 PM, Gregg Stock gr...@damagecontrolusa.com wrote:
 I had a circuit breaker go out and take two of the 5 nodes in my cluster
 down. Now that their back up and running, they are not rejoining the
 cluster.

 Here is what I get from crm_mon -1

 node 1,2 and 3 itchy, scratchy and walter show the following:
 
 Last updated: Thu Mar 29 19:04:05 2012
 Last change: Thu Mar 29 19:04:03 2012 via cibadmin on walter
 Stack: openais
 Current DC: walter - partition with quorum
 Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
 5 Nodes configured, 5 expected votes
 9 Resources configured.
 

 Online: [ itchy scratchy walter butthead timmy ]


 On butthead I get

 
 Last updated: Thu Mar 29 19:04:24 2012
 Last change: Thu Mar 29 18:42:09 2012 via cibadmin on itchy
 Stack: openais
 Current DC: NONE
 5 Nodes configured, 5 expected votes
 9 Resources configured.
 

 OFFLINE: [ itchy scratchy walter butthead timmy ]


 On Timmy, I get

 
 Last updated: Thu Mar 29 19:04:20 2012
 Last change:
 Current DC: NONE
 0 Nodes configured, unknown expected votes
 0 Resources configured.
 


 I don't have anything important running yet. so I can do a full clean up of
 everything if needed.

 I also get some weird behavior with timmy. I brought this node up with the
 host name as timmy.example.com and I changed the host name to timmy but when
 the cluster is offline timmy.example.com shows up as offline. I enter crm
 node delete timmy.example.com and it goes away until timmy goes offline
 again.

 Thanks,
 Gregg Stock



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org