[Pacemaker] Cluster Refuses to Stop/Shutdown

2009-09-24 Thread Remi Broemeling




I posted this to the OpenAIS mailing list
(open...@lists.linux-foundation.org) yesterday but haven't received a
response, and on further reflection I think I chose the wrong list: it
seems to be far more about developer communication than user support.
So I'm re-trying here, as the archives show this list to be somewhat
more user-focused.

The problem is that corosync refuses to shut down in response to a
QUIT signal.  Given the cluster below (output of crm_mon):


Last updated: Wed Sep 23 15:56:24 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ boot1 boot2 ]

If I log into the host 'boot2' and issue the command "killall -QUIT
corosync", the anticipated result would be that boot2 goes offline
(out of the cluster) and all of the cluster processes
(corosync/stonithd/cib/lrmd/attrd/pengine/crmd) shut down.
However, that is not what occurs, and I don't really have any idea why.
After issuing "killall -QUIT corosync" on boot2, the result is a
split-brain:

From boot1's viewpoint:

Last updated: Wed Sep 23 15:58:27 2009
Stack: openais
Current DC: boot1 - partition WITHOUT quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ boot1 ]
OFFLINE: [ boot2 ]

From boot2's viewpoint:

Last updated: Wed Sep 23 15:58:35 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ boot1 boot2 ]

At this point the status quo holds until ANOTHER QUIT signal is sent to
corosync (i.e. "killall -QUIT corosync" is executed on boot2 again).
Then boot2 shuts down properly and everything appears to be kosher.
Basically, what I expect to happen after a single QUIT signal is
instead taking two QUIT signals, and that is my question: why does it
take two QUIT signals to force corosync to actually shut down?  Is that
the intended behavior?  From everything I have read online it seems
very strange, and it makes me think I have a problem in my
configuration(s), but I've no idea what that would be even after
playing with things and investigating for the day.
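
To make the sequence concrete, the reproduction boils down to the
following (a sketch of the exact commands as run on boot2; crm_mon -1
is just a one-shot status check, and the comments describe what I
observe):

    killall -QUIT corosync    # 1st signal: boot1 now shows boot2 OFFLINE...
    crm_mon -1                # ...but boot2 itself still reports both nodes online
    ps -ef | egrep 'corosync|stonithd|cib|lrmd|attrd|pengine|crmd'
                              # all of the daemons are still running on boot2
    killall -QUIT corosync    # 2nd signal: only now does boot2 actually shut down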

I would be very grateful for any guidance that could be provided, as at
the moment I seem to be at an impasse.

Log files, with debugging set to 'on', can be found at the following
pastebin locations:
    After first QUIT signal issued on boot2:
        boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
        boot2:/var/log/syslog: http://pastebin.com/d26fdfee
    After second QUIT signal issued on boot2:
        boot1:/var/log/syslog: http://pastebin.com/m755fb989
        boot2:/var/log/syslog: http://pastebin.com/m22dcef45

OS, Software Packages, and Versions:
    * two nodes, each running Ubuntu Hardy Heron LTS
    * ubuntu-ha packages, as downloaded from
      http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
        * pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
        * openais package version 1.0.0-3ubuntu1~hardy1
        * corosync package version 1.0.0-4ubuntu1~hardy2
        * heartbeat-common package version 2.99.2+sles11r9-5ubuntu1~hardy1

Network Setup:
    * boot1
        * eth0 is 192.168.10.192
        * eth1 is 172.16.1.1
    * boot2
        * eth0 is 192.168.10.193
        * eth1 is 172.16.1.2
    * boot1:eth0 and boot2:eth0 both connect to the same switch.
    * boot1:eth1 and boot2:eth1 are connected directly to each other
via a cross-over cable.
    * no firewalls are involved, and tcpdump shows the multicast and
UDP traffic flowing correctly over these links (see the quick check
sketched after this list).
    * I attempted a broadcast (rather than multicast) configuration, to
see if that would fix the problem.  It did not.
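
(For reference, the tcpdump check mentioned above was essentially the
following; udp port 5505 is the mcastport from the corosync.conf shown
further down:)

    tcpdump -ni eth1 udp port 5505    # totem traffic on the ring 0 (cross-over) link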

`crm configure show` output:
    node boot1
    node boot2
    property $id="cib-bootstrap-options" \
        dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

Contents of /etc/corosync/corosync.conf:
    # Please read the corosync.conf.5 manual page
    compatibility: whitetank

    totem {
        clear_node_high_bit: yes
        version: 2
        secauth: on
        threads: 1
        heartbeat_failures_allowed: 3
        interface {
            ringnumber: 0
            bindnetaddr: 172.16.1.0
            mcastaddr: 239.42.0.1
            mcastport: 5505
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.10.0

[Pacemaker] Low cost stonith device

2009-09-24 Thread Mario Giammarco
Hello,

Can you suggest a list of STONITH devices compatible with Pacemaker?

I need a low-cost one.

I also have another idea for building a low-cost STONITH device:

I have intelligent switches.  To STONITH a node, I can send a switch
the command to turn off all ethernet ports linked to the node to be
fenced.

So the node is powered on, but it cannot do any harm because it is
disconnected from the network.

Is it a good idea? How can I implement it?

Thanks in advance for any help.

Mario





Re: [Pacemaker] Low cost stonith device

2009-09-24 Thread Remi Broemeling




Hi Mario -- I was just looking into this myself today.  I think the
lowest-cost option you'll be able to find is an IPMI card for your
motherboard (as long as the motherboard in question supports it).  To
find out, you'll need to look into your specific model of motherboard
and see what is needed (or even whether it is possible).  IPMI cards
aren't foolproof and have their own problems, but they're better than
nothing, I would think.
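
If the board does turn out to have IPMI, a quick way to confirm that
out-of-band power control actually works is something along these lines
(a sketch; the BMC address and credentials are placeholders):

    ipmitool -I lanplus -H 192.168.0.250 -U admin -P secret chassis power status
    ipmitool -I lanplus -H 192.168.0.250 -U admin -P secret chassis power off

If that works, the corresponding Pacemaker resource would, as far as I
understand the external/ipmi plugin, look something like:

    crm configure primitive st-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.0.250 userid=admin \
        passwd=secret interface=lanplus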

I think your idea of fencing at the switch would be workable, but
exactly how to accomplish it depends on the switch in question, as the
commands (and whether the ability to issue them externally even exists)
are switch/manufacturer-dependent.
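
As a rough illustration (assuming a managed switch that allows SNMP
write access; the community string, switch address, and port index are
placeholders), "fencing" a node would boil down to administratively
disabling its ports:

    # IF-MIB::ifAdminStatus: 2 = down, 1 = up
    snmpset -v2c -c private 192.168.0.254 IF-MIB::ifAdminStatus.12 i 2   # cut the port
    snmpset -v2c -c private 192.168.0.254 IF-MIB::ifAdminStatus.12 i 1   # restore it later

If I remember the external STONITH plugin interface correctly, wrapping
commands like these in a small script under
/usr/lib/stonith/plugins/external/ that answers the standard actions
(gethosts, off, on, reset, status, and the getinfo-* queries) would let
Pacemaker drive it as a stonith resource.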

Of course, even lower cost (but even more problematic) is simply using
external/ssh to shut down the server to be killed, as that should be
free.
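
That one is roughly as simple as the following (a sketch; the node
names are placeholders, and external/ssh is really only suitable for
testing, since it can't fence a node that is hung or unreachable):

    crm configure property stonith-enabled=true
    crm configure primitive st-ssh stonith:external/ssh \
        params hostlist="node1 node2"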

See here for more documentation on STONITH:
http://www.clusterlabs.org/mediawiki/images/f/f2/Crm_fencing.pdf

Mario Giammarco wrote:
> Hello,
>
> Can you suggest a list of STONITH devices compatible with Pacemaker?
>
> I need a low-cost one.
>
> I also have another idea for building a low-cost STONITH device:
>
> I have intelligent switches.  To STONITH a node, I can send a switch
> the command to turn off all ethernet ports linked to the node to be
> fenced.
>
> So the node is powered on, but it cannot do any harm because it is
> disconnected from the network.
>
> Is it a good idea? How can I implement it?
>
> Thanks in advance for any help.
>
> Mario

-- 
Remi Broemeling
Sr System Administrator
Nexopia.com Inc.
direct: 780 444 1250 ext 435
email: r...@nexopia.com
fax: 780 487 0376

"I like you. You remind me of when I was young and stupid."
    -- Things you would love to say at work but can't



Re: [Pacemaker] Cluster Refuses to Stop/Shutdown

2009-09-24 Thread Remi Broemeling




I've spent all day working on this, even going so far as to completely
build my own set of packages from the Debian-available sources (which
appear to be different from the Ubuntu-available ones).  It didn't have
any effect on the issue at all: the cluster still freaks out and ends
up split-brained after a single SIGQUIT.

The Debian packages that also demonstrate this behavior were the
following versions:
    cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
    corosync_1.0.0-5~bpo50+1_i386.deb
    libcorosync4_1.0.0-5~bpo50+1_i386.deb
    libopenais3_1.0.0-4~bpo50+1_i386.deb
    openais_1.0.0-4~bpo50+1_i386.deb
    pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb

These packages were rebuilt (under Ubuntu Hardy Heron LTS) from the
*.diff.gz, *.dsc, and *.orig.tar.gz files available at
http://people.debian.org/~madkiss/ha-corosync, and as I said the
symptoms remain exactly the same, both with the configuration from my
original message and with the sample configuration that came with these
packages.  I also attempted the same with a single IP address resource
associated with the cluster, just to be sure this wasn't an edge case
for a cluster with no resources, but again that had no effect.
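
For anyone wanting to reproduce the rebuild, it was just the standard
Debian source-package dance, roughly (a sketch; the .dsc name below is
illustrative, taken from the versions listed above):

    dpkg-source -x corosync_1.0.0-5~bpo50+1.dsc    # unpacks .orig.tar.gz and applies .diff.gz
    cd corosync-1.0.0
    dpkg-buildpackage -rfakeroot -us -uc -b        # build unsigned binary .debs

plus the usual round of installing whatever build-dependencies
dpkg-buildpackage complains about.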

Basically I'm still exactly at the point that I was at yesterday
morning at about 0900.

Re: [Pacemaker] Cluster Refuses to Stop/Shutdown

2009-09-24 Thread Steven Dake
Remi,

Likely a defect.  We will have to look into it.  Please file a bug as
per instructions on the corosync wiki at www.corosync.org.

On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote:
> I've spent all day working on this; even going so far as to completely
> build my own set of packages from the Debian-available ones (which
> appear to be different than the Ubuntu-available ones).  It didn't
> have any effect on the issue at all: the cluster still freaks out and
> becomes a split-brain after a single SIGQUIT.
> 
> The debian packages that also demonstrate this behavior were the below
> versions:
> cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
> corosync_1.0.0-5~bpo50+1_i386.deb
> libcorosync4_1.0.0-5~bpo50+1_i386.deb
> libopenais3_1.0.0-4~bpo50+1_i386.deb
> openais_1.0.0-4~bpo50+1_i386.deb
> pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb
> 
> These packages were re-built (under Ubuntu Hardy Heron LTS) from the
> *.diff.gz, *.dsc, and *.orig.tar.gz files available at
> http://people.debian.org/~madkiss/ha-corosync, and as I said the
> symptoms remain exactly the same, both under the configuration that I
> list below and the sample configuration that came with these packages.
> I also attempted the same with a single IP Address resource associated
> with the cluster; just to be sure it wasn't an edge case for a cluster
> with no resources; but again that had no effect.
> 
> Basically I'm still exactly at the point that I was at yesterday
> morning at about 0900.
> 

Re: [Pacemaker] Cluster Refuses to Stop/Shutdown

2009-09-24 Thread Remi Broemeling




OK, thanks for the note, Steven.  I've filed the bug; it is #525589.
