On Thu, 2007-05-03 at 16:12 +0200, Dejan Muhamedagic wrote:

> On Thu, May 03, 2007 at 09:08:12AM -0400, Doug Knight wrote:
> > Thanks Dejan, I'll try the kill -9. One thing I'm seeing is that I can
> > easily move the resources between nodes using the <location> constraint,
> > but if I shut down heartbeat on one node (/etc/init.d/heartbeat stop) I
> > run into problems. If I shut down the node with the active resources,
> > heartbeat migrates the DRBD Master to the other node, but the colocated
> > group does not migrate (it remains stopped on the active node). I'm
> 
> That's no good. You should send logs/config.
> 

I've attached the output of cibadmin -Q and ha.cf; see below for a question
about the logs. I'm going to try flipping the nodes around and repeating the
shutdown of the "non-active" node.


> > digging into that now. If I shut down the node that does not have the
> > active resources, the following happens:
> > 
> > (State: DC on active node1, running drbd master and group resources)
> > shutdown node2
> > demote attempted on node1 for drbd master,
> 
> Why demote? The master is running on a good node.
> 

I don't know; that's just what I observed. I wondered why it would do a
demote when this node is already OK.

> > no attempt at halting the group
> > resources that depend on drbd
> 
> Why should the resources be stopped? You shut down a node which
> doesn't have any resources.
> 

Same here; I don't know why taking down the node without resources would
affect the other one. One thing I keep coming back to is the <location>
constraint and how it affects processing. Probably not an issue, but...
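
For reference, the two constraints in question look roughly like this in the
CIB (the resource names and ids below are placeholders, and the attribute
names are from memory for this heartbeat release; the real constraints are in
the attached cibadmin output):

  <rsc_location id="prefer_node1" rsc="ms_drbd">
    <rule id="prefer_node1_rule" score="INFINITY">
      <expression id="prefer_node1_expr" attribute="#uname"
         operation="eq" value="node1"/>
    </rule>
  </rsc_location>
  <rsc_colocation id="grp_with_drbd_master" from="pgsql_grp"
     to="ms_drbd" to_role="master" score="INFINITY"/>

The intent is that the group is only placed where the drbd master has been
promoted, with the location rule deciding which node that should normally be.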


> I'm not well versed in the master/slave business, so I can't comment
> further. But something seems to be very broken: either your config is
> wrong or you've run into a bug.
> 

I just completed some more testing, and it gets more interesting (and
probably supports your thought that the config might be broken somewhere):

Node1 - DC, active resources, <location> constraint to this node
- executed heartbeat shutdown on node1 - all resources, as well as the DC,
migrated to node2
- executed heartbeat startup on node1 - all resources migrated back to
node1, DC stayed on node2
- executed heartbeat shutdown again on node1 - all resources migrated to
node2
- executed heartbeat startup again on node1 - all resources migrated
back to node1, DC stayed on node2

However:
Node2 - DC, active resources, <location> constraint to this node
- executed heartbeat shutdown on node2 - all resources stopped, the shutdown
took about 14 minutes to complete, and all resources migrated to node1
- executed heartbeat startup on node2 - all resources stopped on node1, the
file system resource within the group flashed FAILED in crm_mon, the
drbd Master migrated back to node1, and the group resources stayed in the
stopped state
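
(To get out of that state I've just been stopping and restarting things by
hand, as in my earlier note below. I assume a targeted cleanup would also
work, something like the following, if crm_resource in this release supports
the -C/--cleanup, -r and -H options; the resource name here is a placeholder:)

  # clear the failed/stopped status of the filesystem resource on node1,
  # then let the CRM place it again
  crm_resource -C -r fs_resource -H node1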

So, there is definitely something different going on between the two
nodes. I'll attach the cibadmin -Q and ha.cf files (the diff between the two
nodes is minimal, just the IP addresses). Can you suggest an "optimal" way of
determining how much, or which part, of the ha.debug logs to capture? I am
attaching the log from node2, where the shutdown took ~14 minutes.
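
If trimming by time is a reasonable approach, I can cut a window out of the
full debug log by timestamp before compressing it, something like this (the
file name is whatever the debugfile directive in ha.cf points at, and the two
patterns are placeholders; I'd adjust them to however the timestamps actually
appear around the start and end of the shutdown):

  # keep only the lines between the start of the shutdown and its completion
  sed -n '/2007\/05\/03_16:20/,/2007\/05\/03_16:35/p' ha.debug \
      > node2_shutdown_window.log
  gzip node2_shutdown_window.log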


> > demote of drbd master fails with a "device held open" error; the
> > filesystem still has it mounted
> > it loops continuously trying to demote drbd (spin condition)
> > the shutdown command never completes; control-C, then kill -9 the main
> > heartbeat process on node1
> > drbd:0 goes Stopped, :1 Master goes FAILED, group resources all still
> > show Started
> > startup command executed on node1, Bad Things Happen, and eventually
> > drbd goes unmanaged
> > after node1 heartbeat startup completes, stop the group and drbd,
> > restart the resources, and everything comes up fine
> > 
> > I'm going to try a similar test, but using kill -9 right off the bat
> > instead of the controlled shutdown. If there's any info I need to
> > provide to make this clearer, please, anybody, just let me know.
> > 
> > Doug
> > 
> > On Thu, 2007-05-03 at 13:14 +0200, Dejan Muhamedagic wrote:
> > 
> > > On Fri, Apr 27, 2007 at 03:10:22PM -0400, Doug Knight wrote:
> > > > I now have a working configuration with DRBD master/slave, and a
> > > > filesystem/pgsql/ipaddr group following it around. So far, I've been
> > > > using a Place constraint and modifying its uname value to test the
> > > > "failover" of the resources. Can someone suggest a reasonable set of
> > > > tests that most people run to verify other possible error conditions
> > > > (short of pulling the plug on one of the servers)?
> > > 
> > > You can run CTS with your configuration. Otherwise, stopping
> > > heartbeat so that it doesn't notice being stopped (kill -9)
> > > simulates the "pull the power plug" condition. You'd also want to
> > > make various resources fail.
> > > 
> > > > Also, the Place constraint is on the
> > > > DRBD master/slave; does that make sense, or should it be placed on one
> > > > of the "higher level" resources like the file system or pgsql?
> > > 
> > > I don't think it matters; you can go with either, given that the
> > > resources are colocated.
> > > 
> > > > Thanks,
> > > > Doug
> > > > 
> > > > On Thu, 2007-04-26 at 09:45 -0400, Doug Knight wrote:
> > > > 
> > > > > Hi Alastair,
> > > > > Have you encountered a situation where, when you first start up the
> > > > > drbd master/slave resource, crm_mon and/or the GUI indicate Master
> > > > > status on one node and Started status on the other (as opposed to
> > > > > Slave)? If so, how did you correct it?
> > > > > 
> > > > > Doug
> > > > > p.s. Thanks for the scripts and xml, they're a big help!
> > > > > 

Attachment: cibadmin.node1.gz
Description: GNU Zip compressed data

Attachment: cibadmin.node2.gz
Description: GNU Zip compressed data

Attachment: ha.cf.node1.gz
Description: GNU Zip compressed data

Attachment: ha.cf.node2.gz
Description: GNU Zip compressed data

Attachment: node2_shutdown_hang.log.gz
Description: GNU Zip compressed data

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
