On Mar 14, 2012, at 12:33 PM, Florian Haas wrote:

>> However, sometimes pacemakerd will not stop cleanly.
> 
> OK. Whether this is related to your original problem or not is a completely
> open question, jftr.
> 
>> I thought it might happen when stopping pacemaker on the current DC, but 
>> after successfully reproducing this failure twice, I couldn't do it again. 
>> Pacemakerd seems to exit, but fail to notify the other nodes of its 
>> shutdown. Syslog is flooded with "Retransmit List" messages (log attached). 
>> These persist until I stop corosync. If asked immediately after stopping 
>> pacemaker and corosync on one node, "crm status" on the other nodes will 
>> report that node as still online. After a while, the stopped node switches 
>> to offline; I assume some timeout expires and they conclude it crashed.
> 
> You didn't give much other information, so I'm asking this on a hunch:
> does your pacemaker service configuration stanza for corosync (either
> in /etc/corosync/corosync.conf or in
> /etc/corosync/service.d/pacemaker) say "ver: 0" or "ver: 1"?

I'm not sure if this is the same problem or not. Before I installed 
1.0.9+hg2665-1~bpo60+2, I did experience a symptom that, to my inexperienced 
eyes, looked very similar: I'd try to stop pacemaker, it wouldn't stop, and 
I'd get that flood of retransmits in syslog.

To answer your question, I am using "ver: 1". It's worth mentioning that the 
corosync.conf that comes with the packages in squeeze-backports has a service 
block with ver: 0 in it, which took me some time to discover. However, I 
removed it long ago. Syslog seems to verify that ver: 1 is in effect:

Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found 
'pacemaker' for option: name
Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found '1' 
for option: ver
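
In case it helps anyone else hitting the same packaging gotcha, the service 
stanza I ended up with looks like this (a sketch of the relevant block, placed 
in /etc/corosync/service.d/pacemaker; the same block could live in 
corosync.conf instead):

```
service {
    # With ver: 1, corosync only loads the Pacemaker plugin; pacemakerd
    # is started and stopped separately (e.g. via its init script),
    # rather than being launched by corosync itself as with ver: 0.
    name: pacemaker
    ver:  1
}
```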

After playing with this system more, it seems this flood of "Retransmit List" 
messages in syslog occurs not only on pacemakerd shutdown. For example, I was 
just trying to add a DRBD resource, and crm hung at "cib commit":

crm(drbd)# cib commit drbd
[long pause, some minutes long]
Could not commit shadow instance 'drbd' to the CIB: Remote node did not respond
ERROR: failed to commit the drbd shadow CIB

"corosync[7915]:   [TOTEM ] Retransmit List: b7 b8 b9" is being flooded to 
syslog.

Every time I try to reproduce this, I succeed once or twice, but then no more. 
I'm beginning to think that to trigger it, a node has to have been running for 
some time. I can reproduce it a few times because I try it on each node. Then 
I have to restart corosync on each node to get things working again, and after 
that everything is fine, until I move on, spend some time reading 
documentation, and try again.

I'm assuming these "Retransmit List" messages in syslog indicate that corosync 
sent a message to the other nodes, did not receive acknowledgement, and is 
attempting to resend it. I know corosync uses IP multicast to communicate with 
the other nodes. Is it possible that my network is doing something that breaks 
multicast connectivity? Multicast IP isn't something I've ever had to deal 
with, so I'm not really sure. Everything I can find about configuring a 
network for multicast quickly turns to IP routers, which isn't relevant in my 
setup: all the cluster nodes are on the same VLAN, on the same switch. Could 
this be an issue? Is there a low-level utility (like ping) that I can use to 
verify multicast IP connectivity?
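
In case it's useful, here's the kind of low-level check I have in mind, 
sketched in Python: one socket joins a multicast group, another sends a 
datagram to it. The group and port below are made-up example values, not 
necessarily what corosync is configured with (that would be the 
mcastaddr/mcastport settings in corosync.conf's totem section). Run the 
receive side on one node and the send side on another to exercise the switch; 
running both on a single host (as below, relying on multicast loopback, which 
is on by default) is only a local sanity check.

```python
import socket
import struct

# Example values only -- check mcastaddr/mcastport in corosync.conf
# before testing against the cluster's actual group.
GROUP = "239.255.42.42"
PORT = 5405


def multicast_self_test(group=GROUP, port=PORT, timeout=2.0):
    """Send one UDP datagram to a multicast group and try to receive it."""
    # Receiver: bind the port and join the multicast group on any interface.
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    rx.bind(("", port))
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    rx.settimeout(timeout)

    # Sender: TTL 1 keeps the datagram on the local network segment.
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    tx.sendto(b"mcast-test", (group, port))

    try:
        data, _addr = rx.recvfrom(1024)
        return data
    finally:
        rx.close()
        tx.close()
```

Whether hosts answer ping to a multicast address varies by OS and 
configuration, so a send/receive pair like this (or tcpdump on the receiving 
node) seems more conclusive than ping alone.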

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
