On 12/09/13 02:57, Pascal Ehlert wrote:
On 11/09/13 7:31 PM, Digimer wrote:
That log message does show the node joining. Can you reliably
reproduce this? If so, can you please 'tail -f -n 0 /var/log/messages'
on both nodes, break the cluster and wait for the node to restart,
'tail' the rebooted node's /var/log/messages, wait the six minutes and
then, after the second fence occurs, post both nodes' logs?
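
If keeping a live 'tail -f' open across the reboot is awkward, a rough alternative (not what was asked for above, just a sketch) is to note the log's current length before breaking the cluster and then print only what was appended afterwards. The helper names `log_mark` and `log_since` are made up for illustration, and this assumes the log is not rotated in between.

```shell
# Hypothetical helpers: capture only the log lines written after a given point.

# Print the current number of lines in the log file given as $1 (the "mark").
log_mark() {
    wc -l < "$1" | tr -d ' '
}

# Print every line of log file $1 that was appended after mark $2.
log_since() {
    tail -n "+$(( $2 + 1 ))" "$1"
}
```

For example, run `mark=$(log_mark /var/log/messages)` before the test and `log_since /var/log/messages "$mark"` once the second fence has happened, on each node.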
I was indeed able to reliably reproduce this and that's where my
confusion came from. I don't understand why the node seems to join
(and leave again immediately afterwards, as per the log), all within the
360-second post-join fence delay, and yet still gets fenced.
As this is a semi-production system (we had to move quickly), I have gone
with a qdisk-based approach for now, using a small iSCSI disk from a remote
site. This works very well and reliably as far as I can tell from the
testing that I have done so far. I would still be interested to hear why
the initial approach failed.
How would manually starting the cluster services have made a difference
anyway? Does that mean that one should join the cluster and fence domain
first to ensure a stateless join and only then start rgmanager? Isn't
that something that could be achieved with some delays in the startup
scripts as well?
Either way, thank you all for helping out so quickly!
I honestly don't know why it would join -> fence. That's most likely a
network issue but I couldn't guess any more than that. Regardless, you
have an issue as this behaviour is certainly not normal. You may have
masked it with qdisk, but please don't leave things as they are. This is
worthy of further investigation.
In this case, manually starting the cluster would probably not change
anything. It would, however, allow you to more easily debug because you
could get the logs tail'ing before attempting to start the cluster.
We'll really need to see the logs in order to go much further.
If you can schedule a maintenance window, please reproduce this and post
the logs here. I am very curious as to what might be going on. In the
meantime, run 'cman_tool status', record the multicast address and make
sure that group is persistent in your switches.
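
To record the multicast address without copying it by hand, something like the following might do. It is only a sketch: it assumes 'cman_tool status' prints a line beginning with "Multicast addresses:", and the function name `multicast_addr` is invented here.

```shell
# Hypothetical helper: read `cman_tool status` output on stdin and print
# whatever follows the "Multicast addresses:" label (assumed line format).
# Prints nothing if no such line is present.
multicast_addr() {
    sed -n 's/^Multicast addresses:[[:space:]]*//p'
}
```

Usage would be `cman_tool status | multicast_addr`, run on each node so the values can be compared against what the switches report.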
There is a small chance that one of the services under rgmanager's
control is causing an interruption. Again, I'm only guessing.
digimer
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster