> On 6 Jul 2017, at 17:34, Ken Gaillot <kgail...@redhat.com> wrote:
> 
> On 07/06/2017 10:27 AM, Cesar Hernandez wrote:
>> 
>>> 
>>> It looks like a bug when the fenced node rejoins quickly enough that it
>>> is a member again before its fencing confirmation has been sent. I know
>>> there have been plenty of clusters with nodes that quickly reboot and
>>> slow fencing devices, so that seems unlikely, but I don't see another
>>> explanation.
>>> 
>> 
>> Could it be caused by node 2 rebooting and coming back up before the stonith 
>> script has finished?
> 
> That *shouldn't* cause any problems, but I'm not sure what's happening
> in this case.


So, this was the cause of the problem...
Before the two servers I have now, I set up three other cluster installations 
with a different internet hosting provider. With that provider, a machine took 
more than 2 minutes to reboot through the fencing script (slow boot process and 
a slow remote API).
So I added a "sleep 90" before the end of the script and it always worked 
perfectly.

Now, with a different provider, I used the same script, just swapping the 
remote API for the new provider's API. In this case, a machine takes 
approximately 10 seconds to do a full reboot, and the API is also faster (just 
2 or 3 seconds to respond).
So the machine was up again in less than 20 seconds.

I suppose the problem comes when the node that has been rebooted (node2, for 
example) sees that node1 is still waiting for the fencing script to finish (due 
to the sleep 90); it then becomes confused and exits pacemaker.

I changed that sleep 90 to a sleep 5 and it hasn't happened again.
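
For anyone hitting the same thing: rather than tuning a fixed sleep for each 
provider, the script could poll until the reboot is confirmed and give up after 
a bounded timeout. This is only a rough sketch, not what my script actually 
does. The wait_until helper and the node_is_rebooted check in the usage note 
are hypothetical names, and it assumes the provider API offers some way to 
query the machine's state.

```shell
#!/bin/sh
# Hypothetical helper: poll a command with a bounded timeout instead of
# using a fixed "sleep N" at the end of the fencing script.
#
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Runs COMMAND every INTERVAL seconds until it succeeds or roughly
# TIMEOUT seconds have been spent sleeping.
# Returns 0 as soon as COMMAND succeeds, 1 on timeout.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # "$@" is the status check; it is an assumption that the
        # provider API can be wrapped in such a command.
        if "$@"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

In the fencing script this might be called as 
"wait_until 90 2 node_is_rebooted node2 || exit 1", where node_is_rebooted 
stands for whatever provider-specific status query is available. On a fast 
provider the script returns within a few seconds, and on a slow one it still 
waits up to the old 90 seconds.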

Thanks a lot to everyone for the help

Cheers
Cesar



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
