Hello Gianluca,

Do you have a private cluster network?

If the answer is yes, I recommend not using the heuristic: if your cluster's public network goes down, the cluster will end up in a fencing loop. Or you can do something better and use pacemaker+corosync.

On 9 March 2012 15:14, Gianluca Cecchi <gianluca.cec...@gmail.com> wrote:

> Hello,
> I have a cluster in RH EL 5.7 with quorum disk and an heuristic.
> Current versions of main cluster packages are:
> rgmanager-2.0.52-21.el5_7.1
> cman-2.0.115-85.el5_7.3
>
> This is the loaded heuristic
>
> Heuristic: 'ping -c1 -w1 10.4.5.250' score=1 interval=2 tko=200
>
> Line in cluster.conf:
> <heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="200"/>
>
> where 10.4.5.250 is the gateway of the production lan,
> From ping man page:
> -c count
> Stop after sending count ECHO_REQUEST packets. With deadline (-w)
> option, ping waits for count ECHO_REPLY packets, until the timeout
> expires.
> -w deadline
> Specify a timeout, in seconds, before ping exits regardless of how many
> packets have been sent or received. In this case ping does not stop
> after count packet are sent, it waits either for deadline expire or
> until count probes are answered or for some error notification from
> network.
>
> So I would expect that the single ping command, executed as a sanity
> check, at most after 1 second
> should exit with a code, regardless an echo reply has been received or not
> And in fact I had no particular problem for many months
>
> As a test, putting an ip on an unreachable lan (say 10.4.6.5):
> date
> n=0
> while [ $n -lt 20 ]
> do
> ping -c1 -w1 10.4.6.5
> sleep 2
> n=$(expr $n + 1)
> done
> date
>
> Output is
> Fri Mar 9 11:59:02 CET 2012
> PING 10.4.6.5 (10.4.6.5) 56(84) bytes of data.
>
> --- 10.4.6.5 ping statistics ---
> 2 packets transmitted, 0 received, 100% packet loss, time 1000ms
>
> ...
>
> --- 10.4.6.5 ping statistics ---
> 2 packets transmitted, 0 received, 100% packet loss, time 999ms
>
> Fri Mar 9 12:00:02 CET 2012
>
> so 60 seconds....
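
The 60-second total above is consistent with 20 iterations of about 1 second of ping plus 2 seconds of sleep, but it only shows the average. To see whether any single ping overruns its deadline, each call can be timed on its own. A minimal sketch, assuming a POSIX-style shell with `local` support; `elapsed_seconds` is a hypothetical helper name, not part of any cluster package:

```shell
# Hypothetical helper: run a command, discard its output, and print
# how many whole seconds it took (resolution is 1 second, from date +%s).
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Example use (commented out because it needs the test network):
# report any ping that took clearly longer than its 1-second deadline.
# n=0
# while [ $n -lt 20 ]; do
#     t=$(elapsed_seconds ping -c1 -w1 10.4.6.5)
#     [ "$t" -gt 2 ] && echo "iteration $n took ${t}s"
#     sleep 2
#     n=$(( n + 1 ))
# done
```

The threshold of 2 seconds is an arbitrary allowance for the 1-second clock resolution, not a value taken from qdiskd.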
>
> In case of gateway reachability problems (also tested with an iptables
> rule that drops icmp output request) I would then have:
>
> qdiskd[2780]: <debug> Heuristic: 'ping -c1 -w1 10.4.5.250' missed (1/200)
>
> Strange thing I got yesterday night was this only line:
>
> qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds
>
> and the node self-fencing causing relocation of some services
> So for some reason the ping command was not able to exit at all, I
> presume...
> despite the -c and -w options....
>
> I suppose a condition that causes an internal timeout defined for the
> monitor operation itself (default to 75 seconds?)
> something like a pacemaker directive
> op monitor interval="20" timeout="40"
>
> And the cluster at this point considering as heuristic failed at all
> and self-fencing....
> Is this right?
>
> My default quorumd directive is this one, btw:
>
> <quorumd device="/dev/mapper/mpquorum" interval="5" label="oraprquorum" log_facility="local4" log_level="7" tko="16" votes="1">
>
> And in fact when for some reason I have temporary problems with my
> SAN, I get something like:
>
> qdiskd[1339]: <warning> qdisk cycle took more than 5 seconds to complete (34.540000)
>
> and on the other node
> qdiskd[6025]: <debug> Node 1 missed an update (2/200)
> qdiskd[6025]: <debug> Node 1 missed an update (3/200)
> ...
>
> Can anyone give any insight for the message I got yesterday that I
> never saw before:
> qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds
>
> ?
> Do I have to suppose a bug in the ping command?
>
> Thanks in advance,
> Gianluca
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
This is my life, and I live it for as long as God wills
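
If the heuristic is kept, one way to guard against a ping that never returns is to have the heuristic program enforce its own hard limit, so that a hung ping would show up as an ordinary non-zero exit. A minimal sketch, assuming a POSIX-style shell with `local` support; `run_with_limit` and the 3-second limit are hypothetical choices, and this has not been tested under qdiskd:

```shell
# Hypothetical helper: run a command, but kill it if it is still running
# after LIMIT seconds. Returns the command's exit status; if the command
# was killed, the status is non-zero (128 + signal number).
run_with_limit() {
    local limit cmd_pid dog_pid status
    limit=$1
    shift

    # Start the real command in the background.
    "$@" &
    cmd_pid=$!

    # Watchdog: after LIMIT seconds, kill the command if it still exists.
    # Output is discarded so the watchdog never holds the caller's stdout.
    ( sleep "$limit"; kill -KILL "$cmd_pid" ) >/dev/null 2>&1 &
    dog_pid=$!

    # Wait for the command and remember how it ended.
    wait "$cmd_pid"
    status=$?

    # The command is done, so the watchdog is no longer needed.
    kill "$dog_pid" 2>/dev/null
    wait "$dog_pid" 2>/dev/null

    return "$status"
}

# Example use as a heuristic body (commented out, needs the real gateway):
# run_with_limit 3 ping -c1 -w1 10.4.5.250
```

The `program` attribute of the heuristic would then point at a script containing this function and the ping call instead of calling ping directly. Note that this cannot help if the process is stuck in uninterruptible I/O, because SIGKILL is not acted on until the kernel call returns.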