Hello Gianluca

Do you have a cluster private network?

If the answer is yes, I recommend not using a heuristic: if the cluster's
public network goes down, the cluster can end up in a fencing loop.

Or, better still, you can use Pacemaker + Corosync.
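
If you do keep the heuristic, one possible guard against a ping that
never returns is to run it under a hard deadline. The following is only
a sketch: run_with_deadline is a hypothetical helper, not part of cman
or qdiskd, and it assumes the coreutils "timeout" command is not
available on EL 5. It also cannot help if the process is stuck in
uninterruptible sleep, and I do not know that a hung ping was really the
cause of the message you saw.

```shell
#!/bin/bash
# run_with_deadline SECONDS CMD [ARGS...]   (hypothetical helper)
# Runs CMD in the background and sends it SIGKILL if it is still alive
# after SECONDS. Returns CMD's exit status (137 if it had to be killed).
run_with_deadline() {
    local deadline="$1"
    shift
    local rc=0

    # Start the real command.
    "$@" &
    local cmd_pid=$!

    # Watchdog: after the deadline, kill the command if it is still there.
    # Output goes to /dev/null so the leftover sleep never holds a pipe open.
    ( sleep "$deadline"; kill -9 "$cmd_pid" ) >/dev/null 2>&1 &
    local dog_pid=$!

    # Collect the command's exit status without tripping "set -e".
    wait "$cmd_pid" 2>/dev/null || rc=$?

    # The command is done, so the watchdog is no longer needed.
    kill "$dog_pid" 2>/dev/null || true
    wait "$dog_pid" 2>/dev/null || true

    return "$rc"
}
```

Saved as a small script that ends with a call such as

    run_with_deadline 3 ping -c1 -w1 10.4.5.250

it could be used as the heuristic program in place of the bare ping.
As far as I know qdiskd treats exit status 0 as success and anything
else as a miss, so a killed ping would simply count as one missed check.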

On 9 March 2012 at 15:14, Gianluca Cecchi
<gianluca.cec...@gmail.com> wrote:

> Hello,
> I have a cluster on RHEL 5.7 with a quorum disk and a heuristic.
> Current versions of main cluster packages are:
> rgmanager-2.0.52-21.el5_7.1
> cman-2.0.115-85.el5_7.3
>
> This is the loaded heuristic
>
> Heuristic: 'ping -c1 -w1 10.4.5.250' score=1 interval=2 tko=200
>
> Line in cluster.conf:
> <heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1"
> tko="200"/>
>
> where 10.4.5.250 is the gateway of the production LAN.
> From the ping man page:
> -c count
>     Stop after sending count ECHO_REQUEST packets. With deadline (-w)
>     option, ping waits for count ECHO_REPLY packets, until the timeout
>     expires.
> -w deadline
>     Specify a timeout, in seconds, before ping exits regardless of how
>     many packets have been sent or received. In this case ping does not
>     stop after count packet are sent, it waits either for deadline
>     expire or until count probes are answered or for some error
>     notification from network.
>
> So I would expect the single ping command, run as a sanity check, to
> exit with a status code after at most 1 second, whether or not an
> echo reply has been received.
> And in fact I had no particular problems for many months.
>
> As a test, pinging an IP on an unreachable LAN (say 10.4.6.5):
> date
> n=0
> while [ $n -lt 20 ]
> do
>  ping -c1 -w1 10.4.6.5
>  sleep 2
>  n=$(expr $n + 1)
> done
> date
>
> Output is
> Fri Mar  9 11:59:02 CET 2012
> PING 10.4.6.5 (10.4.6.5) 56(84) bytes of data.
>
> --- 10.4.6.5 ping statistics ---
> 2 packets transmitted, 0 received, 100% packet loss, time 1000ms
>
> ...
>
> --- 10.4.6.5 ping statistics ---
> 2 packets transmitted, 0 received, 100% packet loss, time 999ms
>
> Fri Mar  9 12:00:02 CET 2012
>
> so 60 seconds....
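
The 60 seconds is what the loop should take if every ping honours its
deadline: 20 iterations of (1 second of ping + 2 seconds of sleep). A
rough way to check each probe separately is to log how long each one
takes. This is only a sketch; elapsed_seconds and worst_case_seconds
are hypothetical helper names, and date +%s gives whole-second
resolution only.

```shell
#!/bin/bash
# elapsed_seconds CMD [ARGS...]   (hypothetical helper)
# Prints how many whole seconds CMD took. CMD's own output and exit
# status are discarded.
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# worst_case_seconds PROBES DEADLINE PAUSE   (hypothetical helper)
# Upper bound for a loop of PROBES probes, each limited to DEADLINE
# seconds and followed by a PAUSE-second sleep.
worst_case_seconds() {
    echo $(( $1 * ($2 + $3) ))
}

worst_case_seconds 20 1 2    # prints 60, matching the test above

# Example call, commented out because it needs the network:
# elapsed_seconds ping -c1 -w1 10.4.6.5
```

Any probe that reports clearly more than the -w deadline would be the
one to look at more closely.
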
>
> In case of gateway reachability problems (also tested with an iptables
> rule that drops icmp output request) I would then have:
>
> qdiskd[2780]: <debug> Heuristic: 'ping -c1 -w1 10.4.5.250' missed (1/200)
>
> The strange thing I got last night was this single line:
>
> qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds
>
> followed by the node fencing itself, which caused some services to
> relocate.
> So I presume that, for some reason, the ping command was not able to
> exit at all, despite the -c and -w options.
>
> I suppose some condition triggered an internal timeout defined for
> the monitor operation itself (defaulting to 75 seconds?), something
> like the pacemaker directive
> op monitor interval="20" timeout="40"
>
> and that the cluster at this point considered the heuristic failed
> altogether, so the node fenced itself.
> Is this right?
>
> My default quorumd directive is this one, btw:
>
> <quorumd device="/dev/mapper/mpquorum" interval="5" label="oraprquorum"
> log_facility="local4" log_level="7" tko="16" votes="1">
>
> And in fact when for some reason I have temporary problems with my
> SAN, I get something like:
>
> qdiskd[1339]: <warning> qdisk cycle took more than 5 seconds to complete (34.540000)
>
> and on the other node
> qdiskd[6025]: <debug> Node 1 missed an update (2/200)
> qdiskd[6025]: <debug> Node 1 missed an update (3/200)
> ...
>
> Can anyone give any insight into the message I got yesterday, which I
> had never seen before?
>
> qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds
>
> Should I suspect a bug in the ping command?
>
> Thanks in advance,
> Gianluca
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
this is my life and I live it for as long as God wills