Speed up failure detection

Yakov Zhdanov Tue, 14 Apr 2015 13:26:35 -0700

Guys,

I think we can (1) make grid configuration significantly easier and (2)
speed up failure detection.


Here are the discovery SPI configuration properties that are responsible
for failure detection:

   - reconnectCount,
   - sockTimeout,
   - networkTimeout,
   - ackTimeout,
   - maxAckTimeout,
   - heartbeatFrequency,
   - maxMissedHeartbeats

And the same for the communication SPI:

   - reconnectCount,
   - maxConnTimeout,
   - connTimeout

That is 10 properties, or even more.

We did it to address the half-open sockets problem (which is pretty common
in cloud environments) and GC pauses, which may happen on cluster nodes: we
can increase the ack timeouts so that such pauses are not treated as failures.

By setting values for these properties I effectively set the timeout for
failure detection. Why do we need so many parameters instead of having a
single one on IgniteConfiguration: nodeResponseThreshold (or
failureDetectionThreshold; can anyone propose a better name?)?

All other parameters would be calculated automatically (I think the user
could still set some of them for full control over the situation; we need to
decide whether this is needed).
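
To illustrate the idea, here is a minimal sketch of how the other values
might be derived from one threshold. The class name DerivedTimeouts, the
fixed counts and the split ratios are all assumptions made for illustration;
they are not Ignite's API and not a proposed final formula.

```java
// Hypothetical sketch: none of these names or ratios come from Ignite itself.
final class DerivedTimeouts {
    // Assumed fixed counts; in this sketch the user no longer tunes them.
    static final int RECONNECT_COUNT = 2;
    static final int MAX_MISSED_HEARTBEATS = 1;

    final long sockTimeoutMs;
    final long ackTimeoutMs;
    final long maxAckTimeoutMs;
    final long heartbeatFrequencyMs;

    private DerivedTimeouts(long thresholdMs) {
        // Split the whole budget evenly across the connection attempts.
        long perAttempt = Math.max(1, thresholdMs / RECONNECT_COUNT);

        sockTimeoutMs = perAttempt;
        ackTimeoutMs = perAttempt;
        // An ack wait is never allowed to exceed the whole budget.
        maxAckTimeoutMs = thresholdMs;
        // MAX_MISSED_HEARTBEATS + 1 heartbeat intervals fit into the budget.
        heartbeatFrequencyMs = Math.max(1, thresholdMs / (MAX_MISSED_HEARTBEATS + 1));
    }

    /** Derives all timeouts from a single failure detection threshold in milliseconds. */
    static DerivedTimeouts fromThreshold(long thresholdMs) {
        if (thresholdMs <= 0)
            throw new IllegalArgumentException("Threshold must be positive: " + thresholdMs);

        return new DerivedTimeouts(thresholdMs);
    }

    public static void main(String[] args) {
        DerivedTimeouts t = DerivedTimeouts.fromThreshold(10_000);

        System.out.println("sockTimeout=" + t.sockTimeoutMs
            + ", ackTimeout=" + t.ackTimeoutMs
            + ", maxAckTimeout=" + t.maxAckTimeoutMs
            + ", heartbeatFrequency=" + t.heartbeatFrequencyMs);
    }
}
```

With something like this the user sets one number, and the question of
which derived values stay overridable can be decided separately.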

Ticket filed - https://issues.apache.org/jira/browse/IGNITE-752

Thoughts?

--Yakov
