Re: Speed up failure detection

Ivan Veselovskiy Tue, 14 Apr 2015 13:46:50 -0700

It would be great!

- Does sockTimeout affect all the server sockets involved in an Ignite node?
(E.g. there are sockets in discovery, in the Hadoop job tracker, in the IGFS
interface, even in the shmem handshake.)
- To reduce GC pauses, the G1 collector could potentially be helpful. Is there
any experience with it in Ignite?


--ivan

On Wed, Apr 15, 2015 at 12:25 AM, Yakov Zhdanov <[email protected]> wrote:

> Guys,
>
> I think we can (1) make grid configuration significantly easier and (2)
> speed up failure detection.
>
> Here are the discovery SPI configuration properties which are responsible
> for failure detection:
>
>    - reconnectCount,
>    - sockTimeout,
>    - networkTimeout,
>    - ackTimeout,
>    - maxAckTimeout,
>    - heartbeatFrequency,
>    - maxMissedHeartbeats
>
> The same goes for the communication SPI:
>
>    - reconnectCount,
>    - maxConnTimeout,
>    - connTimeout
>
> 10 or even more properties.
>
> We did it to address the half-open sockets problem (which is pretty common
> in cloud environments) and GC pauses which may happen on cluster nodes - we
> can increase ack timeouts so that such pauses are not treated as failures.
>
> By setting values for these props I effectively set the timeout for failure
> detection. Why do we need such a large number of parameters instead of
> having one on IgniteConfiguration - nodeResponseThreshold (or
> failureDetectionThreshold - can anyone propose a better name?).
>
> All other parameters will be calculated automatically (I think the user can
> still set some of them for full control over the situation - we need to
> decide if this is needed.)
>
> Ticket filed - https://issues.apache.org/jira/browse/IGNITE-752
>
> Thoughts?
>
> --Yakov
>
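
To make the quoted proposal concrete, here is a minimal sketch of how the
individual timeouts could be derived from one threshold. It is not Ignite code
and not the design from IGNITE-752: the class DerivedTimeouts, the method
fromThreshold and the split ratios (2 reconnects, 3 missed heartbeats) are
assumptions chosen only for illustration. The only rule it enforces is that
neither detection path takes longer than the configured threshold.

```java
/**
 * Hypothetical sketch: derives discovery-style timeouts from a single
 * failure detection threshold. Names and ratios are illustrative only.
 */
public final class DerivedTimeouts {
    public final int reconnectCount;
    public final long sockTimeoutMs;
    public final long ackTimeoutMs;
    public final long heartbeatFrequencyMs;
    public final int maxMissedHeartbeats;

    private DerivedTimeouts(int reconnectCount, long sockTimeoutMs, long ackTimeoutMs,
                            long heartbeatFrequencyMs, int maxMissedHeartbeats) {
        this.reconnectCount = reconnectCount;
        this.sockTimeoutMs = sockTimeoutMs;
        this.ackTimeoutMs = ackTimeoutMs;
        this.heartbeatFrequencyMs = heartbeatFrequencyMs;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
    }

    /** Splits the threshold so that every detection path fits inside it. */
    public static DerivedTimeouts fromThreshold(long thresholdMs) {
        if (thresholdMs < 4)
            throw new IllegalArgumentException("threshold must be at least 4 ms: " + thresholdMs);

        int reconnects = 2;                        // assumed number of attempts
        long perAttempt = thresholdMs / reconnects; // time budget for one attempt
        long sock = perAttempt / 2;                // half of the attempt for the socket write
        long ack = perAttempt - sock;              // the rest for the acknowledgement

        int maxMissed = 3;                         // assumed tolerated missed heartbeats
        long heartbeat = thresholdMs / (maxMissed + 1);

        return new DerivedTimeouts(reconnects, sock, ack, heartbeat, maxMissed);
    }

    /** Longest time the socket path needs before giving up on a node. */
    public long worstCaseSocketDetectionMs() {
        return reconnectCount * (sockTimeoutMs + ackTimeoutMs);
    }

    /** Longest silence tolerated on the heartbeat path. */
    public long worstCaseHeartbeatDetectionMs() {
        return heartbeatFrequencyMs * maxMissedHeartbeats;
    }

    public static void main(String[] args) {
        DerivedTimeouts t = fromThreshold(10_000);
        System.out.println("sockTimeout=" + t.sockTimeoutMs
            + " ackTimeout=" + t.ackTimeoutMs
            + " heartbeatFrequency=" + t.heartbeatFrequencyMs
            + " maxMissedHeartbeats=" + t.maxMissedHeartbeats);
    }
}
```

With a single input the user reasons about one number (how long a node may
stay silent), and explicit overrides could still be layered on top if full
control turns out to be needed.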
