[ 
https://issues.apache.org/jira/browse/IGNITE-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-23118:
--------------------------------------
    Affects Version/s: 2.14

> Insufficient backward connection check.
> ---------------------------------------
>
>                 Key: IGNITE-23118
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23118
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.14
>            Reporter: Vladimir Steshin
>            Priority: Major
>              Labels: ise
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The backward check of a node's status is done only by opening a socket:
> {code:java}
> // ServerImpl#SocketReader#checkConnection(TcpDiscoveryNode node, int timeout):
> InetSocketAddress addr = addrs.get(addrIdx.getAndIncrement());
> try (Socket sock = new Socket()) {
>     if (liveAddrHolder.get() == null) {
>         sock.connect(addr, perAddrTimeout);
>         liveAddrHolder.compareAndSet(null, addr);
>     }
> }
> {code}
> We write no bytes and do not wait for even a trivial response. If the JVM is 
> stuck in a GC pause but still accepts the socket connection, this check gives 
> a false positive result. As a result, the wrong node can leave the cluster: 
> the node preceding the hanging one.
> Consider:
>     1) There is a cluster with nodes 'A', 'B', 'C'.
>     2) 'B' stalls in a GC pause or waits for some threads to stop at 
> safepoints. Its discovery threads are already suspended and do not read or write 
> messages/responses.
>     3) 'A' fails to send a message to 'B' and sees the timeout.
>     4) 'A' connects to 'C', asks it to check 'B' and to establish a new 
> permanent cluster connection 'A'->'C' if 'C' cannot check/ping 'B'.
>     5) 'C' pings 'B' and successfully creates a connection to it 
> (Socket#connect()), then closes the socket right after it was opened.
>     6) 'C' refuses to establish a permanent cluster connection with 'A' and 
> answers that 'B' is alive.
>     7) 'A' tries to connect to 'B' again. It connects successfully 
> (Socket#connect()) but receives no answer, because the JVM of 'B' can 
> only accept connections while the Ignite threads that read from and write to 
> the socket are suspended.
>     8) 'A' loops through #3 - #7 until it reaches `connectionRecoveryTimeout`.
>     9) 'A' becomes segmented and leaves the cluster even though it is alive 
> and is able to establish a permanent cluster connection to 'C'.
> We should either make this check write something to the socket and wait 
> for a response, or remove the check altogether.
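
The difference between a connect-only check and a write-and-wait check can be reproduced outside Ignite. The sketch below is not Ignite code: the class `BackwardCheckSketch`, the methods `connectOnlyCheck`, `echoCheck` and `respondOnce`, and the one-byte probe are hypothetical names and a made-up protocol chosen for illustration. It assumes that a listening `ServerSocket` on which nobody calls `accept()` is a fair stand-in for a node whose discovery threads are suspended, since the TCP handshake is typically completed from the listen backlog without any application thread running.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BackwardCheckSketch {
    /** Hypothetical one-byte probe; real discovery traffic would use its own messages. */
    private static final int PROBE = 1;

    /** Connect-only check, as in the snippet above: passes once the TCP handshake completes. */
    public static boolean connectOnlyCheck(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            return true;
        }
        catch (IOException e) {
            return false;
        }
    }

    /** Stricter check (assumption): write one byte and require one byte back within the timeout. */
    public static boolean echoCheck(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            sock.setSoTimeout(timeoutMs); // Bounds the blocking read below.

            sock.getOutputStream().write(PROBE);
            sock.getOutputStream().flush();

            return sock.getInputStream().read() != -1;
        }
        catch (IOException e) { // Also covers SocketTimeoutException from read().
            return false;
        }
    }

    /** Stand-in for a responsive node: accepts one connection and echoes one byte. */
    public static Thread respondOnce(ServerSocket srv) {
        Thread t = new Thread(() -> {
            try (Socket s = srv.accept()) {
                int b = s.getInputStream().read();
                if (b != -1) {
                    s.getOutputStream().write(b);
                    s.getOutputStream().flush();
                }
            }
            catch (IOException ignored) {
                // Server socket closed or peer went away; nothing to do in a sketch.
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        InetAddress lo = InetAddress.getLoopbackAddress();

        // Stand-in for the suspended node 'B': bound and listening, but accept() is never called.
        try (ServerSocket hung = new ServerSocket(0, 50, lo)) {
            InetSocketAddress addr = new InetSocketAddress(lo, hung.getLocalPort());
            System.out.println("hung node, connect-only: " + connectOnlyCheck(addr, 500));
            System.out.println("hung node, echo check:   " + echoCheck(addr, 500));
        }

        // Stand-in for a healthy node that actually reads and answers.
        try (ServerSocket live = new ServerSocket(0, 50, lo)) {
            respondOnce(live);
            InetSocketAddress addr = new InetSocketAddress(lo, live.getLocalPort());
            System.out.println("live node, echo check:   " + echoCheck(addr, 500));
        }
    }
}
```

On a typical OS the connect-only check is expected to pass against the hung stand-in, while the echo check fails after the read timeout and passes against the responsive stand-in. The cost of the stricter check is one extra round trip and a read timeout that has to fit within the overall failure detection budget.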



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
