[ https://issues.apache.org/jira/browse/IGNITE-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Steshin updated IGNITE-23118:
--------------------------------------
    Affects Version/s: 2.14

> Insufficient backward connection check.
> ---------------------------------------
>
>                 Key: IGNITE-23118
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23118
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.14
>            Reporter: Vladimir Steshin
>            Priority: Major
>              Labels: ise
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The backward node status check is done only by opening a socket:
> {code:java}
> // ServerImpl#SocketReader#checkConnection(TcpDiscoveryNode node, int timeout):
> InetSocketAddress addr = addrs.get(addrIdx.getAndIncrement());
> try (Socket sock = new Socket()) {
>     if (liveAddrHolder.get() == null) {
>         sock.connect(addr, perAddrTimeout);
>         liveAddrHolder.compareAndSet(null, addr);
>     }
> }
> {code}
> We write no bytes and do not wait for even a trivial response. If a JVM is
> stuck in a GC pause but still accepts socket connections, this check gives a
> false positive. As a result, the wrong node can leave the cluster: the node
> before the hanging one.
> Consider:
> 1) There is a cluster with nodes 'A', 'B', 'C'.
> 2) 'B' is stalled in a GC pause or is waiting for some threads to stop at
> safe points. Its discovery threads are already suspended and do not read or
> write messages/responses.
> 3) 'A' fails to send a message to 'B' and hits the timeout.
> 4) 'A' connects to 'C' and asks it to check 'B' and to establish a new
> permanent cluster connection 'A'->'C' if 'C' cannot check/ping 'B'.
> 5) 'C' pings 'B' and successfully opens a connection to it
> (Socket#connect()), then closes the socket right after it was opened.
> 6) 'C' refuses to establish a permanent cluster connection with 'A' and
> answers that 'B' is alive.
> 7) 'A' tries to connect to 'B' again. It connects successfully
> (Socket#connect()) but receives no answer, because the JVM of 'B' can only
> accept connections while Ignite's threads that read from and write to the
> socket are suspended.
> 8) 'A' repeats steps 3-7 until `connectionRecoveryTimeout` is reached.
> 9) 'A' segments and leaves the cluster even though it is alive and able to
> establish a permanent cluster connection to 'C'.
> We should either make this check write something to the socket and wait for
> a response, or remove it altogether.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
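
For illustration, here is a minimal, self-contained sketch of the difference between a connect-only check and a check that writes a byte and waits for a reply. It is not Ignite's code or its wire protocol: the class `BackwardCheckSketch`, the methods `connectOnlyCheck`, `pingCheck` and `startResponder`, and the one-byte `PING`/`PONG` exchange are all hypothetical names chosen for this example. The hung node 'B' is modelled, as an assumption, by a listening socket that never calls `accept()`: the OS still completes the TCP handshake, but nothing reads or answers.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class BackwardCheckSketch {
    // Hypothetical one-byte protocol, used only in this sketch.
    static final int PING = 1;
    static final int PONG = 2;

    /** Connect-only check: reports success as soon as the TCP handshake completes. */
    static boolean connectOnlyCheck(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Stricter check: writes one byte and requires a one-byte reply within the timeout. */
    static boolean pingCheck(InetSocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);
            sock.setSoTimeout(timeoutMs); // bound the blocking read below

            OutputStream out = sock.getOutputStream();
            out.write(PING);
            out.flush();

            return sock.getInputStream().read() == PONG;
        } catch (IOException e) {
            // Covers connection failures and SocketTimeoutException on read.
            return false;
        }
    }

    /** Starts a live peer that answers a single PING with PONG, then stops. */
    static ServerSocket startResponder() throws IOException {
        ServerSocket srv = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread t = new Thread(() -> {
            try (Socket s = srv.accept()) {
                if (s.getInputStream().read() == PING) {
                    s.getOutputStream().write(PONG);
                    s.getOutputStream().flush();
                }
            } catch (IOException ignored) {
                // Server socket closed or client went away; nothing to do in a sketch.
            }
        });
        t.setDaemon(true);
        t.start();
        return srv;
    }

    public static void main(String[] args) throws IOException {
        // 'hung' never calls accept(): it stands in for the paused node 'B'.
        try (ServerSocket hung = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             ServerSocket live = startResponder()) {
            InetSocketAddress hungAddr = new InetSocketAddress(hung.getInetAddress(), hung.getLocalPort());
            InetSocketAddress liveAddr = new InetSocketAddress(live.getInetAddress(), live.getLocalPort());

            System.out.println("hung, connect-only: " + connectOnlyCheck(hungAddr, 1000));
            System.out.println("hung, ping:         " + pingCheck(hungAddr, 500));
            System.out.println("live, ping:         " + pingCheck(liveAddr, 1000));
        }
    }
}
```

Against the silent listener, the connect-only check is expected to pass while the ping check times out, which is the false positive described in steps 5 and 7. A real fix would have to reuse Ignite's existing discovery messages and timeouts rather than a new one-byte exchange.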