[ https://issues.apache.org/jira/browse/IGNITE-20940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Henderson updated IGNITE-20940:
------------------------------------
    Description: 
SSL has to be enabled to trigger this deadlock.


A node can be driven out of file descriptors on a Unix-type system, accidentally or maliciously, in any of these ways:

* creating a cache or caches whose partition count exceeds the number of remaining file descriptors (native persistence has to be on);
* repeatedly opening socket connections to the server (no SSL certificate required) without ever closing them (see the sketch after this list);
* using a commercial piece of software such as 3DNS that periodically polls the Ignite discovery port to check its liveness - this appears to make Ignite leak open files.
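
For illustration only, a stand-alone sketch of the second path in the list above; the target host and the use of Ignite's default discovery port 47500 are assumptions, not values from this report:

{code:java}
// Hypothetical reproduction sketch (not from this report): keep opening TCP
// connections to the discovery port and never close them, so descriptors are
// consumed on both ends. Host and port are placeholder assumptions.
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class FdLeakProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1"; // placeholder target node
        int discoPort = 47500;                                 // Ignite's default discovery port
        List<Socket> leaked = new ArrayList<>();
        while (true) {
            leaked.add(new Socket(host, discoPort));           // never closed, on purpose
            if (leaked.size() % 100 == 0)
                System.out.println("open sockets: " + leaked.size());
        }
    }
}
{code}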


Any of the above will eventually exhaust the node's file descriptors. A message send to another node then fails because no new socket connection can be opened, yet the node waits indefinitely for a reply that will never arrive because the original message was never sent.


ServerImpl -> SocketReader -> body() calls unmarshal(), which ultimately attempts to read from a socket that has no socket timeout set, so the read can block forever. A sketch of the socket-timeout fix follows.
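
A minimal sketch of the kind of bounded read the suggested fix describes, assuming the reader has access to the underlying Socket; the timeout value and helper name are illustrative, not actual Ignite code:

{code:java}
// Illustrative only: bound the blocking read with SO_TIMEOUT so the reader
// thread fails fast instead of hanging forever inside unmarshal().
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.Arrays;

class TimedSocketRead {
    /** Reads one chunk from the socket, failing fast instead of blocking forever. */
    static byte[] readChunk(Socket sock, int timeoutMillis) throws Exception {
        sock.setSoTimeout(timeoutMillis);      // bounds every subsequent read() on this socket
        InputStream in = sock.getInputStream();
        byte[] buf = new byte[8192];
        try {
            int n = in.read(buf);              // throws SocketTimeoutException instead of waiting forever
            return n < 0 ? new byte[0] : Arrays.copyOf(buf, n);
        } catch (SocketTimeoutException e) {
            // Close the connection here rather than leaving it to the timeout worker,
            // whose close path is where the SSL-filter lock wait begins.
            sock.close();
            throw e;
        }
    }
}
{code}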


A handshake timeout is eventually triggered by the system timer thread, which attempts to close the socket to break the stalemate. The close path invokes GridNioSslFilter -> onSessionClose(), which tries to acquire the sslHandler lock, but that lock is already held by the socket-reading (or other related) thread; the result is a deadlock.
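
The lock pattern can be reproduced outside Ignite with two plain threads and a ReentrantLock; the following is a stand-alone sketch (not Ignite code), where the sleep stands in for an SSL read with no timeout:

{code:java}
// Stand-alone sketch of the deadlock pattern (not Ignite code): thread A holds
// the "sslHandler" lock while it blocks indefinitely, thread B (standing in for
// the grid-timeout-worker close path) calls lock() with no bound and parks.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class SslHandlerLockStall {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock sslHandlerLock = new ReentrantLock();  // stand-in for the sslHandler lock
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread nioWorker = new Thread(() -> {
            sslHandlerLock.lock();                           // worker takes the lock to process SSL data
            lockHeld.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE);                // stands in for a socket read with no SO_TIMEOUT
            } catch (InterruptedException ignored) {
                // not reached in this sketch
            } finally {
                sslHandlerLock.unlock();
            }
        }, "grid-nio-worker-stand-in");
        nioWorker.setDaemon(true);

        Thread timeoutWorker = new Thread(() -> {
            sslHandlerLock.lock();                           // onSessionClose(): waits forever, as in the trace
            sslHandlerLock.unlock();
        }, "grid-timeout-worker-stand-in");
        timeoutWorker.setDaemon(true);

        nioWorker.start();
        lockHeld.await();
        timeoutWorker.start();
        timeoutWorker.join(2_000);
        System.out.println("timeout worker still blocked: " + timeoutWorker.isAlive());
    }
}
{code}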


A separate watchdog thread notices that the system timer thread has stopped updating its heartbeat timestamp, reports "Blocked system-critical thread has been detected", and triggers the failure handler.


If the failure handler is set to restart, the node restart process tries to create a marker file but fails because there are no free file descriptors, and the restart stalls. The node is now in an invalid state; if you try to stop the JVM with a SIGTERM, the shutdown hook handler deadlocks too.

If the failure handler is set to stop, it first attempts to cleanly close all existing connections; when it eventually tries to close the deadlocked connection, the GridNioSslFilter again attempts to acquire the sslHandler lock first, deadlocking the stop process too.

Suggested fix(es): add a socket timeout before calling unmarshal() and/or add a time limit in GridNioSslFilter when waiting to acquire the sslHandler lock (a sketch of the latter follows). Update the restart process so it does not have to create a marker file.
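
A hedged sketch of the lock-timeout idea, using a bounded tryLock() in the close path; the method shape, field names and the 5-second limit are assumptions, not the actual GridNioSslFilter code:

{code:java}
// Illustrative only: replace the unconditional lock() in the close path with a
// bounded tryLock(), so a stuck owner turns the close into a reported failure
// instead of a frozen system-critical thread.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class BoundedSessionCloseSketch {
    private final ReentrantLock sslHandlerLock = new ReentrantLock();

    void onSessionClose(Runnable proceedClose) throws InterruptedException {
        // Wait a bounded amount of time for the handler lock instead of forever.
        if (sslHandlerLock.tryLock(5, TimeUnit.SECONDS)) {
            try {
                proceedClose.run();            // normal close path, performed under the lock
            } finally {
                sslHandlerLock.unlock();
            }
        } else {
            // The owner is stuck (e.g. in a blocking SSL read): fail the graceful close
            // instead of parking the timeout worker; the caller can retry or escalate.
            throw new IllegalStateException("sslHandler lock not acquired within 5s; skipping graceful close");
        }
    }
}
{code}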

Also investigate why a commercial piece of software (3DNS) causes file descriptor leaks through its discovery-port polling.
 

  was:
SSL has to be enabled to trigger this deadlock.

ServerImpl -> SocketReader -> body() calls unmarshal(), which ultimately attempts to read from a socket that has no socket timeout set. If, as can happen during periods of network instability, one node thinks it has successfully sent a message to another node but the other node never received it, then both nodes can become blocked in the same unmarshal() call, each waiting for the other to send something.

A handshake timeout eventually triggers and attempts to close the socket to break the stalemate, but before the socket is closed the GridNioSslFilter -> onSessionClose() function is invoked and tries to acquire the sslHandler lock; the lock is already owned by the socket-reading or other related thread, and the result is deadlock.

A separate watchdog thread notices that the system timer thread has stopped updating its heartbeat timestamp, reports "Blocked system-critical thread has been detected", and triggers the failure handler.

If the failure handler is set to restart, the node restart process is triggered, which first attempts to cleanly close all existing connections; when it eventually tries to close the deadlocked connection, the GridNioSslFilter again attempts to acquire the sslHandler lock first, deadlocking the restart process too.

Suggested fix(es): add a socket timeout before calling unmarshal() and/or add a time limit in GridNioSslFilter when waiting to acquire the sslHandler lock.

 

Stack traces of relevant threads:

Thread [name="tcp-disco-sock-reader-[3cff52b3 IP:32602 client]-#285-#531", 
id=569, state=RUNNABLE, blockCnt=4, waitCnt=0|#285-#531", id=569, 
state=RUNNABLE, blockCnt=4, waitCnt=0]
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at 
sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:475)
        at 
sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:469)
        at 
sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:69)
        at 
sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1266)
        at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:76)
        at 
sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:943)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked java.io.BufferedInputStream@20b3a454
        at 
o.a.i.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
        at 
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2837)
        at 
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2853)
        at 
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3330)
        at 
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:939)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:401)
        at 
o.a.i.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:43)
        at o.a.i.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:122)
        at 
o.a.i.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
        at o.a.i.i.util.IgniteUtils.unmarshal(IgniteUtils.java:10709)
        at 
o.a.i.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7020)
        at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

 

Thread [name="grid-nio-worker-client-listener-1-#33", id=53, state=RUNNABLE, 
blockCnt=383, waitCnt=1|#33", id=53, state=RUNNABLE, blockCnt=383, waitCnt=1]
        at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:418)
        at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:397)
        - locked sun.security.ssl.SSLEngineImpl@2f9b9b2e
        at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:626)
        at 
o.a.i.i.util.nio.ssl.GridNioSslHandler.unwrap0(GridNioSslHandler.java:610)
        at 
o.a.i.i.util.nio.ssl.GridNioSslHandler.unwrapData(GridNioSslHandler.java:518)
        at 
o.a.i.i.util.nio.ssl.GridNioSslHandler.messageReceived(GridNioSslHandler.java:336)
        at 
o.a.i.i.util.nio.ssl.GridNioSslFilter.onMessageReceived(GridNioSslFilter.java:397)
        at 
o.a.i.i.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
        at 
o.a.i.i.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:3752)
        at 
o.a.i.i.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:175)
        at 
o.a.i.i.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1379)
        at 
o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2526)
        at 
o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2281)
        at 
o.a.i.i.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
        at java.lang.Thread.run(Thread.java:750)

*The blocked system timer thread:*

Thread [name="grid-timeout-worker-#22", id=40, state=WAITING, blockCnt=4, 
waitCnt=622037|#22", id=40, state=WAITING, blockCnt=4, waitCnt=622037]
    Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync@3ccdf067, 
ownerName=grid-nio-worker-client-listener-1-#33, ownerId=53]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at 
o.a.i.i.util.nio.ssl.GridNioSslFilter.onSessionClose(GridNioSslFilter.java:431)
        at 
o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
        at 
o.a.i.i.util.nio.GridNioCodecFilter.onSessionClose(GridNioCodecFilter.java:137)
        at 
o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
        at 
o.a.i.i.util.nio.GridNioAsyncNotifyFilter.onSessionClose(GridNioAsyncNotifyFilter.java:124)
        at 
o.a.i.i.util.nio.GridNioFilterAdapter.proceedSessionClose(GridNioFilterAdapter.java:128)
        at 
o.a.i.i.util.nio.GridNioFilterChain$TailFilter.onSessionClose(GridNioFilterChain.java:274)
        at 
o.a.i.i.util.nio.GridNioFilterChain.onSessionClose(GridNioFilterChain.java:203)
        at 
o.a.i.i.util.nio.GridNioSessionImpl.close(GridNioSessionImpl.java:169)
        at 
o.a.i.i.util.nio.GridSelectorNioSessionImpl.close(GridSelectorNioSessionImpl.java:498)
        at 
o.a.i.i.processors.odbc.ClientListenerNioListener$1.run(ClientListenerNioListener.java:264)
        at 
o.a.i.i.processors.timeout.GridTimeoutProcessor$CancelableTask.onTimeout(GridTimeoutProcessor.java:365)
        - locked 
o.a.i.i.processors.timeout.GridTimeoutProcessor$CancelableTask@a2e6d09
        at 
o.a.i.i.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
        at java.lang.Thread.run(Thread.java:750)

 


> Ignite SSL filter can cause internal node deadlock on inter-node communication
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-20940
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20940
>             Project: Ignite
>          Issue Type: Bug
>          Components: networking
>    Affects Versions: 2.13, 2.14, 2.15
>         Environment: Ignite 2.15.0
> Java 1.8+
> Calcite SQL engine, although SQL isn't being used.
> Three nodes with mixture of fat and thin clients.
> Native persistence on.
> SSL on.
> Failure handler set to RestartProcessFailureHandler
> Custom SegmentationPlugInProvider.
> Custom network timeout values to try to work around the unstable network at 
> night during large backups.
>  
>            Reporter: Gary Henderson
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
