[ https://issues.apache.org/jira/browse/QPID-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250226#comment-17250226 ]
Daniil Kirilyuk commented on QPID-8489:
---------------------------------------

Hi [~orudyy],

Thank you very much for the links to the JDK issues, and sorry for the delay - it took some time for testing. With the latest JDK 11 the problem could indeed no longer be reproduced:
{noformat}
java --version
openjdk 11.0.9 2020-10-20
OpenJDK Runtime Environment 18.9 (build 11.0.9+11)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.9+11, mixed mode)
{noformat}
Log under JDK 11.0.9:
{noformat}
2020-12-16 10:27:25,570 INFO [IO-/172.23.38.36:41814] (o.a.q.s.t.NonBlockingConnectionTLSDelegate) - isOutboundDone(): false, isInboundDone(): true, _status: Status = OK HandshakeStatus = NOT_HANDSHAKING bytesConsumed = 0 bytesProduced = 0
2020-12-16 10:27:38,946 INFO [IO-/172.23.38.36:41814] (o.a.q.s.t.NonBlockingConnectionTLSDelegate) - isOutboundDone(): false, isInboundDone(): true, _status: Status = OK HandshakeStatus = NOT_HANDSHAKING bytesConsumed = 0 bytesProduced = 0
2020-12-16 10:27:38,946 INFO [IO-/172.23.38.36:41814] (o.a.q.s.t.NonBlockingConnectionTLSDelegate) - isOutboundDone(): false, isInboundDone(): true, _status: Status = OK HandshakeStatus = NOT_HANDSHAKING bytesConsumed = 0 bytesProduced = 0
2020-12-16 10:27:38,949 INFO [Broker-Config] (q.m.c.dropped_connection) - [con:5(TWBXT_TWEB2FOAADEVCLR@null/default)] CON-1004 : Connection dropped
{noformat}
Unfortunately we have to support Java 8 (as well as TLSv1.3). As for JDK-8214418, some details can be found in the [mailing list|http://mail.openjdk.java.net/pipermail/security-dev/2019-January/019142.html] and [here|http://cr.openjdk.java.net/~xuelei/8214418/webrev.00/open.patch].
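For completeness, the fields in the log lines above come straight from javax.net.ssl.SSLEngine and SSLEngineResult. A minimal standalone sketch (not Broker-J code; it uses the default JDK SSLContext, and the class name is my own) that prints the same fields after the first wrap() of a fresh client-mode engine:
{noformat}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import java.nio.ByteBuffer;

public class EngineStatusDemo
{
    /** Performs the first wrap() on a fresh client-mode engine and returns its result. */
    public static SSLEngineResult firstWrap() throws Exception
    {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);
        ByteBuffer appOut = ByteBuffer.allocate(0);
        ByteBuffer netOut = ByteBuffer.allocate(engine.getSession().getPacketBufferSize());
        // The first wrap() starts the handshake and emits the ClientHello bytes.
        return engine.wrap(appOut, netOut);
    }

    public static void main(String[] args) throws Exception
    {
        SSLEngineResult result = firstWrap();
        // Same fields as in the NonBlockingConnectionTLSDelegate log lines above.
        System.out.println("_status: Status = " + result.getStatus()
                + " HandshakeStatus = " + result.getHandshakeStatus()
                + " bytesConsumed = " + result.bytesConsumed()
                + " bytesProduced = " + result.bytesProduced());
    }
}
{noformat}
On a healthy engine the first wrap() produces handshake bytes; the pathological case in this issue is the opposite - wrap() keeps returning with bytesProduced = 0 while the connection is endlessly rescheduled.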
Regarding the fix you suggested earlier: could the condition be made stricter, e.g.:
{noformat}
if (_status.bytesProduced() < 1
        && _status.getHandshakeStatus() == SSLEngineResult.HandshakeStatus.NEED_WRAP
        && !_sslEngine.isOutboundDone()
        && _sslEngine.isInboundDone())
{
    throw new SSLException(String.format("SSLEngine.wrap produced 0 bytes (status %s, handshake status %s)",
                                         _status.getStatus(), _status.getHandshakeStatus()));
}
{noformat}
This check seems to correctly handle both the tight-looping situation and the normal connection lifecycle.

> Connection thread looping
> -------------------------
>
>                 Key: QPID-8489
>                 URL: https://issues.apache.org/jira/browse/QPID-8489
>             Project: Qpid
>          Issue Type: Bug
>          Components: Broker-J
>    Affects Versions: qpid-java-broker-8.0.2
>            Reporter: Daniil Kirilyuk
>            Priority: Minor
>         Attachments: 0001-QPID-8489-Break-an-infinite-connection-processing-lo.patch, QPID-8489 - java.ssl.debug.log, QPID-8489.log, simple_recv.cpp, thread-dump.st
>
> The error happens quite rarely and is not easy to reproduce. Although the problem was partly addressed by the fix for QPID-8477, it can still be reproduced. The main symptom is a significant increase in CPU usage even when no messages are sent to the broker anymore. CPU usage can rise from 30% to 90% and higher, making the broker unusable. After such a CPU rise, the only way to recover the broker is to restart it.
> Analysis has shown that the error occurs with the CPP proton client when
> 1) an SSL connection is used
> 2) connection errors on the client side are ignored
> 3) the connection is dropped due to client process termination / network disconnection
> *Steps to reproduce*
> # Install the Java broker
> # Configure the broker to allow one connection
> # Prepare certificates
> # Install Qpid::Proton 0.28.0
> wget [http://archive.apache.org/dist/qpid/proton/0.28.0/qpid-proton-0.28.0.tar.gz]
> gunzip qpid-proton-0.28.0.tar.gz
> mkdir -p qpid-proton-0.28.0/build && pushd qpid-proton-0.28.0/build && cmake .. && make all && popd
> # Replace and edit the example *qpid-proton-0.28.0/cpp/examples/simple_recv.cpp* with the one attached
> # Build again
> cd qpid-proton-0.28.0/build
> make
> # Break the broker
> ./cpp/examples/simple_recv & ./cpp/examples/simple_recv
> Connection error
> {color:#ffab00}^C <= Hit Ctrl+C to kill the process{color}
> # {color:#172b4d}If CPU usage didn't increase, find the PID of the first simple_recv process using ps -ef | grep simple_recv and kill it using kill -9 PID.{color}
> *Analysis*
> CPU usage rises when the connection is dropped on the client side or when the network between client and broker breaks. The main point is that the client isn't well behaved and the connection isn't closed correctly.
> On the broker side the connection becomes "orphaned": it is still maintained by the broker, but no real reading / writing is performed.
> The following method calls are performed in an endless loop for each "orphaned" connection:
> SelectorThread.performSelect()
> SelectorThread.ConnectionProcessor.processConnection()
> NetworkConnectionScheduler.processConnection()
> NonBlockingConnection.doWork()
> As nothing is physically read or written, both NonBlockingConnection.doRead() and NonBlockingConnection.doWrite() execute very fast (several milliseconds) without any blocking, after which the connection is immediately rescheduled for processing in NetworkConnectionScheduler. Then the loop repeats.
> As the connection lifecycle looks normal, nothing unusual or suspicious is logged (nothing is seen in the log at all).
> The thread dump (see attachment) shows that the busy threads are mostly those named virtualhost-default-iopool-XX. A typical stacktrace looks like the following:
> {noformat}
> "virtualhost-default-iopool-39" #92 prio=5 os_prio=0 tid=0x00007f47ec335800 nid=0x37196 waiting on condition [0x00007f476a4e3000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000f39105d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:532)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$18/422330142.run(Unknown Source)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The main symptom of the error is rising CPU usage, which can reach up to 90% when several connections are "orphaned". An additional factor leading to the problem is a disabled keep-alive option for a connection, or a long keep-alive interval.

-- This message was sent by Atlassian Jira (v8.3.4#803005)