[ https://issues.apache.org/jira/browse/QPID-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250226#comment-17250226 ]
Daniil Kirilyuk commented on QPID-8489:
---------------------------------------

Hi [~orudyy],

Thank you very much for the links to the JDK issues, and sorry for the delay - it took some time for testing. With the latest JDK 11 the problem could indeed no longer be reproduced:
{noformat}
java --version
openjdk 11.0.9 2020-10-20
OpenJDK Runtime Environment 18.9 (build 11.0.9+11)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.9+11, mixed mode)
{noformat}
Log under JDK 11.0.9:
{noformat}
2020-12-16 10:27:25,570 INFO [IO-/172.23.38.36:41814] (o.a.q.s.t.NonBlockingConnectionTLSDelegate) - isOutboundDone(): false, isInboundDone(): true, _status: Status = OK HandshakeStatus = NOT_HANDSHAKING bytesConsumed = 0 bytesProduced = 0
2020-12-16 10:27:38,946 INFO [IO-/172.23.38.36:41814] (o.a.q.s.t.NonBlockingConnectionTLSDelegate) - isOutboundDone(): false, isInboundDone(): true, _status: Status = OK HandshakeStatus = NOT_HANDSHAKING bytesConsumed = 0 bytesProduced = 0
2020-12-16 10:27:38,946 INFO [IO-/172.23.38.36:41814] (o.a.q.s.t.NonBlockingConnectionTLSDelegate) - isOutboundDone(): false, isInboundDone(): true, _status: Status = OK HandshakeStatus = NOT_HANDSHAKING bytesConsumed = 0 bytesProduced = 0
2020-12-16 10:27:38,949 INFO [Broker-Config] (q.m.c.dropped_connection) - [con:5(TWBXT_TWEB2FOAADEVCLR@null/default)] CON-1004 : Connection dropped
{noformat}
Unfortunately we have to support Java 8 (as well as TLSv1.3). As for JDK-8214418, some details can be found in the [mailing list|http://mail.openjdk.java.net/pipermail/security-dev/2019-January/019142.html] and [here|http://cr.openjdk.java.net/~xuelei/8214418/webrev.00/open.patch].
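For completeness, the fields in the log lines above come straight from javax.net.ssl.SSLEngine and SSLEngineResult. A minimal standalone sketch (not Broker-J code; it uses the default JDK SSLContext, and the class name is my own) that prints the same fields after the first wrap() of a fresh client-mode engine:
{noformat}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import java.nio.ByteBuffer;

public class EngineStatusDemo
{
    /** Performs the first wrap() on a fresh client-mode engine and returns its result. */
    public static SSLEngineResult firstWrap() throws Exception
    {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);
        ByteBuffer appOut = ByteBuffer.allocate(0);
        ByteBuffer netOut = ByteBuffer.allocate(engine.getSession().getPacketBufferSize());
        // The first wrap() starts the handshake and emits the ClientHello bytes.
        return engine.wrap(appOut, netOut);
    }

    public static void main(String[] args) throws Exception
    {
        SSLEngineResult result = firstWrap();
        // Same fields as in the NonBlockingConnectionTLSDelegate log lines above.
        System.out.println("_status: Status = " + result.getStatus()
                + " HandshakeStatus = " + result.getHandshakeStatus()
                + " bytesConsumed = " + result.bytesConsumed()
                + " bytesProduced = " + result.bytesProduced());
    }
}
{noformat}
On a healthy engine the first wrap() produces handshake bytes; the pathological case in this issue is the opposite - wrap() keeps returning with bytesProduced = 0 while the connection is endlessly rescheduled.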
Regarding the fix you suggested earlier: could the condition be made stricter, e.g.:
{noformat}
if (_status.bytesProduced() < 1
        && _status.getHandshakeStatus() == SSLEngineResult.HandshakeStatus.NEED_WRAP
        && !_sslEngine.isOutboundDone()
        && _sslEngine.isInboundDone())
{
    throw new SSLException(String.format("SSLEngine.wrap produced 0 bytes (status %s, handshake status %s)",
                                         _status.getStatus(), _status.getHandshakeStatus()));
}
{noformat}
This check seems to correctly handle both the tight-looping situation and the normal connection lifecycle.

> Connection thread looping
> -------------------------
>
>                 Key: QPID-8489
>                 URL: https://issues.apache.org/jira/browse/QPID-8489
>             Project: Qpid
>          Issue Type: Bug
>          Components: Broker-J
>    Affects Versions: qpid-java-broker-8.0.2
>            Reporter: Daniil Kirilyuk
>            Priority: Minor
>         Attachments: 0001-QPID-8489-Break-an-infinite-connection-processing-lo.patch, QPID-8489 - java.ssl.debug.log, QPID-8489.log, simple_recv.cpp, thread-dump.st
>
> The error happens quite rarely and is not easy to reproduce. Although the problem was partly addressed by the fix for QPID-8477, it can still be reproduced. The main symptom is a significant increase in CPU usage even when no messages are sent to the broker anymore. CPU usage can rise from 30% to 90% and higher, making the broker unusable. After such a CPU rise, the only way to recover the broker is to restart it.
> Analysis has shown that the error occurs with the CPP proton client when
> 1) an SSL connection is used
> 2) connection errors on the client side are ignored
> 3) the connection is dropped due to client process termination / network disconnection
> *Steps to reproduce*
> # Install the Java broker
> # Configure the broker to allow one connection
> # Prepare certificates
> # Install Qpid::Proton 0.28.0
> wget [http://archive.apache.org/dist/qpid/proton/0.28.0/qpid-proton-0.28.0.tar.gz]
> gunzip qpid-proton-0.28.0.tar.gz
> mkdir -p qpid-proton-0.28.0/build && pushd qpid-proton-0.28.0/build && cmake .. && make all && popd
> # Replace and edit the example *qpid-proton-0.28.0/cpp/examples/simple_recv.cpp* with the one attached
> # Build again
> cd qpid-proton-0.28.0/build
> make
> # Break the broker
> ./cpp/examples/simple_recv & ./cpp/examples/simple_recv
> Connection error
> {color:#ffab00}^C <= Hit Ctrl+C to kill the process{color}
> # {color:#172b4d}If CPU usage didn't increase, find the PID of the first simple_recv process using ps -ef | grep simple_recv and kill it using kill -9 PID.{color}
> *Analysis*
> CPU usage rises when the connection is dropped on the client side or when the network between client and broker breaks. The main point is that the client isn't well behaved and the connection isn't closed correctly.
> On the broker side the connection becomes "orphaned": it is still maintained by the broker, but no real reading / writing is performed.
> The following method calls are performed in an endless loop for each "orphaned" connection:
> SelectorThread.performSelect()
> SelectorThread.ConnectionProcessor.processConnection()
> NetworkConnectionScheduler.processConnection()
> NonBlockingConnection.doWork()
> As nothing is physically read or written, both NonBlockingConnection.doRead() and NonBlockingConnection.doWrite() execute very fast (several milliseconds) without any blocking, after which the connection is immediately rescheduled for processing in NetworkConnectionScheduler. Then the loop repeats.
> As the connection lifecycle looks normal, nothing unusual or suspicious is logged (nothing is seen in the log at all).
> The thread dump (see attachment) shows that the busy threads are mostly those named virtualhost-default-iopool-XX. A typical stacktrace looks like the following:
> {noformat}
> "virtualhost-default-iopool-39" #92 prio=5 os_prio=0 tid=0x00007f47ec335800 nid=0x37196 waiting on condition [0x00007f476a4e3000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000f39105d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:532)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$18/422330142.run(Unknown Source)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The main symptom of the error is rising CPU usage, which can reach up to 90% when several connections are "orphaned". An additional factor leading to the problem is a disabled keep-alive option for a connection, or a long keep-alive interval.

-- This message was sent by Atlassian Jira (v8.3.4#803005)