[
https://issues.apache.org/jira/browse/QPID-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331563#comment-17331563
]
Alex Rudyy commented on QPID-8514:
----------------------------------
Hi [~daniel.kirilyuk],
I committed a change
[5db3f7c|https://gitbox.apache.org/repos/asf?p=qpid-broker-j.git;h=5db3f7c] for
the ring overflow policy. It allows the Broker to delete the next available
entry after the least significant one (the oldest one in case of a standard
queue) when the least significant entry is acquired by a consumer in the
acquired non-stealable state. The Broker cannot delete messages in the acquired
non-stealable state. As a result, previously, the same message was returned by
the queue (as the deletion did not happen), causing the loop in the ring
overflow policy. The implemented change should prevent the CPU from spinning in
the ring policy loop on running into an acquired non-stealable message.
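To illustrate the idea, here is a simplified sketch (not the actual Broker-J code; the class and the {{deletable}} flag are made up for the example). The point is that the overflow check walks past entries that cannot be deleted instead of retrying the same entry forever:
{noformat}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public class RingOverflowSketch
{
    // Hypothetical entry: "deletable" stands in for "not acquired in a
    // non-stealable state".
    static final class Entry
    {
        final String message;
        final boolean deletable;

        Entry(final String message, final boolean deletable)
        {
            this.message = message;
            this.deletable = deletable;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();
    private final int maximumQueueDepth;

    RingOverflowSketch(final int maximumQueueDepth)
    {
        this.maximumQueueDepth = maximumQueueDepth;
    }

    void enqueue(final Entry entry)
    {
        entries.addLast(entry);
        checkOverflow();
    }

    private void checkOverflow()
    {
        while (entries.size() > maximumQueueDepth)
        {
            // Walk from the least significant (oldest) entry towards the head
            // of the queue and delete the first entry that can be deleted,
            // skipping acquired non-stealable ones.
            boolean deleted = false;
            for (final Iterator<Entry> it = entries.iterator(); it.hasNext();)
            {
                if (it.next().deletable)
                {
                    it.remove();
                    deleted = true;
                    break;
                }
            }
            if (!deleted)
            {
                // Nothing can be deleted at the moment; give up instead of
                // retrying the same entry and spinning.
                break;
            }
        }
    }
}
{noformat}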
IMHO, it should be a reasonable compromise, as a message transitions into the
acquired non-stealable state when a consumer settles the message.
Non-transacted messaging use cases should not be affected by the reported
issue, as the message deletion happens immediately after settling. Thus, the
message would be in the acquired non-stealable state only for a very short
amount of time. As for transacted messaging use cases, the consumer can commit
its transaction with some delay after messages are settled. Thus, a transacted
settled message can remain in the acquired non-stealable state for some time
until the consuming transaction is committed. I strongly believe that messages
consumed using Session.CLIENT_ACKNOWLEDGE are not affected by the issue.
Though, if you see the issue manifesting for Session.CLIENT_ACKNOWLEDGE, please
let me know.
The idea behind the acquired non-stealable state is that a message cannot be
deleted on the broker side while it is in that state. It was added to solve
some concurrency issues with concurrent message deletion from multiple threads
(for example, a consumer thread and a management thread performing queue
clean-up).
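For illustration only, here is a simplified sketch of that concurrency aspect (not the Broker-J classes; the state names roughly mirror the ones mentioned above). Deletion is guarded by a compare-and-set, so at most one thread wins the right to delete an entry, and a settled (non-stealable) entry cannot be deleted at all:
{noformat}
import java.util.concurrent.atomic.AtomicReference;

public class EntryStateSketch
{
    enum State { AVAILABLE, ACQUIRED_STEALABLE, ACQUIRED_NON_STEALABLE, DELETED }

    private final AtomicReference<State> state = new AtomicReference<>(State.AVAILABLE);

    // A consumer that settles the message takes the non-stealable state, so
    // no other thread may delete the entry underneath it.
    boolean acquireNonStealable()
    {
        return state.compareAndSet(State.AVAILABLE, State.ACQUIRED_NON_STEALABLE);
    }

    // Deletion (e.g. by the ring policy or a management queue clean-up) only
    // succeeds while the entry is not held non-stealably; the CAS guarantees
    // that only one of the racing threads actually performs the delete.
    boolean tryDelete()
    {
        return state.compareAndSet(State.AVAILABLE, State.DELETED)
               || state.compareAndSet(State.ACQUIRED_STEALABLE, State.DELETED);
    }
}
{noformat}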
Potentially, the issue could be solved by not putting the message into the
acquired non-stealable state on settling a transacted message. For example,
transacted messages could remain in the acquired stealable state until the
transaction is committed. Though, that might require a lot more changes to the
code. Please feel free to try to implement it, if the provided implementation
is not acceptable for you.
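As a rough sketch of that alternative (purely illustrative; the names below are not Broker-J API), the settle step of a transacted consumer would only record the entry, and the exclusive non-stealable ownership would be taken when the transaction commits:
{noformat}
import java.util.ArrayList;
import java.util.List;

public class DeferredSettleSketch
{
    interface SettledEntry
    {
        // Hypothetical commit-time action: take non-stealable ownership and
        // delete the entry.
        void makeNonStealableAndDelete();
    }

    private final List<SettledEntry> settledInTransaction = new ArrayList<>();

    // On settle within a transaction: remember the entry but leave it in the
    // stealable acquired state, so the ring policy could still evict it if
    // the queue overflows before the commit.
    void onSettle(final SettledEntry entry)
    {
        settledInTransaction.add(entry);
    }

    // On commit: only now is exclusive ownership taken and the entry deleted.
    void commit()
    {
        settledInTransaction.forEach(SettledEntry::makeNonStealableAndDelete);
        settledInTransaction.clear();
    }
}
{noformat}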
Kind Regards,
Alex
> High CPU usage and stuck connections
> ------------------------------------
>
> Key: QPID-8514
> URL: https://issues.apache.org/jira/browse/QPID-8514
> Project: Qpid
> Issue Type: Bug
> Components: Broker-J
> Affects Versions: qpid-java-broker-8.0.4
> Reporter: Daniil Kirilyuk
> Priority: Minor
> Attachments: QPID-8514 - broker-log.zip,
> RingPolicyConcurrencyTest.java, threaddump.zip
>
>
> There were around 600 connections in the CLOSE_WAIT state:
> {noformat}
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep TIME_WAIT | wc -l
> 38
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep CLOSE_WAIT | wc -l
> 625
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep -v CLOSE_WAIT | grep -v TIME_WAIT | wc -l
> 7
> {noformat}
> CPU usage rises to 90% and higher.
> Connections couldn't be closed via the web GUI. The broker log contains the
> message below, but the connections were not closed.
> {noformat}
> 2021-03-26 15:12:28,474 INFO [Broker-Config] (q.m.c.model_delete) - [mng:3hpU+miq(admin@/127.0.0.1:62743)] [con:751(TEST@/127.0.0.1:40422/default)] CON-1007 : Connection close initiated by operator
> {noformat}
> Thread dumps show around 49 threads with the same stack trace related to the
> ring queue:
> {noformat}
> java.lang.Thread.State: RUNNABLE
> at org.apache.qpid.server.queue.QueueEntryImpl.addStateChangeListener(QueueEntryImpl.java:674)
> at org.apache.qpid.server.queue.QueueEntryImpl.acquireOrSteal(QueueEntryImpl.java:316)
> at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1913)
> at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1863)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.checkOverflow(RingOverflowPolicyHandler.java:102)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.access$000(RingOverflowPolicyHandler.java:44)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler.checkOverflow(RingOverflowPolicyHandler.java:41)
> at org.apache.qpid.server.queue.AbstractQueue.doEnqueue(AbstractQueue.java:1382)
> at org.apache.qpid.server.queue.AbstractQueue.enqueue(AbstractQueue.java:1290)
> at org.apache.qpid.server.message.RoutingResult$1.postCommit(RoutingResult.java:135)
> at org.apache.qpid.server.txn.AsyncAutoCommitTransaction$3.postCommit(AsyncAutoCommitTransaction.java:309)
> at org.apache.qpid.server.txn.AsyncCommand.complete(AsyncCommand.java:84)
> at org.apache.qpid.server.protocol.v1_0.StandardReceivingLinkEndpoint.receiveComplete(StandardReceivingLinkEndpoint.java:588)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0.lambda$receivedComplete$5(Session_1_0.java:1343)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0$$Lambda$142/959267941.accept(Unknown Source)
> at java.lang.Iterable.forEach(Iterable.java:75)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0.receivedComplete(Session_1_0.java:1343)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.lambda$receivedComplete$11(AMQPConnection_1_0Impl.java:1349)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl$$Lambda$141/1774823837.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.receivedComplete(AMQPConnection_1_0Impl.java:1347)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.onReceive(AMQPConnection_1_0Impl.java:1328)
> at org.apache.qpid.server.transport.AbstractAMQPConnection.lambda$received$2(AbstractAMQPConnection.java:576)
> at org.apache.qpid.server.transport.AbstractAMQPConnection$$Lambda$119/6380061.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.qpid.server.transport.AbstractAMQPConnection.received(AbstractAMQPConnection.java:571)
> at org.apache.qpid.server.transport.MultiVersionProtocolEngine.received(MultiVersionProtocolEngine.java:135)
> at org.apache.qpid.server.transport.NonBlockingConnection.processAmqpData(NonBlockingConnection.java:611)
> at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:58)
> at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:496)
> at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:270)
> at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:134)
> at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:575)
> at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:366)
> at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:97)
> at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:533)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
> at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$10/1487500813.run(Unknown Source)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Also, the ring queue behaves strangely: the queue {{DeadLetterQueue}} has its
> maximum message depth set to {{1}}, but there were 55 messages in it in the
> {{Acquired}} state.
> First message in {{DeadLetterQueue}} - {{2021-03-25, 16:50:59.381 (00:00 UTC)}}.
> Last message in {{DeadLetterQueue}} - {{2021-03-25, 16:50:59.683 (00:00 UTC)}}.
> These times correlate to the drain request that timed out at 16:52:00,410,
> because the default drain request timeout is 60 seconds:
> {noformat}
> 2021-03-25 16:52:00,408 INFO [org.apache.qpid.jms.JmsSession] (QpidJMS Connection Executor: ID:986cd915-d072-4711-bd71-50a74aab4be8:1) A JMS MessageConsumer has been closed: JmsConsumerInfo: { ID:986cd915-d072-4711-bd71-50a74aab4be8:1:1:1, destination = broadcast.XX.PRISMA_LIVERiskLimitTrigger }
> 2021-03-25 16:52:00,410 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #3 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger' - trying to recover. Cause: Remote did not respond to a drain request in time
> 2021-03-25 16:52:01,414 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #6 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger'
> {noformat}
> After the problem was triggered, the broker started to reject client connections:
> {noformat}
> 2021-03-25 17:55:43,716 INFO [Broker-Config] (q.m.c.open) - [con:901(/127.0.0.1:56012)] CON-1001 : Open : Destination : amqp(127.0.0.1:22101) : Protocol Version : 1.0
> 2021-03-25 17:55:43,745 INFO [IO-/127.0.0.1:56012] (q.m.c.open) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1001 : Open : Destination : amqp(172.16.75.197:22101) : Protocol Version : 1.0 : Client ID : ID:af50d087-0610-406a-9792-ffa5bf6b68b4:5 : Client Version : 0.51.0 : Client Product : QpidJMS
> 2021-03-25 17:55:43,747 INFO [IO-/127.0.0.1:56012] (q.m.a.denied) - [con:901(ARP@/127.0.0.1:56012/default)] ACL-1002 : Denied : Access Virtualhost NAME=default,VIRTUALHOST_NAME=default
> 2021-03-25 17:55:43,755 INFO [Broker-Config] (q.m.c.dropped_connection) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1004 : Connection dropped
> {noformat}
> The issue seems to be quite rare: the same test was run in a loop over
> several days and the issue was not triggered.