[ 
https://issues.apache.org/jira/browse/QPID-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922322#comment-17922322
 ] 

Ram Mantripragada commented on QPID-8681:
-----------------------------------------

[~rgodfrey] As I mentioned earlier, [~sudheersana] has invested significant 
time and effort in identifying the root cause and developing a solution for the 
performance issue we encountered, which is crucial to our specific use case. My 
intention in providing detailed information in the Jira ticket, along with the 
proposed solution, was to help maintainers like you quickly understand the 
problem and the solution, so that we could get buy-in to move forward with 
Sudheer's PR, which he raised a few hours after the ticket was created. It 
seems this may have led to some confusion, as a result of which Sudheer is 
unable to move ahead with his PR, which affects his rightful credit. My honest 
intention here is to ensure that Sudheer receives the appropriate credit for 
his work and does not suffer because of this confusion. I kindly request that 
you revert the recent direct commit you made to the main branch (based on the 
screenshot), as this will take only a few minutes of your valuable time. 
Sudheer will soon submit a PR that includes details of the testing he has 
performed and validation that the issue no longer occurs in our scenario after 
this change. These details in the PR would also be very useful to anyone who 
later comes across the PR and tries to understand the importance of the change. 
This approach will let Sudheer get his rightful credit for the contribution 
and the effort he put into resolving this issue, which will be important for 
him within our organization as well. Thank you again for your ongoing 
contributions to the community, and for your understanding of this request. I 
hope you’ll consider this in the right spirit of mutual collaboration, and I 
appreciate your help in ensuring Sudheer’s hard work is recognized.

> Addressing lock contention in Sorted Queues under high load by optimizing 
> property fetching
> -------------------------------------------------------------------------------------------
>
>                 Key: QPID-8681
>                 URL: https://issues.apache.org/jira/browse/QPID-8681
>             Project: Qpid
>          Issue Type: Improvement
>          Components: Broker-J
>    Affects Versions: qpid-java-broker-7.0.9
>            Reporter: Ram Mantripragada
>            Assignee: Robert Godfrey
>            Priority: Critical
>              Labels: contention, performance
>             Fix For: qpid-java-broker-9.2.1
>
>         Attachments: image-2025-01-27-22-41-16-668.png, 
> image-2025-01-27-22-42-13-739.png
>
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> *Summary:*
> Apache Qpid Broker-J provides Sorted Queues, which allow users to implement a 
> message re-enqueue with delay feature. This feature is crucial in scenarios 
> where messages dequeued from regular queues cannot be processed immediately 
> due to certain preconditions (e.g., available resources, concurrency limits). 
> These messages are re-enqueued to Sorted Queues, where they are sorted based 
> on their delay expiry time. Periodic jobs then check these Sorted Queues for 
> expired messages and move them back to the regular queues for processing.
> Under high load conditions (e.g., a re-enqueue rate of ~7,000 messages per 
> second), we observed that the broker experiences contention issues, causing 
> it to stop responding to REST API calls (which time out if REST timeouts are 
> set on the client side, and otherwise hang indefinitely). These APIs are used 
> to periodically fetch queue statistics.
> *Analysis:*
> By analyzing Java Flight Recorder (JFR) data, we identified the root cause of 
> the contention:
>  # REST API calls to retrieve queue depths required querying some specific 
> predefined properties from the broker.
>  # However, for each requested property, the broker was inadvertently 
> fetching all 26 properties, resulting in repeated and excessive lock 
> acquisition attempts on the Sorted Queue data structure.
>  # Among the requested properties, the *oldestMessageAge* property (used for 
> delay queues) significantly contributed to the contention by increasing the 
> number of lock requests.
>  # On the application side, querying for the *oldestMessageAge* property is 
> unnecessary when dealing with delay queues, so avoiding this query will 
> further mitigate the contention issue.
> *Stack Trace:*
> The following stack trace illustrates the contention observed during the 
> issue:
> {code:java}
>     at org.apache.qpid.server.queue.SortedQueueEntryList.next(SortedQueueEntryList.java:292)
>     at org.apache.qpid.server.queue.SortedQueueEntryList$QueueEntryIteratorImpl.atTail(SortedQueueEntryList.java:686)
>     at org.apache.qpid.server.queue.SortedQueueEntryList$QueueEntryIteratorImpl.advance(SortedQueueEntryList.java:698)
>     at org.apache.qpid.server.queue.SortedQueueEntryList.getOldestEntry(SortedQueueEntryList.java:350)
>     at org.apache.qpid.server.queue.AbstractQueue.getOldestMessageArrivalTime(AbstractQueue.java:1518)
>     at org.apache.qpid.server.queue.AbstractQueue.getOldestMessageAge(AbstractQueue.java:1546)
>     at jdk.internal.reflect.GeneratedMethodAccessor167.invoke(Unknown Source:-1)
>     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:566)
>     at org.apache.qpid.server.model.ConfiguredObjectMethodAttributeOrStatistic.getValue(ConfiguredObjectMethodAttributeOrStatistic.java:68)
>     at org.apache.qpid.server.model.ConfiguredObjectMethodStatistic.getValue(ConfiguredObjectMethodStatistic.java:26)
>     at org.apache.qpid.server.model.AbstractConfiguredObject.getStatistics(AbstractConfiguredObject.java:3181)
>     at org.apache.qpid.server.queue.SortedQueueImplWithAccessChecking.getStatistics(SortedQueueImplWithAccessChecking.java:42)
>     at org.apache.qpid.server.model.AbstractConfiguredObject.getStatistics(AbstractConfiguredObject.java:3168)
>     at org.apache.qpid.server.management.plugin.servlet.query.ConfiguredObjectExpressionFactory$ConfiguredObjectPropertyExpression.getValue(ConfiguredObjectExpressionFactory.java:312)
>     at org.apache.qpid.server.management.plugin.servlet.query.ConfiguredObjectExpressionFactory$ConfiguredObjectPropertyExpression.evaluate(ConfiguredObjectExpressionFactory.java:285)
>     at org.apache.qpid.server.management.plugin.servlet.query.ConfiguredObjectExpressionFactory$ConfiguredObjectPropertyExpression.evaluate(ConfiguredObjectExpressionFactory.java:272)
>     at org.apache.qpid.server.filter.ComparisonExpression.evaluate(ComparisonExpression.java:388)
>     at org.apache.qpid.server.filter.ComparisonExpression.matches(ComparisonExpression.java:580)
>     at org.apache.qpid.server.management.plugin.servlet.query.ConfiguredObjectQuery.filterObjects(ConfiguredObjectQuery.java:210)
>     at org.apache.qpid.server.management.plugin.servlet.query.ConfiguredObjectQuery.<init>(ConfiguredObjectQuery.java:86)
>     at org.apache.qpid.server.management.plugin.servlet.rest.QueryServlet.performQuery(QueryServlet.java:93)
>     at org.apache.qpid.server.management.plugin.servlet.rest.QueryServlet.doGet(QueryServlet.java:56)
>     at org.apache.qpid.server.management.plugin.servlet.rest.AbstractServlet.doGet(AbstractServlet.java:128)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> {code}
>  * Java Flight Recorder (JFR) data
> !image-2025-01-27-22-41-16-668.png!
> *Steps to Reproduce:*
>  # Configure a Qpid Broker with Sorted Queues.
>  # Enable REST APIs for queue statistics retrieval.
>  # Simulate high load by re-enqueuing ~7,000 messages per second to the 
> Sorted Queues.
>  # Monitor broker performance and REST API response times.
> *Expected Behavior:* The broker should handle high re-enqueue rates without 
> contention issues, and REST API calls for queue statistics should not time 
> out.
> *Actual Behavior:* Under high load conditions, the broker experiences 
> contention on the Sorted Queue data structure, leading to REST API timeouts.
> *Proposed Solution:* 
> To address this issue, we propose optimizing the property fetching mechanism 
> in the Qpid Broker in ConfiguredObjectExpressionFactory.java:
>  # Modify the broker code (ConfiguredObjectExpressionFactory) to retrieve 
> only the specifically requested properties rather than all 26 properties.  
> !image-2025-01-27-22-42-13-739.png!
>  # Avoid querying the oldestMessageAge property on the application side when 
> dealing with delay queues, as it is not required for processing.
>  # These optimizations will reduce the load on the Sorted Queue's locking 
> mechanism and prevent redundant data fetches, thereby improving the broker's 
> responsiveness under high load conditions.
> We have tested the proposed changes in our environment and observed 
> significant performance improvements, including reduced lock contention and 
> faster REST API responses under high load.
>  

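For readers of the quoted issue, the first item of the proposed solution (retrieve only the requested property instead of the full statistics map) can be illustrated with a minimal, self-contained sketch. This is not the Broker-J code or the change in the PR: the class `FakeQueue`, the helpers `readViaFullMap` and `readSelective`, and the statistic values are all hypothetical stand-ins, and the counter merely simulates how often an expensive, lock-taking statistic such as oldestMessageAge would be evaluated.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class StatisticFetchSketch {

    /** Hypothetical stand-in for a queue that exposes named statistics. */
    static final class FakeQueue {
        // Counts evaluations of the "expensive" statistic (simulated lock acquisitions).
        final AtomicInteger expensiveCalls = new AtomicInteger();
        private final Map<String, Supplier<Object>> statistics = new LinkedHashMap<>();

        FakeQueue() {
            // Illustrative values only; a real broker computes these from queue state.
            statistics.put("queueDepthMessages", () -> 42);
            statistics.put("oldestMessageAge", () -> {
                expensiveCalls.incrementAndGet();
                return 1000L;
            });
        }

        /** Evaluates every statistic, even if the caller needs only one. */
        Map<String, Object> getAllStatistics() {
            Map<String, Object> result = new LinkedHashMap<>();
            statistics.forEach((name, supplier) -> result.put(name, supplier.get()));
            return result;
        }

        /** Evaluates only the statistics that were asked for. */
        Map<String, Object> getStatistics(List<String> names) {
            Map<String, Object> result = new LinkedHashMap<>();
            for (String name : names) {
                Supplier<Object> supplier = statistics.get(name);
                if (supplier != null) {
                    result.put(name, supplier.get());
                }
            }
            return result;
        }
    }

    /** Pattern described in the analysis: build the full map, then pick one entry. */
    static Object readViaFullMap(FakeQueue queue, String name) {
        return queue.getAllStatistics().get(name);
    }

    /** Pattern described in the proposal: request only the named statistic. */
    static Object readSelective(FakeQueue queue, String name) {
        return queue.getStatistics(List.of(name)).get(name);
    }

    public static void main(String[] args) {
        FakeQueue full = new FakeQueue();
        readViaFullMap(full, "queueDepthMessages");
        System.out.println("full map, expensive evaluations: " + full.expensiveCalls.get());

        FakeQueue selective = new FakeQueue();
        readSelective(selective, "queueDepthMessages");
        System.out.println("selective, expensive evaluations: " + selective.expensiveCalls.get());
    }
}
```

In this sketch, reading the queue depth through the full map still evaluates the expensive statistic once per read, while the selective read never touches it, which is the effect the quoted proposal aims for.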


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

