janhoy commented on PR #4236:
URL: https://github.com/apache/solr/pull/4236#issuecomment-4242851188

   New development: I installed an instrumented version of Solr in a client test 
environment, where the deadlock last occurred after about two weeks. 
The instrumented version prints additional log lines with semaphore 
statistics and thread stats, and tries to detect leaks by tracking which threads 
did not release their permit. It also prints error logs for the two 
suspected code paths that are patched in this PR:
   - Jetty's double registration of `onRequestQueued`
   - CompletableFuture retry path
   
   Here is a sample of the log output:
   
   ```
   14.4.2026 08:59:38 WARN  Http2SolrClient$AsyncTracker 
event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 
inflight=0 net_outstanding=0 acquires_total=2811 releases_total=2811 
threads_running=19 threads_waiting=88 threads_timed_waiting=20 
threads_blocked=0 threads_total=127
   14.4.2026 09:00:38 WARN  Http2SolrClient$AsyncTracker 
event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 
inflight=0 net_outstanding=0 acquires_total=2811 releases_total=2811 
threads_running=19 threads_waiting=88 threads_timed_waiting=20 
threads_blocked=0 threads_total=127
   14.4.2026 09:01:12 ERROR Http2SolrClient$AsyncTracker 
event=double_registration_prevented method=POST 
url="http://my-host:8983/solr/my-collection_shard1_replica_n6/select" 
permits_available=998 permits_max=1000 msg="Jetty fired queuedListener twice 
for same Request — permit leak prevented by idempotency guard"
   14.4.2026 09:01:12 ERROR Http2SolrClient$AsyncTracker 
event=double_registration_prevented method=POST 
url="http://my-host:8983/solr/my-collection_shard1_replica_n6/select" 
permits_available=998 permits_max=1000 msg="Jetty fired queuedListener twice 
for same Request — permit leak prevented by idempotency guard"
   14.4.2026 09:01:38 WARN  Http2SolrClient$AsyncTracker 
event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 
inflight=0 net_outstanding=0 acquires_total=2815 releases_total=2815 
threads_running=19 threads_waiting=87 threads_timed_waiting=21 
threads_blocked=0 threads_total=127
   14.4.2026 09:02:38 WARN  Http2SolrClient$AsyncTracker 
event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 
inflight=0 net_outstanding=0 acquires_total=2815 releases_total=2815 
threads_running=19 threads_waiting=87 threads_timed_waiting=21 
threads_blocked=0 threads_total=127
   ```
   
   This cluster has been running with some test traffic, in a real k8s env with 
linkerd mesh, for about 3 days. During that time the 
`double_registration_prevented` event occurred 223 times. This agrees well with 
the full depletion of the 1000 permits over two weeks that we experienced last 
time (223/3*14 ≈ 1041). 
   <img width="1653" height="136" alt="Skjermbilde 2026-04-14 kl  11 26 56" 
src="https://github.com/user-attachments/assets/b0fa6509-df58-457d-a4c2-c0112d1a0e0f"
 />
   
   So that is a strong indication that the Jetty double firing was the main root 
cause in our case.
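
As a rough check of that extrapolation, here is a minimal sketch in Java. It assumes that each double registration would have leaked exactly one permit without the guard and that the rate stays constant over time; both are simplifications. The class and method names are made up for illustration.

```java
// Back-of-the-envelope projection of leaked permits (illustrative only).
final class PermitLeakEstimate {

  /** Linear projection: events seen over observedDays, scaled to targetDays. */
  static double projectedLeaks(int events, double observedDays, double targetDays) {
    return events / observedDays * targetDays;
  }

  public static void main(String[] args) {
    // 223 events in about 3 days, projected over the 14 days before the deadlock.
    double leaks = projectedLeaks(223, 3, 14);
    System.out.printf("projected leaked permits in 14 days: %.0f%n", leaks); // 1041
    System.out.println("exceeds the 1000 permits: " + (leaks > 1000)); // true
  }
}
```

So at the observed rate, roughly two weeks is about what it takes to drain a pool of 1000 permits.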
   
   I'll clean up this PR branch and make it ready for merge and back-port. This 
PR includes:
   
   - Fix the Jetty double-fire issue by adding a `PERMIT_ACQUIRED_ATTR` 
idempotency guard
   - Dispatch error-retry in CompletableFuture to an executor
   - Add a new metric gauge to keep an eye on async permits: 
`solr.http.client.async_permits`
   - Make the semaphore size configurable through sysprop 
`solr.http.client.async_requests.max`
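
To illustrate the idea behind the idempotency guard, here is a simplified, self-contained sketch. It is not the actual patch: `FakeRequest` is a hypothetical stand-in for a request that carries attributes, the attribute value and method names are assumptions, and reading the sysprop via `Integer.getInteger` with a default of 1000 is only one plausible way to wire up the configuration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch of an idempotent permit guard; names other than PERMIT_ACQUIRED_ATTR are hypothetical.
final class IdempotentPermitGuard {
  // Key name taken from the PR description; the stored value is an assumption.
  static final String PERMIT_ACQUIRED_ATTR = "PERMIT_ACQUIRED_ATTR";

  /** Hypothetical stand-in for a client request with an attribute map. */
  static final class FakeRequest {
    final Map<String, Object> attributes = new ConcurrentHashMap<>();
  }

  private final Semaphore permits;

  IdempotentPermitGuard(int maxPermits) {
    this.permits = new Semaphore(maxPermits);
  }

  /** Queued listener: acquires a permit only the first time it fires for a request. */
  void onQueued(FakeRequest req) {
    // putIfAbsent returns null only for the first caller, so a second firing is a no-op.
    if (req.attributes.putIfAbsent(PERMIT_ACQUIRED_ATTR, Boolean.TRUE) == null) {
      permits.acquireUninterruptibly();
    }
  }

  /** Completion listener: releases only if this request actually holds a permit. */
  void onComplete(FakeRequest req) {
    if (req.attributes.remove(PERMIT_ACQUIRED_ATTR) != null) {
      permits.release();
    }
  }

  int available() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    // Assumed wiring: pool size from the sysprop, defaulting to 1000.
    int max = Integer.getInteger("solr.http.client.async_requests.max", 1000);
    IdempotentPermitGuard guard = new IdempotentPermitGuard(max);
    FakeRequest req = new FakeRequest();

    guard.onQueued(req);
    guard.onQueued(req); // simulated double firing: no second permit is taken
    System.out.println("available after double queue: " + guard.available());

    guard.onComplete(req);
    System.out.println("available after complete: " + guard.available());
  }
}
```

Without the `putIfAbsent` check, the second `onQueued` call would take a second permit that the single completion callback never returns, which is the leak pattern described above.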
   
   I plan to merge this week unless concerns are raised.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]