janhoy commented on PR #4236: URL: https://github.com/apache/solr/pull/4236#issuecomment-4242851188
New development: I installed an instrumented version of Solr in the client's test environment, where the deadlock last occurred after about two weeks. The instrumented version prints additional log lines with semaphore and thread statistics, and tries to detect leaks by monitoring which threads did not release their permit. It also prints error logs for the two suspected code paths that are patched in this PR:

- Jetty's double registration of `onRequestQueued`
- The CompletableFuture retry path

Here is a sample of the log output:

```
14.4.2026 08:59:38 WARN Http2SolrClient$AsyncTracker event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 inflight=0 net_outstanding=0 acquires_total=2811 releases_total=2811 threads_running=19 threads_waiting=88 threads_timed_waiting=20 threads_blocked=0 threads_total=127
14.4.2026 09:00:38 WARN Http2SolrClient$AsyncTracker event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 inflight=0 net_outstanding=0 acquires_total=2811 releases_total=2811 threads_running=19 threads_waiting=88 threads_timed_waiting=20 threads_blocked=0 threads_total=127
14.4.2026 09:01:12 ERROR Http2SolrClient$AsyncTracker event=double_registration_prevented method=POST url="http://my-host:8983/solr/my-collection_shard1_replica_n6/select" permits_available=998 permits_max=1000 msg="Jetty fired queuedListener twice for same Request — permit leak prevented by idempotency guard"
14.4.2026 09:01:12 ERROR Http2SolrClient$AsyncTracker event=double_registration_prevented method=POST url="http://my-host:8983/solr/my-collection_shard1_replica_n6/select" permits_available=998 permits_max=1000 msg="Jetty fired queuedListener twice for same Request — permit leak prevented by idempotency guard"
14.4.2026 09:01:38 WARN Http2SolrClient$AsyncTracker event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 inflight=0 net_outstanding=0 acquires_total=2815 releases_total=2815 threads_running=19 threads_waiting=87 threads_timed_waiting=21 threads_blocked=0 threads_total=127
14.4.2026 09:02:38 WARN Http2SolrClient$AsyncTracker event=async_tracker_stats permits=1000 permits_max=1000 permits_used=0 inflight=0 net_outstanding=0 acquires_total=2815 releases_total=2815 threads_running=19 threads_waiting=87 threads_timed_waiting=21 threads_blocked=0 threads_total=127
```

This cluster has been running with some test traffic, in a real k8s environment with a linkerd mesh, for about 3 days. During that time the `double_registration_prevented` event occurred 223 times. That fits well with the full depletion of the 1000 permits over two weeks that we experienced last time (223/3*14 ≈ 1041).

<img width="1653" height="136" alt="Skjermbilde 2026-04-14 kl 11 26 56" src="https://github.com/user-attachments/assets/b0fa6509-df58-457d-a4c2-c0112d1a0e0f" />

So that is a strong indication that the Jetty double firing was the main root cause in our case. I'll clean up this PR branch and make it ready for merge and back-port.

This PR includes:

- Fix the Jetty double-fire issue by adding a `PERMIT_ACQUIRED_ATTR` idempotency guard
- Dispatch the error retry in CompletableFuture to an executor
- Add a new metric gauge to keep an eye on async permits: `solr.http.client.async_permits`
- Make the semaphore size configurable through the sysprop `solr.http.client.async_requests.max`

I plan to merge during this week unless concerns are voiced.
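
For readers who want to see the shape of the idempotency guard, here is a minimal, self-contained sketch. It is not the code from this PR: `PermitGuardSketch` and `FakeRequest` are hypothetical stand-ins (a plain attribute map replaces the real Jetty request), and `acquireUninterruptibly()` is used only to keep the example short. The only names taken from the comment above are `PERMIT_ACQUIRED_ATTR` and the sysprop `solr.http.client.async_requests.max`; the default of 1000 mirrors `permits_max` in the logs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class PermitGuardSketch {
  // Attribute name taken from the PR description; how it is stored here is an assumption.
  static final String PERMIT_ACQUIRED_ATTR = "PERMIT_ACQUIRED_ATTR";

  /** Hypothetical stand-in for an HTTP request that can carry attributes. */
  static final class FakeRequest {
    final Map<String, Object> attributes = new ConcurrentHashMap<>();
  }

  private final Semaphore permits;

  PermitGuardSketch(int maxPermits) {
    this.permits = new Semaphore(maxPermits);
  }

  /** Queued callback: takes a permit only the first time it fires for a given request. */
  void onRequestQueued(FakeRequest req) {
    // putIfAbsent returns null only for the first caller, so a second firing is a no-op.
    if (req.attributes.putIfAbsent(PERMIT_ACQUIRED_ATTR, Boolean.TRUE) == null) {
      permits.acquireUninterruptibly();
    }
  }

  /** Completion callback: gives the permit back at most once per request. */
  void onComplete(FakeRequest req) {
    // remove returns non-null only if this request actually holds a permit.
    if (req.attributes.remove(PERMIT_ACQUIRED_ATTR) != null) {
      permits.release();
    }
  }

  int available() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    // Read the pool size from the sysprop, falling back to 1000 when it is unset.
    int max = Integer.getInteger("solr.http.client.async_requests.max", 1000);
    PermitGuardSketch guard = new PermitGuardSketch(max);
    FakeRequest req = new FakeRequest();

    guard.onRequestQueued(req);
    guard.onRequestQueued(req); // simulated double firing of the queued listener
    System.out.println("available after double fire: " + guard.available());

    guard.onComplete(req);
    System.out.println("available after completion: " + guard.available());
  }
}
```

Without the attribute check, the second `onRequestQueued` call would take a second permit that the single completion never returns, which is the kind of slow leak the `double_registration_prevented` log lines point at.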
