Mark Robert Miller created SOLR-18244:
-----------------------------------------

             Summary: Fix concurrency bugs in HttpShardHandler / 
ParallelHttpShardHandler that can hang searches or crash coordinator
                 Key: SOLR-18244
                 URL: https://issues.apache.org/jira/browse/SOLR-18244
             Project: Solr
          Issue Type: Bug
            Reporter: Mark Robert Miller


Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that
could leave search threads parked in HttpShardHandler.take() indefinitely or 
crash the
coordinator under load. (1) ParallelHttpShardHandler.cancelAll and 
makeShardRequest
now synchronize their submitFutures operations on the same monitor that
HttpShardHandler uses, so the canceled flag, responseFutureMap, the responses 
queue's CANCELLATION_NOTIFICATION, and submitFutures all transition atomically 
— matching the base class's cancellation invariant. (2) 
RejectedExecutionException from
commExecutor (queue full or executor shutting down) is now caught and routed 
through recordShardSubmitError as a SERVICE_UNAVAILABLE shard failure instead 
of propagating synchronously out of submit() and crashing SearchHandler's 
distributed loop — preserving shards.tolerant=true semantics under thread pool 
saturation. (3) The
Parallel runnable now reads its own outer CompletableFuture via an 
AtomicReference
holder and short-circuits when cancelled before scheduling, avoiding a wasted
lbClient.requestAsync that super.makeShardRequest would just immediately cancel.
(4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps 
response processing in a try/catch — any throwable (including from a subclass
transformResponse) is recorded on the ShardResponse and queued under the 
canceled
monitor, so a thrown response handler can no longer leave responseFutureMap 
populated with an entry that nothing will ever enqueue a response for. (5) 
HttpShardHandler.take() now polls the responses queue with a 50ms timeout 
instead of blocking indefinitely, closing a fast-path deadlock: when a subclass 
async tracker like submitFutures drains AFTER the coordinator re-evaluated 
responsesPending()==true and entered responses.take(), no further response will 
ever arrive to wake it. The polling loop re-checks responsesPending() 
periodically and exits cleanly when the tracker drains.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to