[ 
https://issues.apache.org/jira/browse/SOLR-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated SOLR-18244:
--------------------------------------
    Description: 
Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler 
that could leave search threads parked in HttpShardHandler.take() indefinitely 
or crash the coordinator under load.

(1) ParallelHttpShardHandler.cancelAll and makeShardRequest now synchronize 
their submitFutures operations on the same monitor that HttpShardHandler uses, 
so the canceled flag, responseFutureMap, the responses queue's 
CANCELLATION_NOTIFICATION, and submitFutures all transition atomically — 
matching the base class's cancellation invariant.

(2) RejectedExecutionException from commExecutor (queue full or executor 
shutting down) is now caught and routed through recordShardSubmitError as a 
SERVICE_UNAVAILABLE shard failure instead of propagating synchronously out of 
submit() and crashing SearchHandler's distributed loop — preserving 
shards.tolerant=true semantics under thread pool saturation.

(3) The Parallel runnable now reads its own outer CompletableFuture via an 
AtomicReference holder and short-circuits when cancelled before scheduling, 
avoiding a wasted lbClient.requestAsync that super.makeShardRequest would just 
immediately cancel.


(4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps 
response processing in a try/catch — any throwable (including from a subclass
transformResponse) is recorded on the ShardResponse and queued under the 
canceled monitor, so a thrown response handler can no longer leave 
responseFutureMap populated with an entry that nothing will ever enqueue a 
response for. (5) HttpShardHandler.take() now polls the responses queue with a 
50ms timeout instead of blocking indefinitely, closing a fast-path deadlock: 
when a subclass async tracker like submitFutures drains AFTER the coordinator 
re-evaluated responsesPending()==true and entered responses.take(), no further 
response will ever arrive to wake it. The polling loop re-checks 
responsesPending() periodically and exits cleanly when the tracker drains.

  was:
Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that
could leave search threads parked in HttpShardHandler.take() indefinitely or 
crash the
coordinator under load. (1) ParallelHttpShardHandler.cancelAll and 
makeShardRequest
now synchronize their submitFutures operations on the same monitor that
HttpShardHandler uses, so the canceled flag, responseFutureMap, the responses 
queue's CANCELLATION_NOTIFICATION, and submitFutures all transition atomically 
— matching the base class's cancellation invariant. (2) 
RejectedExecutionException from
commExecutor (queue full or executor shutting down) is now caught and routed 
through recordShardSubmitError as a SERVICE_UNAVAILABLE shard failure instead 
of propagating synchronously out of submit() and crashing SearchHandler's 
distributed loop — preserving shards.tolerant=true semantics under thread pool 
saturation. (3) The
Parallel runnable now reads its own outer CompletableFuture via an 
AtomicReference
holder and short-circuits when cancelled before scheduling, avoiding a wasted
lbClient.requestAsync that super.makeShardRequest would just immediately cancel.
(4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps 
response processing in a try/catch — any throwable (including from a subclass
transformResponse) is recorded on the ShardResponse and queued under the 
canceled
monitor, so a thrown response handler can no longer leave responseFutureMap 
populated with an entry that nothing will ever enqueue a response for. (5) 
HttpShardHandler.take() now polls the responses queue with a 50ms timeout 
instead of blocking indefinitely, closing a fast-path deadlock: when a subclass 
async tracker like submitFutures drains AFTER the coordinator re-evaluated 
responsesPending()==true and entered responses.take(), no further response will 
ever arrive to wake it. The polling loop re-checks responsesPending() 
periodically and exits cleanly when the tracker drains.


> Fix concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that can 
> hang searches or crash coordinator
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-18244
>                 URL: https://issues.apache.org/jira/browse/SOLR-18244
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Robert Miller
>            Priority: Major
>
> Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler 
> that could leave search threads parked in HttpShardHandler.take() 
> indefinitely or crash the coordinator under load.
> (1) ParallelHttpShardHandler.cancelAll and makeShardRequest now synchronize 
> their submitFutures operations on the same monitor that HttpShardHandler 
> uses, so the canceled flag, responseFutureMap, the responses queue's 
> CANCELLATION_NOTIFICATION, and submitFutures all transition atomically — 
> matching the base class's cancellation invariant.
> (2) RejectedExecutionException from commExecutor (queue full or executor 
> shutting down) is now caught and routed through recordShardSubmitError as a 
> SERVICE_UNAVAILABLE shard failure instead of propagating synchronously out of 
> submit() and crashing SearchHandler's distributed loop — preserving 
> shards.tolerant=true semantics under thread pool saturation.
> (3) The Parallel runnable now reads its own outer CompletableFuture via an 
> AtomicReference holder and short-circuits when cancelled before scheduling, 
> avoiding a wasted lbClient.requestAsync that super.makeShardRequest would 
> just immediately cancel.
> (4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps 
> response processing in a try/catch — any throwable (including from a subclass
> transformResponse) is recorded on the ShardResponse and queued under the 
> canceled monitor, so a thrown response handler can no longer leave 
> responseFutureMap populated with an entry that nothing will ever enqueue a 
> response for. (5) HttpShardHandler.take() now polls the responses queue with 
> a 50ms timeout instead of blocking indefinitely, closing a fast-path 
> deadlock: when a subclass async tracker like submitFutures drains AFTER the 
> coordinator re-evaluated responsesPending()==true and entered 
> responses.take(), no further response will ever arrive to wake it. The 
> polling loop re-checks responsesPending() periodically and exits cleanly when 
> the tracker drains.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to