[ 
https://issues.apache.org/jira/browse/SOLR-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092332#comment-18092332
 ] 

Jason Gerlowski commented on SOLR-18244:
----------------------------------------

(Just to avoid any ambiguity later, this appears on branch_9x as 
346b68c63516257e5a61d563e814d23f00a34be8.  The commit message is missing 
SOLR-18244 so gitbot just didn't pick it up is all.)

> Fix concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that can 
> hang searches or crash coordinator
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-18244
>                 URL: https://issues.apache.org/jira/browse/SOLR-18244
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Robert Miller
>            Assignee: Mark Robert Miller
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 10.1, 9.11
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Fix several concurrency bugs in ParallelHttpShardHandler
> - Catch RejectedExecutionException from commExecutor and route through
> recordShardSubmitError so saturation/shutdown does not propagate
> synchronously and abandon already-submitted shard requests; this
> preserves shards.tolerant semantics and returns a 503 instead of 500.
> - Synchronize submitFutures registration and removal on the same monitor
> HttpShardHandler uses for responseFutureMap so cancelAll observes a
> consistent view of both maps.
> - Skip work in the runAsync runnable when the outer future has already
> been cancelled, avoiding a wasted lbClient.requestAsync call.
> - In HttpShardHandler.takeCompletedIncludingErrors, replace the blocking
> responses.take() with a 50ms poll so subclasses that gate
> responsesPending() on an async tracker (e.g. submitFutures) cannot
> cause a lost-wakeup hang when the tracker drains between the pending
> check and the queue wait.
> - Wrap the inner whenComplete body in a try/catch so a thrown
> transformResponse (or other failure inside the lambda) still enqueues
> a response and unblocks take() instead of stranding the consumer.
> - Expose cancellationLock() and isCanceled() on HttpShardHandler so
> subclasses can synchronize on the same monitor that already guards
> the canceled flag, responseFutureMap, and the responses queue.
> - Synchronize ParallelHttpShardHandler.responsesPending() on the
> cancellation monitor. Without the lock, take()'s loop performs three
> unsynchronized isEmpty() reads (responseFutureMap, responses,
> submitFutures) and can transiently observe all three as empty even
> though super.makeShardRequest's put and outer.whenComplete's
> submitFutures.remove are causally ordered. The intermediate state is
> consistent with each individual read but never with the cross-map
> invariant, so take() exits with null and the pending response (which
> arrives moments later via the inner whenComplete) is silently dropped
> by SearchHandler. Acquiring the lock serializes the three reads with
> every state-mutating operation.
> - Stop using submitFutures.isEmpty() as the loop guard in
> responsesPending(); use an exact AtomicInteger (inFlightSubmits)
> instead. ConcurrentHashMap.size()/isEmpty() are documented estimates:
> sumCount() = baseCount + sum(counterCells), and under contended
> put/remove from many threads the per-cell deltas can settle at a
> non-zero value while the table is physically empty. When that
> happens, isEmpty() returns false on a logically empty map,
> responsesPending() stays true, and the parent thread parks in
> responses.take() forever. AtomicInteger.get() is exact under any
> concurrency. The counter is incremented before runAsync, decremented
> unconditionally in the runAsync future's whenComplete finally, and
> also decremented in the RejectedExecutionException catch (no future)
> and the canceled-before-track early return (no whenComplete). The
> submitFutures map is retained only as the iteration target for
> cancelAll.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to