Pierre Salagnac created SOLR-17754:
--------------------------------------
Summary: Race condition in overseer main loop leaves collection/shard/replica locked
Key: SOLR-17754
URL: https://issues.apache.org/jira/browse/SOLR-17754
Project: Solr
Issue Type: Task
Affects Versions: 9.8
Reporter: Pierre Salagnac
We hit a rare race condition in the overseer main loop that led to a fully stuck
overseer. Operations submitted to the collection API were then never dequeued
and never processed until the overseer was restarted.
For the race condition to occur, at least 100 concurrent tasks must be running
in the collection API, so this bug is reproducible only when the overseer is
under high load.
*Note on overseer threading*
The overseer main loop in class {{OverseerTaskProcessor}} processes tasks that
were previously enqueued to the overseer distributed queue (stored in
Zookeeper). This class follows a producer/consumer pattern:
- a single thread dequeues tasks from the distributed queue (referred to below
as the _main loop_ thread)
- before submitting a task, the _main loop_ thread tries to lock the
collection/shard/replica (depending on the task type). If the lock is already
held because another task is running on the same resource, the task is not
submitted and is reconsidered in a later iteration.
- a pool of up to 100 threads then executes the tasks (referred to below as the
_runner_ threads)
- at the end of the task, the _runner_ thread unlocks the
collection/shard/replica.
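To illustrate, here is a rough sketch of this pattern. The names
({{processDequeuedTask}}, {{tryLock}}, {{unlock}}) are illustrative, and a plain
{{ThreadPoolExecutor}} stands in for Solr's {{MDCAwareThreadPoolExecutor}}; this
is not the actual {{OverseerTaskProcessor}} code.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class OverseerLoopSketch {
  static final int MAX_PARALLEL_TASKS = 100;

  // Runner pool: up to 100 threads, backed by a SynchronousQueue (as described above).
  final ExecutorService runnerPool =
      new ThreadPoolExecutor(0, MAX_PARALLEL_TASKS, 60, TimeUnit.SECONDS, new SynchronousQueue<>());

  // One iteration of the main-loop thread for a single dequeued task.
  void processDequeuedTask(String lockPath, Runnable task) {
    // Try to lock the collection/shard/replica before handing the task to the pool.
    if (!tryLock(lockPath)) {
      return; // lock held by another task; reconsider in a later iteration
    }
    runnerPool.execute(() -> {
      try {
        task.run();
      } finally {
        // The runner thread releases the lock once the task completes (or fails).
        unlock(lockPath);
      }
    });
  }

  // Placeholders for the real collection/shard/replica lock manager.
  boolean tryLock(String path) { return true; }
  void unlock(String path) {}
}
{code}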
Since the pool uses a {{SynchronousQueue}}, it is not possible to submit more
than 100 concurrent tasks to the pool; otherwise the task is rejected and an
exception is raised (see the exception below). To avoid this exception, the
_main loop_ thread makes sure it does not submit more than 100 concurrent tasks
by keeping the IDs of the currently running tasks in an in-memory collection.
Each time a task is submitted, its ID is added to the {{runningTasks}}
collection. Once the _runner_ thread has completed the task (the same happens
in case of a failure), it removes the task ID from this collection. Before
submitting a task to the _runner_ pool, the _main loop_ thread checks the size
of {{runningTasks}}. If there are already 100 entries, the task is not
submitted but instead added to another collection ({{blockedTasks}}), so it can
be submitted to the pool in a later iteration, once the size of
{{runningTasks}} drops below 100 (the full mechanism is not detailed here).
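A simplified sketch of this bookkeeping, again with illustrative names rather
than the exact Solr fields:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class TaskBookkeepingSketch {
  static final int MAX_PARALLEL_TASKS = 100;

  // A SynchronousQueue-backed pool rejects a submission outright when all
  // 100 threads are busy, instead of queueing it.
  final ExecutorService runnerPool =
      new ThreadPoolExecutor(0, MAX_PARALLEL_TASKS, 60, TimeUnit.SECONDS, new SynchronousQueue<>());

  // IDs of tasks currently considered running.
  final Set<String> runningTasks = ConcurrentHashMap.newKeySet();
  // Tasks deferred because the pool was already considered full.
  final Map<String, Runnable> blockedTasks = new LinkedHashMap<>();

  // Called by the main-loop thread for each dequeued task.
  void maybeSubmit(String taskId, Runnable work) {
    if (runningTasks.size() >= MAX_PARALLEL_TASKS) {
      // Pool considered full: defer to a later iteration of the main loop.
      blockedTasks.put(taskId, work);
      return;
    }
    runningTasks.add(taskId);
    runnerPool.execute(() -> {
      try {
        work.run();
      } finally {
        // The runner thread removes the ID on completion (or failure),
        // before it has actually returned to the pool.
        runningTasks.remove(taskId);
      }
    });
  }
}
{code}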
*Root cause: race condition*
The bug is that the list of IDs in collection {{runningTasks}} is not fully
aligned with the list of _runner_ threads currently executing a task. Right
after a _runner_ thread removes its task ID, the thread is still running and
has not yet returned to the pool. But since the task ID is no longer in the
collection, the _main loop_ thread believes there is room available to submit a
new task to the pool. There is therefore a time window in which the _main loop_
thread may submit a 101st task to the pool, and the exception below is raised:
{code:java}
java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda/0x00000070013f2a30@53c6a1bb rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1f9c1653[Running, pool size = 100, active threads = 100, queued tasks = 0, completed tasks = 100]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2081) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376) ~[?:?]
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:361) ~[main/:?]
    at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:376) ~[main/:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
{code}
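The window can be demonstrated with a minimal, self-contained sketch (not the
actual Solr code). With a pool of size 1 backed by a {{SynchronousQueue}}, a
submission that races with a runner thread that has already removed its task ID
but not yet returned to the pool gets rejected; the rejection is
timing-dependent and will not trigger on every run.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RaceWindowSketch {
  public static void main(String[] args) {
    // Pool of a single thread with no queueing (mirrors the 100-thread runner pool).
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(0, 1, 60, TimeUnit.SECONDS, new SynchronousQueue<>());
    Set<String> runningTasks = ConcurrentHashMap.newKeySet();

    // "Runner" side: the task ID is removed before the worker thread has
    // returned to the pool and become idle again.
    runningTasks.add("task-1");
    pool.execute(() -> {
      try {
        // simulate work
      } finally {
        runningTasks.remove("task-1");
      }
    });

    // "Main loop" side: as soon as runningTasks claims there is room, submit
    // another task. If the worker thread is still unwinding, the
    // SynchronousQueue has no idle consumer and the pool rejects the task.
    while (!runningTasks.isEmpty()) {
      Thread.onSpinWait();
    }
    try {
      runningTasks.add("task-2");
      pool.execute(() -> runningTasks.remove("task-2"));
    } catch (RejectedExecutionException e) {
      System.out.println("Rejected: " + e);
    }
    pool.shutdown();
  }
}
{code}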
*Impact*
Once we hit the above exception, the lock on the collection/shard/replica is
not released (it was acquired before the task was submitted to the _runner_
thread pool). This means that all future tasks that require a lock on the same
collection/shard/replica won't be processed and will remain in the Zookeeper
queue forever, until the overseer is restarted.
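A sketch of why the lock leaks (illustrative names, not the actual Solr code):
the lock is taken by the _main loop_ thread before the task is handed to the
pool, and only the _runner_ thread ever releases it, so a rejected submission
leaves no thread responsible for releasing it.
{code:java}
import java.util.concurrent.ExecutorService;

class LockLeakSketch {
  void submit(ExecutorService runnerPool, String lockPath, Runnable work) {
    acquireLock(lockPath);              // taken by the main-loop thread
    runnerPool.execute(() -> {          // may throw RejectedExecutionException
      try {
        work.run();
      } finally {
        releaseLock(lockPath);          // the only place the lock is released
      }
    });
    // If execute() rejects the task, the exception propagates out of the main
    // loop iteration and lockPath stays locked until the overseer restarts.
  }

  void acquireLock(String path) { /* placeholder for the real lock manager */ }
  void releaseLock(String path) { /* placeholder for the real lock manager */ }
}
{code}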
Another impact is that the overseer never reads more than 1000 tasks from the
queue. Consequently, if more than 1000 tasks are blocked by the unreleased
lock, later tasks are not read (and not processed) even if they don't require
the same lock. Once 1000 tasks are waiting for the lock, the overseer is fully
blocked. It may take a long time between the initial race condition and a fully
blocked overseer (days/weeks, depending on the number of submitted API calls).