w1049 opened a new issue, #18157:
URL: https://github.com/apache/tvm/issues/18157

   ### Expected behavior
   
   The `PopenPoolExecutor` should shut down gracefully without causing 
deadlocks, especially when `__del__` is invoked during garbage collection.
   
   ### Actual behavior
   
   Calling `_threadpool.shutdown()` within `PopenPoolExecutor.__del__()` can 
lead to a deadlock. The `shutdown()` function attemps calls `join()` on all 
threads, which in turn attempts to acquire `_shutdown_locks_lock`. Due to the 
unpredictable timing of garbage collection, the `__del__()` function might try 
to acquire `_shutdown_locks_lock` while the current thread already holds it 
(e.g., if that thread is creating new threads), leading to a deadlock.
   
   This issue arises because the builder (and other components using a pool) 
relies on Python's `del` statement for pool cleanup, rather than explicitly 
calling a `shutdown()` method.
   According to Python's 
[documentation](https://docs.python.org/3/reference/datamodel.html#object.__del__):
   > Note `del x` doesn’t directly call `x.__del__()` — the former decrements 
the reference count for `x` by one, and the latter is only called when `x`’s 
reference count reaches zero.
   
   > `__del__()` can be invoked when arbitrary code is being executed, 
including from any arbitrary thread. If `__del__()` needs to take a lock or 
invoke any other blocking resource, it may deadlock as the resource may already 
be taken by the code that gets interrupted to execute `__del__()`.
   
   ### Environment
   - Operating System: 22.04.1-Ubuntu
   - Python v3.12.9
   - TVM version: commit 5e12a5cc97793a4bdbbe8cc213e82b80e704a3c9 
   
   ### Steps to reproduce
   Run a builder in loop:
   ```python
   import tvm
   from tvm.meta_schedule.builder import LocalBuilder, BuilderInput
   from tvm.script import relax as R
   import datetime
   
   
   # A simple Relax function
   @R.function
   def relax_matmul(
       data: R.Tensor(("n", 784), dtype="float32"),
       w0: R.Tensor((784, 128), dtype="float32"),
   ) -> R.Tensor(("n", 128), dtype="float32"):
       with R.dataflow():
           lv0 = R.matmul(data, w0)
           R.output(lv0)
       return lv0
   
   
   mod = tvm.IRModule.from_expr(relax_matmul)
   target = tvm.target.Target("nvidia/geforce-rtx-4090")
   builder = LocalBuilder()
   
   while True:
       builder_inputs = [BuilderInput(mod, target) for _ in 
range(builder.max_workers)]
       builder_results = builder.build(builder_inputs)
   
       timestamp = datetime.datetime.now().strftime("%H:%M:%S")
       print(f"[{timestamp}] Build Successful")
   ```
   
   Adding logging statements before and after `del` in local_builder.py, and 
within `__del__`, demonstrates the behavior: (the first number on each line is 
thread.ident)
   ```
   139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1c535130>...
   139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object 
at 0x7efb1c535130>...
   139623815455936: After del.
   [12:36:51] Build Successful
   139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1bb82150>...
   139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object 
at 0x7efb1bb82150>...
   139623815455936: After del.
   [12:37:06] Build Successful
   139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1b99c170>...
   139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1c535130>'s thread pool...
   139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1bb82150>'s thread pool...
   139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object 
at 0x7efb1b99c170>...
   139623815455936: After del.
   [12:37:48] Build Successful
   139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1c037ce0>...
   139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object 
at 0x7efb1c037ce0>...
   139623815455936: After del.
   [12:38:03] Build Successful
   139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1bb82150>...
   139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object 
at 0x7efb1bb82150>...
   139623815455936: After del.
   [12:38:18] Build Successful
   139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1b9821b0>...
   139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object 
at 0x7efb1b9821b0>...
   139623815455936: After del.
   [12:38:33] Build Successful
   139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1b9762a0>...
   139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1c037ce0>'s thread pool...
   139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor 
object at 0x7efb1bb82150>'s thread pool...
   139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object 
at 0x7efb1b9762a0>...
   139623815455936: After del.
   ```
   
   The deadlock may occur randomly after a long time (it took me several hours 
to identify the cause).
   To immediately trigger the deadlock, modify `threading.py` by adding 
`gc.collect()` within the `_maintain_shutdown_locks()` function:
   ```python
   def _maintain_shutdown_locks():
       """
       Drop any shutdown locks that don't correspond to running threads anymore.
   
       Calling this from time to time avoids an ever-growing _shutdown_locks
       set when Thread objects are not joined explicitly. See bpo-37788.
   
       This must be called with _shutdown_locks_lock acquired.
       """
       # If a lock was released, the corresponding thread has exited
       import gc
       gc.collect()
       to_remove = [lock for lock in _shutdown_locks if not lock.locked()]
       _shutdown_locks.difference_update(to_remove)
   ```
   
   The call stack at the time of the deadlock (captured via `py-spy dump`) is 
as follows:
   ```
   Thread 744194 (idle): "MainThread"
       wait (threading.py:355)
       wait (threading.py:655)
       start (threading.py:1001) <- Start a new thread (Dummy-1)
       _adjust_thread_count (concurrent/futures/thread.py:203)
       submit (concurrent/futures/thread.py:180)
       map (concurrent/futures/_base.py:608)
       map_with_error_catching (tvm/contrib/popen_pool.py:452)
       build (tvm/meta_schedule/builder/local_builder.py:164)
       method (tvm/meta_schedule/utils.py:76)
       build (tvm/meta_schedule/builder/builder.py:129)
       <module> (pool_bug.py:25)
   Thread 746704 (idle): "Dummy-1"
       _stop (threading.py:1106) <- try to acquire the lock again
       _wait_for_tstate_lock (threading.py:1173)
       join (threading.py:1151)
       shutdown (concurrent/futures/thread.py:239)
       __del__ (tvm/contrib/popen_pool.py:381) <- GC interrupts
       _maintain_shutdown_locks (threading.py:868)
       _set_tstate_lock (threading.py:1057) <- acquire the lock
       _bootstrap_inner (threading.py:1063)
       _bootstrap (threading.py:1034) <- Starting...
   ```
   
   ### Triage
   
   * needs-triage
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to