w1049 opened a new issue, #18157: URL: https://github.com/apache/tvm/issues/18157
### Expected behavior The `PopenPoolExecutor` should shut down gracefully without causing deadlocks, especially when `__del__` is invoked during garbage collection. ### Actual behavior Calling `_threadpool.shutdown()` within `PopenPoolExecutor.__del__()` can lead to a deadlock. The `shutdown()` function attemps calls `join()` on all threads, which in turn attempts to acquire `_shutdown_locks_lock`. Due to the unpredictable timing of garbage collection, the `__del__()` function might try to acquire `_shutdown_locks_lock` while the current thread already holds it (e.g., if that thread is creating new threads), leading to a deadlock. This issue arises because the builder (and other components using a pool) relies on Python's `del` statement for pool cleanup, rather than explicitly calling a `shutdown()` method. According to Python's [documentation](https://docs.python.org/3/reference/datamodel.html#object.__del__): > Note `del x` doesn’t directly call `x.__del__()` — the former decrements the reference count for `x` by one, and the latter is only called when `x`’s reference count reaches zero. > `__del__()` can be invoked when arbitrary code is being executed, including from any arbitrary thread. If `__del__()` needs to take a lock or invoke any other blocking resource, it may deadlock as the resource may already be taken by the code that gets interrupted to execute `__del__()`. ### Environment - Operating System: 22.04.1-Ubuntu - Python v3.12.9 - TVM version: commit 5e12a5cc97793a4bdbbe8cc213e82b80e704a3c9 ### Steps to reproduce Run a builder in loop: ```python import tvm from tvm.meta_schedule.builder import LocalBuilder, BuilderInput from tvm.script import relax as R import datetime # A simple Relax function @R.function def relax_matmul( data: R.Tensor(("n", 784), dtype="float32"), w0: R.Tensor((784, 128), dtype="float32"), ) -> R.Tensor(("n", 128), dtype="float32"): with R.dataflow(): lv0 = R.matmul(data, w0) R.output(lv0) return lv0 mod = tvm.IRModule.from_expr(relax_matmul) target = tvm.target.Target("nvidia/geforce-rtx-4090") builder = LocalBuilder() while True: builder_inputs = [BuilderInput(mod, target) for _ in range(builder.max_workers)] builder_results = builder.build(builder_inputs) timestamp = datetime.datetime.now().strftime("%H:%M:%S") print(f"[{timestamp}] Build Successful") ``` Adding logging statements before and after `del` in local_builder.py, and within `__del__`, demonstrates the behavior: (the first number on each line is thread.ident) ``` 139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>... 139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>... 139623815455936: After del. [12:36:51] Build Successful 139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>... 139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>... 139623815455936: After del. [12:37:06] Build Successful 139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b99c170>... 139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>'s thread pool... 139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>'s thread pool... 139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b99c170>... 139623815455936: After del. [12:37:48] Build Successful 139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>... 139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>... 139623815455936: After del. [12:38:03] Build Successful 139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>... 139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>... 139623815455936: After del. [12:38:18] Build Successful 139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9821b0>... 139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9821b0>... 139623815455936: After del. [12:38:33] Build Successful 139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9762a0>... 139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>'s thread pool... 139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>'s thread pool... 139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9762a0>... 139623815455936: After del. ``` The deadlock may occur randomly after a long time (it took me several hours to identify the cause). To immediately trigger the deadlock, modify `threading.py` by adding `gc.collect()` within the `_maintain_shutdown_locks()` function: ```python def _maintain_shutdown_locks(): """ Drop any shutdown locks that don't correspond to running threads anymore. Calling this from time to time avoids an ever-growing _shutdown_locks set when Thread objects are not joined explicitly. See bpo-37788. This must be called with _shutdown_locks_lock acquired. """ # If a lock was released, the corresponding thread has exited import gc gc.collect() to_remove = [lock for lock in _shutdown_locks if not lock.locked()] _shutdown_locks.difference_update(to_remove) ``` The call stack at the time of the deadlock (captured via `py-spy dump`) is as follows: ``` Thread 744194 (idle): "MainThread" wait (threading.py:355) wait (threading.py:655) start (threading.py:1001) <- Start a new thread (Dummy-1) _adjust_thread_count (concurrent/futures/thread.py:203) submit (concurrent/futures/thread.py:180) map (concurrent/futures/_base.py:608) map_with_error_catching (tvm/contrib/popen_pool.py:452) build (tvm/meta_schedule/builder/local_builder.py:164) method (tvm/meta_schedule/utils.py:76) build (tvm/meta_schedule/builder/builder.py:129) <module> (pool_bug.py:25) Thread 746704 (idle): "Dummy-1" _stop (threading.py:1106) <- try to acquire the lock again _wait_for_tstate_lock (threading.py:1173) join (threading.py:1151) shutdown (concurrent/futures/thread.py:239) __del__ (tvm/contrib/popen_pool.py:381) <- GC interrupts _maintain_shutdown_locks (threading.py:868) _set_tstate_lock (threading.py:1057) <- acquire the lock _bootstrap_inner (threading.py:1063) _bootstrap (threading.py:1034) <- Starting... ``` ### Triage * needs-triage -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
