tqchen commented on PR #11434: URL: https://github.com/apache/tvm/pull/11434#issuecomment-1136477607
Coming back with some diggings. My initial reaction after seeing the comments was yes it could indeed be the case where python cannot interrupt c++. Because it is indeed true that python signal handler function in c++ won't be caught until getting back to python interpreter. Here is the fun part, I end up trying to type up a solution via c++ multi-threading, then after writing test cases to confirm the behavior I realized that actually the original implementation might work fine. Here is why: python's multi-threading is backed by real system threads and they are guarded by GIL. So although one thread enters the FFI as a long running function, another thread(the watcher) can continue to run in python interpreter (as the GIL has been released by the long running function) without a problem, as a result the timeout signal back to the parent process will continue to function, then the parent process signals kill to the popen worker. The two test cases I created for the alternative solution end up shows the original impl is working as intended. https://github.com/apache/tvm/compare/main...tqchen:popen2 See the following busy counting function(in C++) as a proxy for long running c++ functions ```c++ TVM_REGISTER_GLOBAL("testing.busy_counting").set_body_typed([](double value, int repeat, int number) { double sum = 0.0; for (int i = 0; i < number; ++i) { for (int j = 0; j < repeat; ++j) { sum += value; } LOG(INFO) << "Finished counting for " << (i + 1) << " iter"; } }); ``` Running the following debug script yields: ```python from tvm.contrib.popen_pool import PopenWorker import tvm.testing._ffi_api proc = PopenWorker() proc.send(tvm.testing.busy_counting, [0.1, 1<<25, 10], timeout=100) proc.recv() proc.send(tvm.testing.busy_counting, [0.1, 1<<25, 100], timeout=0.2) proc.recv() ``` ``` [18:06:34] ffi_testing.cc:180: Finished counting for 1 iter [18:06:34] ffi_testing.cc:180: Finished counting for 2 iter [18:06:34] ffi_testing.cc:180: Finished counting for 3 iter [18:06:34] ffi_testing.cc:180: Finished counting for 4 iter [18:06:35] ffi_testing.cc:180: Finished counting for 5 iter [18:06:35] ffi_testing.cc:180: Finished counting for 6 iter [18:06:35] ffi_testing.cc:180: Finished counting for 7 iter [18:06:35] ffi_testing.cc:180: Finished counting for 8 iter [18:06:35] ffi_testing.cc:180: Finished counting for 9 iter [18:06:35] ffi_testing.cc:180: Finished counting for 10 iter ---------see here is the second iteration where timeout get triggered in c++ run---- [18:06:35] ffi_testing.cc:180: Finished counting for 1 iter [18:06:35] ffi_testing.cc:180: Finished counting for 2 iter Traceback (most recent call last): File "debug_popen.py", line 12, in <module> proc.recv() File "/home/tqchen/github/tvm/python/tvm/contrib/popen_pool.py", line 297, in recv raise TimeoutError() TimeoutError ``` So perhaps there is over-speculation in this case. @tkonolige if you encountered a repro case, it would be good to dig deeper and see if there are other reasons behind. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
