tqchen commented on PR #11434:
URL: https://github.com/apache/tvm/pull/11434#issuecomment-1136477607

   Coming back with some diggings.
   
   My initial reaction after seeing the comments was yes it could indeed be the 
case where python cannot interrupt c++. Because it is indeed true that python 
signal handler function in c++ won't be caught until getting back to python 
interpreter.
   
   Here is the fun part, I end up trying to type up a solution via c++ 
multi-threading, then after writing test cases to confirm the behavior I 
realized that actually the original implementation might work fine.
   
   Here is why: python's multi-threading is backed by real system threads and 
they are guarded by GIL. So although one thread enters the FFI as a long 
running function, another thread(the watcher) can continue to run in python 
interpreter (as the GIL has been released by the long running function) without 
a problem, as a result the timeout signal back to the parent process will 
continue to function, then the parent process signals kill to the popen worker.
   
   The two test cases I created for the alternative solution end up shows the 
original impl is working as intended.
   
   https://github.com/apache/tvm/compare/main...tqchen:popen2
   
   See the following busy counting function(in C++) as a proxy for long running 
c++ functions
   ```c++
   TVM_REGISTER_GLOBAL("testing.busy_counting").set_body_typed([](double value, 
int repeat, int number) {
     double sum = 0.0;
     for (int i = 0; i < number; ++i) {
       for (int j = 0; j < repeat; ++j) {
         sum += value;
       }
       LOG(INFO) << "Finished counting for " <<  (i + 1)  << " iter";
     }
   });
   ```
   
   Running the following debug script yields:
   ```python
   from tvm.contrib.popen_pool import PopenWorker
   import tvm.testing._ffi_api
   
   proc = PopenWorker()
   
   proc.send(tvm.testing.busy_counting, [0.1, 1<<25, 10], timeout=100)
   proc.recv()
   
   
   proc.send(tvm.testing.busy_counting, [0.1, 1<<25, 100], timeout=0.2)
   proc.recv()
   ```
   
   ```
   [18:06:34] ffi_testing.cc:180: Finished counting for 1 iter
   [18:06:34] ffi_testing.cc:180: Finished counting for 2 iter
   [18:06:34] ffi_testing.cc:180: Finished counting for 3 iter
   [18:06:34] ffi_testing.cc:180: Finished counting for 4 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 5 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 6 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 7 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 8 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 9 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 10 iter
   ---------see here is the second iteration where timeout get triggered in c++ 
run----
   [18:06:35] ffi_testing.cc:180: Finished counting for 1 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 2 iter
   Traceback (most recent call last):
     File "debug_popen.py", line 12, in <module>
       proc.recv()
     File "/home/tqchen/github/tvm/python/tvm/contrib/popen_pool.py", line 297, 
in recv
       raise TimeoutError()
   TimeoutError
   
   ```
   
   
   So perhaps there is over-speculation in this case. @tkonolige if you 
encountered a repro case, it would be good to dig deeper and see if there are 
other reasons behind.
   
   
    
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to