akarbown commented on issue #18255:
URL: 
https://github.com/apache/incubator-mxnet/issues/18255#issuecomment-726718920


   When I run the tests with the following command line: ```python3 -m pytest -sv 
failing_test_case```, I get the following output:
   
   ```
   File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pINTEL MKL ERROR: 
/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_avx512.so:
 undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
   Intel MKL FATAL ERROR: Cannot load libmkl_avx512.so or libmkl_def.so.
   ```
   
   I've checked that preloading libmkl_rt.so with LD_PRELOAD is enough to fix 
the issue. However, when I link against libmkl_rt.so (or compile MXNet with 
MKL_USE_SINGLE_DYNAMIC_LIBRARY=1), I hit the same problem as described in 
[#17641](https://github.com/apache/incubator-mxnet/issues/17641), which is 
caused by multiple OpenMP libraries being linked into MXNet. It looks like a 
catch-22 situation.
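
One way to confirm that more than one OpenMP runtime ends up in the same process is to look at the shared objects mapped into it. The following is a minimal sketch, assuming Linux (`/proc/<pid>/maps`) and assuming the runtimes use their usual file names (libiomp5, libgomp, libomp). The helper name `loaded_openmp_runtimes` is hypothetical and not part of MXNet.

```python
import os
import re

# Usual file names of OpenMP runtimes: libiomp5 (Intel), libgomp (GNU),
# libomp (LLVM). The optional "-<hex>" part covers renamed copies that some
# binary wheels ship. This list is an assumption, not an exhaustive one.
_OPENMP_NAME = re.compile(r"^(libiomp5|libgomp|libomp)(-[0-9a-f]+)?\.so(\.\d+)*$")


def loaded_openmp_runtimes(maps_text):
    """Return the sorted, de-duplicated OpenMP runtime file names found in
    the text of a /proc/<pid>/maps file."""
    found = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping, no backing file
        name = os.path.basename(parts[5].strip())
        if _OPENMP_NAME.match(name):
            found.add(name)
    return sorted(found)


# Typical use (Linux only), after the library under test has been imported:
#     with open("/proc/self/maps") as f:
#         print(loaded_openmp_runtimes(f.read()))
```

If the returned list contains more than one entry (for example both libiomp5.so and libgomp.so.1), the process has several OpenMP runtimes loaded, which is the situation described in #17641.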
   
   Assuming that we compile MXNet with MKL_USE_SINGLE_DYNAMIC_LIBRARY=1, we 
run into the problem of multiple linked OpenMP libraries, which could probably 
be worked around in one of the following ways:
        1. Setting KMP_DUPLICATE_LIB_OK=TRUE helps when running each 
test separately, but not when running all of the unit tests in a row (that 
results in a hang, see the conclusions).
        2. I tried MKL_THREADING_LAYER: the test passes for 'tbb' and 
'sequential' but not for 'intel' or 'GNU' (because it couldn't find 
libgomp.so and fell back to libiomp5.so).
        3. Compiling MXNet with USE_OPENMP=0 would probably make the unit 
tests run "forever" (I assume this is not under consideration).
        4. Finally, I'm checking the following solution, which requires adding 
the following line to CMakeLists.txt (or compiling with ```-DUSE_BLAS=MKL```, 
because STREQUAL is case sensitive):
   ```diff
   diff --git a/CMakeLists.txt b/CMakeLists.txt
   index 07075d752..1555f3f40 100644
   --- a/CMakeLists.txt
   +++ b/CMakeLists.txt
   @@ -411,6 +411,7 @@ if(USE_OPENMP)
         AND SYSTEM_ARCHITECTURE STREQUAL "x86_64"
         AND NOT CMAKE_BUILD_TYPE STREQUAL "Distribution"
         AND NOT BLAS STREQUAL "MKL"
   +     AND NOT BLAS STREQUAL "mkl"
         AND NOT MSVC
         AND NOT CMAKE_CROSSCOMPILING)
        load_omp()
   ```
   
   All of the ~20 tests that were failing now pass without any issues (except 
for one test case: test_optimizer.py::test_lamb). To be more precise, MXNet was 
compiled with the following command line: 
   ```cmake -GNinja -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=1 -DUSE_MKLDNN=1 
-DMKL_USE_STATIC_LIBS=0 -DUSE_CUDA=0 -DUSE_BLAS=mkl -DUSE_OPENMP=0 
-DCMAKE_BUILD_TYPE=Debug ..; ninja```
        
   Conclusions:
   - When running all the unit tests in a row with the command line 
```../runtime_functions.sh unittest_ubuntu_python3_cpu```, the run hangs after 
84% of the tests have executed (it looks like a deadlock or race condition, but 
I need more time to investigate). 
   - The failing test cases (the ~19 tests that I found as reproductions of 
the issue) pass when run one by one.
   - Moreover, I've also found some symbol issues while loading libiomp5.so:
        ```
         error: symbol lookup error: undefined symbol: ompt_start_tool
         error: symbol lookup error: undefined symbol: scalable_malloc (fatal)
        ```
   - I've also measured unit test run times (just to compare):
   
   Command line: ../runtime_functions.sh unittest_ubuntu_python3_cpu with the following options | Time
   -- | --
   LD_PRELOAD=<path_to_the_library>/libmkl_rt.so | real    34m2.185s
   MKL_THREADING_LAYER=sequential | real    33m19.842s
   MKL_THREADING_LAYER=tbb | real    33m35.528s
   MKL_THREADING_LAYER=sequential  KMP_HW_SUBSET=64c,1t | real    25m26.099s
   MKL_THREADING_LAYER=intel | hangs
   KMP_DUPLICATE_LIB_OK=TRUE | hangs
   Without any extraordinary options | hangs
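
A comparison like the one in the table above can be automated so that a hanging configuration is reported instead of blocking the whole run. This is only a sketch: the helper names `time_command` and `compare_configs` are hypothetical, and the timeout is an assumed upper bound that the caller has to choose.

```python
import os
import subprocess
import time


def time_command(cmd, extra_env, timeout_s):
    """Run cmd with extra_env added on top of the current environment.

    Returns the elapsed wall-clock time in seconds, or None if the command
    did not finish within timeout_s (treated here as a hang).
    """
    env = dict(os.environ)
    env.update(extra_env)
    start = time.monotonic()
    try:
        subprocess.run(cmd, env=env, timeout=timeout_s,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return time.monotonic() - start


def compare_configs(cmd, configs, timeout_s):
    """Time cmd once per named environment configuration."""
    return {name: time_command(cmd, env, timeout_s)
            for name, env in configs.items()}
```

For example, `compare_configs(["../runtime_functions.sh", "unittest_ubuntu_python3_cpu"], {"sequential": {"MKL_THREADING_LAYER": "sequential"}, "tbb": {"MKL_THREADING_LAYER": "tbb"}, "default": {}}, timeout_s=3600)` would return a time for each configuration that completes and `None` for each one that exceeds the (assumed) one-hour limit.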
   
   
        
   Now I want to concentrate on root causing the hang issue.
   
   




