akarbown commented on issue #18255: URL: https://github.com/apache/incubator-mxnet/issues/18255#issuecomment-726718920
When I run the tests with the following command line:

```
python3 -m pytest -sv failing_test_case
```

I get the following output (the traceback is cut off where the MKL error interleaves with it):

```
File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_p
INTEL MKL ERROR: /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_avx512.so: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_avx512.so or libmkl_def.so.
```

I've checked that it's enough to `LD_PRELOAD` libmkl_rt.so to fix the issue. However, when I link libmkl_rt.so (or compile MXNet with `MKL_USE_SINGLE_DYNAMIC_LIBRARY=1`), I get the same problem as described in [#17641](https://github.com/apache/incubator-mxnet/issues/17641), caused by multiple OpenMP libraries being linked into MXNet. It seems to be a catch-22 situation. Assuming we compile MXNet with `MKL_USE_SINGLE_DYNAMIC_LIBRARY=1`, we hit the multiple-OpenMP problem, which could probably be worked around with the following solutions:

1. Setting `KMP_DUPLICATE_LIB_OK=TRUE` helps when running each test separately, but not when running all of the unit tests in a row (it results in a hang => see conclusions).
2. I tried `MKL_THREADING_LAYER`, and the test passes for `tbb` and `sequential` but not for `intel` or `GNU` (because it couldn't find libgomp.so and fell back to libiomp5.so).
3. Compiling MXNet with `USE_OPENMP=0` would make the unit tests run close to "forever" (I assume it's not under consideration).
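The duplicate-OpenMP diagnosis above can be checked directly from Python. This is a small diagnostic sketch of my own (not part of MXNet): it scans `/proc/self/maps` for OpenMP runtimes mapped into the current process; seeing both libgomp and libiomp5 in the result is the duplicate-runtime situation that `KMP_DUPLICATE_LIB_OK=TRUE` papers over.

```python
import re

def loaded_omp_runtimes(maps_path="/proc/self/maps"):
    """Return the set of OpenMP runtime libraries mapped into this process.

    More than one entry here (e.g. libgomp + libiomp5) means two OpenMP
    runtimes are loaded at once, which is the situation behind #17641.
    Only works on Linux, where /proc/self/maps is available.
    """
    pattern = re.compile(r"(libgomp|libiomp5|libomp)[^/\s]*\.so")
    found = set()
    try:
        with open(maps_path) as f:
            for line in f:
                m = pattern.search(line)
                if m:
                    found.add(m.group(0))
    except FileNotFoundError:  # non-Linux: no /proc maps file
        pass
    return found
```

Importing mxnet first and then calling `loaded_omp_runtimes()` would show which runtimes a given build actually pulls in.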
Finally, I'm checking the following solution, which requires either adding the line below to CMakeLists.txt or compiling with `-DUSE_BLAS=MKL` (uppercase), because `STREQUAL` is case sensitive:

```diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 07075d752..1555f3f40 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -411,6 +411,7 @@ if(USE_OPENMP)
      AND SYSTEM_ARCHITECTURE STREQUAL "x86_64"
      AND NOT CMAKE_BUILD_TYPE STREQUAL "Distribution"
      AND NOT BLAS STREQUAL "MKL"
+     AND NOT BLAS STREQUAL "mkl"
      AND NOT MSVC
      AND NOT CMAKE_CROSSCOMPILING)
    load_omp()
```

With that change, all of the ~20 tests that were failing pass without any issues (except for one test case: `test_optimizer.py::test_lamb`). To be more precise, MXNet is compiled with the following command line:

```
cmake -GNinja -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=1 -DUSE_MKLDNN=1 -DMKL_USE_STATIC_LIBS=0 -DUSE_CUDA=0 -DUSE_BLAS=mkl -DUSE_OPENMP=0 -DCMAKE_BUILD_TYPE=Debug ..; ninja
```

Conclusions:
- While running all the unit tests in a row with the command line `../runtime_functions.sh unittest_ubuntu_python3_cpu`, the run hangs after 84% of the tests have executed (looks like a deadlock or race condition, but I need more time to investigate the issue).
- The failing test cases (the ~19 tests that I found as reproductions of the issue) pass when run one by one.
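Running each failing test in its own process with a chosen threading layer, as in workarounds 1–2 above, can be scripted. This is a hedged sketch (the helper name and structure are mine); it only assumes `pytest` is invoked the way the command line at the top of this comment does. A fresh subprocess matters because `MKL_THREADING_LAYER` is only honoured if it is set before libmkl is first loaded:

```python
import os
import subprocess
import sys

def run_with_env(cmd, env_overrides=None):
    """Run a command in a fresh process with extra environment variables.

    Exporting MKL_THREADING_LAYER into an interpreter that has already
    imported mxnet has no effect, so each test gets its own process.
    """
    env = os.environ.copy()
    env.update(env_overrides or {})
    return subprocess.run(cmd, env=env).returncode

# Hypothetical usage, one failing test at a time (test path is a placeholder):
# rc = run_with_env(
#     [sys.executable, "-m", "pytest", "-sv", "failing_test_case"],
#     {"MKL_THREADING_LAYER": "sequential"},
# )
```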
- Moreover, I've also found some symbol issues while loading libiomp5.so:

```
error: symbol lookup error: undefined symbol: ompt_start_tool
error: symbol lookup error: undefined symbol: scalable_malloc (fatal)
```

- I've also measured the unit-test run times (just to compare). Command line: `../runtime_functions.sh unittest_ubuntu_python3_cpu` with the following options:

| Options | Time |
| -- | -- |
| `LD_PRELOAD=<path_to_the_library>/libmkl_rt.so` | real 34m2.185s |
| `MKL_THREADING_LAYER=sequential` | real 33m19.842s |
| `MKL_THREADING_LAYER=tbb` | real 33m35.528s |
| `MKL_THREADING_LAYER=sequential KMP_HW_SUBSET=64c,1t` | real 25m26.099s |
| `MKL_THREADING_LAYER=intel` | hangs |
| `KMP_DUPLICATE_LIB_OK=TRUE` | hangs |
| Without any extraordinary options | hangs |

Now I want to concentrate on root-causing the hang issue.
