## Problem 1: TVMOp doesn't work well with GPU builds 
[#17840](https://github.com/apache/incubator-mxnet/issues/17840)

### The error message:

```python
>>> import mxnet as mx
>>> x = mx.np.array([[0, 1], [1, 1], [2, 2]], ctx=mx.gpu())
>>> idx = x < 2
>>> x[idx]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/Documents/mxnet/python/mxnet/numpy/multiarray.py", line 
1013, in __lt__
    return less(self, other)
  File "/home/ubuntu/Documents/mxnet/python/mxnet/numpy/multiarray.py", line 
8672, in less
    return _mx_nd_np.less(x1, x2, out)
  File "/home/ubuntu/Documents/mxnet/python/mxnet/ndarray/numpy/_op.py", line 
6869, in less
    return _api_internal.less(x1, x2, out)
  File "/home/ubuntu/Documents/mxnet/python/mxnet/_ffi/_ctypes/function.py", 
line 115, in __call__
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../3rdparty/tvm/src/runtime/module.cc", line 125
  File "../3rdparty/tvm/src/runtime/library_module.cc", line 94
TVMError: Check failed: ret == 0 (-1 vs. 0) : Check failed: f != nullptr: 
Cannot find function less_scalar_gpufloat32_2bool_2_kernel0 in the imported 
modules or global registry
```

### Root cause:

In [mxnet/contrib/tvmop/compile.py](https://github.com/apache/incubator-mxnet/blob/master/contrib/tvmop/compile.py), only `func_binary` (the llvm `Module`) is saved into `libtvmop.so`; its `imported_modules` (the cuda `Module`) are not saved. TVM therefore cannot import any GPU functions and cannot find `less_scalar_gpufloat32_2bool_2_kernel0`.

```python
(Pdb) func_binary
Module(llvm, 55d7ce519d48)
(Pdb) func_binary.imported_modules[0]
Module(cuda, 55d7c7a09818)
```
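
For illustration, a hedged sketch of the save path (TVM's `Module.save` persists only the module itself, not its imported device modules; the exact call in compile.py may differ):

```python
# Hedged sketch -- Module.save() writes only this module's own code, not the
# modules it imports, so the cuda kernels are silently dropped.
func_binary.save("libtvmop.o")  # llvm (host) module only
# func_binary.imported_modules[0] -- the cuda Module holding
# less_scalar_gpufloat32_2bool_2_kernel0 -- is never written to disk.
```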

### Solution ([GitHub PR](https://github.com/apache/incubator-mxnet/pull/18678)):

* Save `imported_modules[0]` to `libtvmop.cubin`.
* Define an `Import` function (using `TVMOpModule->Import`).
* Import the `cubin_module` into the `global_module` (a Python-level sketch of these steps follows the outputs below).
* Outputs:

```python
>>> import mxnet as mx
>>> x = mx.np.array([[0, 1], [1, 1], [2, 2]], ctx=mx.gpu())
[10:19:41] ../src/base.cc:80: cuDNN lib mismatch: linked-against version 7605 != compiled-against version 7501. Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[10:19:41] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
>>> idx = x < 2
>>> x[idx]
array([0., 1., 1., 1.], ctx=gpu(0))
```
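
Putting the three steps together, here is a hedged Python-level sketch of the fix (`func_binary` is the module from the pdb session above, `out_dir` is an illustrative output directory, and `Module.import_module` is the Python counterpart of the C++ `TVMOpModule->Import` path; older vendored TVMs spell `tvm.runtime.load_module` as `tvm.module.load`):

```python
import os
import tvm

out_dir = "build"  # illustrative path, not the actual CI layout

# Step 1 (compile time, contrib/tvmop/compile.py): save the device module too.
func_binary.save(os.path.join(out_dir, "libtvmop.o"))  # host (llvm) module
func_binary.imported_modules[0].save(
    os.path.join(out_dir, "libtvmop.cubin"))           # device (cuda) module

# Steps 2-3 (load time): import the cubin module into the global module so
# kernels such as less_scalar_gpufloat32_2bool_2_kernel0 can be resolved.
global_module = tvm.runtime.load_module(os.path.join(out_dir, "libtvmop.so"))
cubin_module = tvm.runtime.load_module(os.path.join(out_dir, "libtvmop.cubin"))
global_module.import_module(cubin_module)
```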

## Problem 2: CI checks: libcuda.so does not exist on the machine that builds mxnet

### The error message:

When running 
[unix-gpu](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18526/18/pipeline)
 checks:

```python
[2020-06-18T08:26:18.355Z] Traceback (most recent call last):
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/contrib/tvmop/compile.py", line 20, in <module>
[2020-06-18T08:26:18.355Z]     import tvm
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/__init__.py", line 27, in <module>
[2020-06-18T08:26:18.355Z]     from . import tensor
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/tensor.py", line 20, in <module>
[2020-06-18T08:26:18.355Z]     from ._ffi.object import Object, register_object, ObjectGeneric, \
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/_ffi/object.py", line 24, in <module>
[2020-06-18T08:26:18.355Z]     from .base import _FFI_MODE, _RUNTIME_ONLY, check_call, _LIB, c_str
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/_ffi/base.py", line 65, in <module>
[2020-06-18T08:26:18.355Z]     _LIB, _LIB_NAME = _load_lib()
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/_ffi/base.py", line 57, in _load_lib
[2020-06-18T08:26:18.355Z]     lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
[2020-06-18T08:26:18.355Z]   File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
[2020-06-18T08:26:18.355Z]     self._handle = _dlopen(self._name, mode)
[2020-06-18T08:26:18.355Z] OSError: libcuda.so.1: cannot open shared object file: No such file or directory
```

### Root cause:

The unix-gpu machine that **builds** mxnet does not have `libcuda.so` (the driver library ships with the NVIDIA driver, not the CUDA toolkit, so driver-less build instances lack it).

### Solution 1:

Link `libtvm.so` against the stub `libcuda.so` on the machines that **build** the CI checks (the CUDA toolkit ships such a stub, e.g. under `cuda/lib64/stubs/`, precisely for linking on driver-less machines).

### Solution 1 Pros/Cons/Workloads:

* **Pros**: Solves the issue easily.
* **Cons**: Works against the ongoing effort to remove the `libcuda.so` dependency entirely (it would be great if someone could elaborate on the motivation behind that effort).
* **Workloads**: ~1 week

### Solution 2 (Possible) ([GitHub PR](https://github.com/jinboci/incubator-tvm/pull/2)):

TVM links `libcuda.so` because it invokes the CUDA driver API at runtime. These functions are never executed at compile time, however, so they can be removed for a compile-only build.
I have made a prototype that removes the `libcuda.so` linkage from `libtvm.so`:

* Set `target_link_libraries` for tvm and tvm_runtime differently (CMakeLists.txt).
* Add a variable `CUDA_COMPILE_ONLY`; setting it to `ON` means "build libtvm.so without libcuda.so" (CMakeLists.txt).
* When `CUDA_COMPILE_ONLY` is `ON`, add the compile definition `-DCUDA_COMPILE_ONLY` (CMakeLists.txt).
* When `CUDA_COMPILE_ONLY` is defined (i.e. when compiling libtvm.so), ignore all cuXXX CUDA driver API calls (cmake/modules/CUDA.cmake, src/runtime/cuda/cuda_common.h, src/runtime/cuda/cuda_device_api.cc, src/runtime/cuda/cuda_module.cc).

### Solution 2 Pros/Cons/Workloads:

* **Pros**: Does not depend on `libcuda.so`.

* **Cons**: Even after unlinking `libcuda.so` from `libtvm.so`, the CI checks still fail. The [GPU CUDA RTC](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18678/8/pipeline) stage (with `-DUSE_TVM_OP=ON`) outputs the error message:

```python
[2020-07-13T09:30:36.870Z] /usr/bin/ld: warning: libcuda.so.1, needed by /work/build/3rdparty/tvm/libtvm_runtime.so, not found (try using -rpath or -rpath-link)
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuMemsetD32_v2'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleLoadData'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuLaunchKernel'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleGetFunction'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleUnload'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuGetErrorName'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuDeviceGetName'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleGetGlobal_v2'
[2020-07-13T09:30:36.870Z] collect2: error: ld returned 1 exit status
[2020-07-13T09:30:36.870Z] ninja: build stopped: subcommand failed.
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,154 - root - INFO - Waiting for status of container e95e5c4ca642 for 600 s.
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,377 - root - INFO - Container exit status: {'Error': None, 'StatusCode': 1}
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,377 - root - ERROR - Container exited with an error 😞
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,377 - root - INFO - Executed command for reproduction:
[2020-07-13T09:30:36.870Z]
```

It seems unlinking `libcuda.so` from `libtvm.so` is not enough: we must also unlink `libcuda.so` from `libtvm_runtime.so`. However, `libtvm_runtime` does require `libcuda.so` at runtime; if we remove the linkage on the build instances, `tvm_runtime` cannot run on the test instances.
To fully address the problem, we have two options:

1. Build two versions of tvm: one that links `libcuda.so`, used for compiling the tvm operators, and another that does not link `libcuda.so`, which is transferred to the test instances for the tvmop tests.
2. Do `dlopen("libcuda.so")` inside tvm, resolving the driver symbols lazily at runtime (see the ctypes sketch after the workload estimates below).

* **Workloads**:
    * Option 1: ~1 week to modify tvm, plus 1.5-2 weeks to modify CI.
    * Option 2: ~2 weeks to modify tvm.
    * Both options require major surgery on TVM, so contributing the changes back to upstream tvm might be difficult. Moreover, there is a risk that the tvm community pushes back on our suggestion. Even if they agree, it might cost ~2 weeks to upstream the changes, and another ~1.5 weeks to sync mxnet's tvm with the updated apache/tvm.
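
To make option 2 concrete, here is a hedged Python/ctypes analogue of the `dlopen` idea (inside tvm this would be C++ `dlopen`/`dlsym`; the point is that `libcuda.so.1` is resolved at first use rather than at link/load time, so build instances without the driver never touch it):

```python
import ctypes

class LazyCuda:
    """Resolve the CUDA driver library on first use, not at import time."""
    _lib = None

    @classmethod
    def _get(cls):
        if cls._lib is None:
            # Only fails here, on the first actual driver call, so machines
            # without libcuda.so can still load (the analogue of) libtvm.so.
            cls._lib = ctypes.CDLL("libcuda.so.1", mode=ctypes.RTLD_GLOBAL)
        return cls._lib

    @classmethod
    def cuInit(cls, flags=0):
        # CUresult cuInit(unsigned int Flags)
        return cls._get().cuInit(ctypes.c_uint(flags))
```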

## Proposal:

Given that it might take another 4-6 weeks to fully address the CI problem, we propose to:

* Submit the fix for problem 1.
* Link the stub `libcuda.so` and enable one instance for testing. If this is not acceptable, can we keep the tvmop CI disabled for now, as it is not an essential component?
* Open an issue/RFC in MXNet and TVM to track the remaining problems.

* * *

## Comments:

Also, when setting `-DUSE_TVM_OP=OFF`, [the CI checks get stuck](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18678/10/pipeline). The GPU CUDA RTC output looks like:

```python
[2020-07-13T18:04:12.876Z] + md5sum build/3rdparty/tvm/libtvm_runtime.so
[2020-07-13T18:04:12.876Z] md5sum: build/3rdparty/tvm/libtvm_runtime.so: No such file or directory
[2020-07-13T18:04:12.876Z] + ls -lh build/3rdparty/tvm/libtvm_runtime.so
[2020-07-13T18:04:12.876Z] ls: cannot access 'build/3rdparty/tvm/libtvm_runtime.so': No such file or directory
[2020-07-13T18:04:12.876Z] + md5sum build/libtvmop.so
[2020-07-13T18:04:12.876Z] md5sum: build/libtvmop.so: No such file or directory
[2020-07-13T18:04:12.876Z] + ls -lh build/libtvmop.so
[2020-07-13T18:04:12.876Z] ls: cannot access 'build/libtvmop.so': No such file or directory
[2020-07-13T18:04:12.876Z] + md5sum build/tvmop.conf
[2020-07-13T18:04:12.876Z] md5sum: build/tvmop.conf: No such file or directory
[2020-07-13T18:04:12.876Z] + ls -lh build/tvmop.conf
[2020-07-13T18:04:12.876Z] ls: cannot access 'build/tvmop.conf': No such file or directory
[2020-07-13T18:04:12.876Z] + md5sum build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] 66aa8c8a37ffaaa9692ae98bda88491c  build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] + ls -lh build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] -rwxr-xr-x 1 jenkins_slave jenkins_slave 34M Jul 13 18:04 build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] + md5sum build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] 819a0c986ae9e233b0a9525e71c906d9  build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] + ls -lh build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] -rwxr-xr-x 1 jenkins_slave jenkins_slave 1.1M Jul 13 17:56 build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] + return 0
```

