khaotik opened a new issue #20702:
URL: https://github.com/apache/incubator-mxnet/issues/20702
## Description
This looks like a variant of #8029 and #16736, and it appears to be related to `_FusedOp` on GPU.
### Error Message
```text
[23:52:58] /home/khaotik/WKSP/dev/incubator-mxnet/src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for GPU
[23:52:58] /home/khaotik/WKSP/dev/incubator-mxnet/src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for CPU
Traceback (most recent call last):
  File "test.py", line 91, in <module>
    v_y = fn.forward(v_x)
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/gluon/block.py", line 1821, in forward
    return self._call_cached_op(x, *args)
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/gluon/block.py", line 1267, in _call_cached_op
    out = self._cached_op(*cargs)
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/_ctypes/cached_op.py", line 126, in __call__
    check_call(_LIB.MXInvokeCachedOp(
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  [bt] (6) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(MXInvokeCachedOp+0x21b) [0x7fa4638b05bb]
  [bt] (5) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::CachedOp::Forward(std::shared_ptr<mxnet::CachedOp> const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::Context const&)+0x356) [0x7fa463a05f86]
  [bt] (4) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::CachedOp::GetCachedOpState(mxnet::Context const&)+0x142) [0x7fa4639fbca2]
  [bt] (3) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::CachedOp::CachedOpState::CachedOpState(mxnet::Context const&, nnvm::Graph const&, nnvm::Graph const&, bool)+0x1220) [0x7fa463a1f590]
  [bt] (2) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(nnvm::Graph::indexed_graph() const+0x3b) [0x7fa46fb33c8b]
  [bt] (1) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(nnvm::IndexedGraph::IndexedGraph(nnvm::Graph const&)+0x502) [0x7fa46fb31b72]
  [bt] (0) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(+0xe485022) [0x7fa46fb31022]
  File "/home/khaotik/WKSP/dev/incubator-mxnet/3rdparty/tvm/nnvm/src/core/graph.cc", line 107
MXNetError: Check failed: it != node2index_.end(): control dep not found in graph
```
## To Reproduce
Here's a simplified version of my original model that raised this error.
```python
import mxnet as mx
import mxnet.symbol as ms


def _tensorAt(s_inp, i:int):
    '''symbolic s[i]'''
    return ms.squeeze(ms.slice(s_inp, i, i+1), axis=0)


def _splineInterpolation(t, cv_list, degree=3):
    '''
    Simplified NURBS interpolation

    Returns:
        a weighted linear combination from a slice of `cv_list` up to `degree+1` terms
        weights are positive and sum to 1.0
    Args:
        t: float
            t <= -1 -> gives the first element of cv_list
            t >= len(cv_list) -> gives the last element of cv_list
        cv_list: list
            list of items that can be added or multiplied by a float number
            both symbolic and numeric values would work
        degree: int
            must be positive
    Example:
        >>> _splineInterpolation(t=-1., cv_list=[a,b,c,d,e,f])
        a
        >>> _splineInterpolation(t=-0.4, cv_list=[a,b,c,d,e,f])
        0.964*a + 0.036*b
        >>> _splineInterpolation(t=2.6, cv_list=[a,b,c,d,e,f])
        0.010666*b + 0.414666*c + 0.538666*d + 0.036*e
    '''
    from math import floor
    assert(isinstance(degree,int) and degree > 0)
    num_cv = len(cv_list)
    assert(num_cv > 0)
    mt = t % 1.
    ct = 1. - mt
    coeff_list = [mt, ct]
    # raise the basis one degree at a time
    for i in range(1,degree):
        wgt = [(mt+j) / (i+1) for j in range(i+1)]
        new_coeff_list = [0.] * (i+2)
        new_coeff_list[0] = coeff_list[0] * wgt[0]
        new_coeff_list[-1] = coeff_list[-1] * (1.-wgt[-1])
        for j in range(1,i+1):
            new_coeff_list[j] = coeff_list[j]*wgt[j] + coeff_list[j-1]*(1.-wgt[j-1])
        coeff_list = new_coeff_list
    # map coefficients to clamped knot indices, merging duplicates
    coeff_di = dict()
    for i in range(degree+1):
        knot_idx = min(num_cv-1, max(0, floor(t-(degree>>1)+i)))
        if knot_idx in coeff_di:
            coeff_di[knot_idx] += coeff_list[degree-i]
        else:
            coeff = coeff_list[degree-i]
            if coeff > 0.:
                coeff_di[knot_idx] = coeff
    # make sure weight sum is 1, normalize numeric error
    scale = 1./sum(coeff_di.values())
    for k in coeff_di:
        coeff_di[k] *= scale
    if len(coeff_di) > 1:
        # NOTE: enumerate(coeff_di) yields (position, knot index) pairs, so `c`
        # is a knot index rather than a weight (coeff_di.items() would give the
        # weights). Left as reported so the graph matches the failing one.
        return sum((cv_list[i]*c if c!=1.0 else cv_list[i]) for i,c in enumerate(coeff_di))
    else:
        return cv_list[list(coeff_di.keys())[0]]


# constants
BATCH_SIZE=2
NDIM = 16
CTX = mx.gpu()
PARAM_DEPTH=8
RESNET_DEPTH=16

# build/bind symbolic model
s_x = ms.var('x', shape=(BATCH_SIZE,NDIM,), dtype='float32')
s_w0 = ms.var('w0', shape=(PARAM_DEPTH,NDIM,NDIM))
s_w0_li = [_tensorAt(s_w0,i) for i in range(PARAM_DEPTH)]
s_mid = s_x
for i in range(RESNET_DEPTH):
    spline_t = (i*(PARAM_DEPTH+1))/(RESNET_DEPTH-1) - 1.
    s_res = s_mid
    s_res = ms.dot(s_res, _splineInterpolation(spline_t, s_w0_li))
    s_res = ms.relu(s_res)
    s_mid = s_mid + s_res
s_y = s_mid
fn = mx.gluon.SymbolBlock((s_y,),(s_x,))
fn.initialize(ctx=CTX)

# run
scale = NDIM**(-0.5)
v_x = mx.nd.random_uniform(-scale,scale, shape=(BATCH_SIZE,NDIM), ctx=CTX)
v_y = fn.forward(v_x)
```
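For reference, the weight computation described in the docstring can be checked without MXNet. The following is a minimal sketch that follows the same recurrence as the script but returns only the weights; `spline_weights` is a hypothetical helper name that is not part of the original script, and it returns a `{knot index: weight}` dict instead of the combined symbol.

```python
from math import floor


def spline_weights(t, num_cv, degree=3):
    """Return {knot index: weight} using the recurrence from the script above.

    Hypothetical helper for illustration only; not part of the original report.
    """
    assert isinstance(degree, int) and degree > 0
    assert num_cv > 0
    mt = t % 1.0
    coeffs = [mt, 1.0 - mt]
    # Raise the basis one degree at a time.
    for i in range(1, degree):
        wgt = [(mt + j) / (i + 1) for j in range(i + 1)]
        new = [0.0] * (i + 2)
        new[0] = coeffs[0] * wgt[0]
        new[-1] = coeffs[-1] * (1.0 - wgt[-1])
        for j in range(1, i + 1):
            new[j] = coeffs[j] * wgt[j] + coeffs[j - 1] * (1.0 - wgt[j - 1])
        coeffs = new
    # Map coefficients to clamped knot indices, merging duplicates.
    weights = {}
    for i in range(degree + 1):
        knot = min(num_cv - 1, max(0, floor(t - (degree >> 1) + i)))
        c = coeffs[degree - i]
        if knot in weights:
            weights[knot] += c
        elif c > 0.0:
            weights[knot] = c
    # Normalize so the weights sum to 1 despite rounding error.
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


if __name__ == "__main__":
    for t in (-1.0, -0.4, 2.6):
        print(t, spline_weights(t, num_cv=6))
```

For `t=-0.4` with six control values this gives weights of about 0.964 on index 0 and 0.036 on index 1, which agrees with the docstring example.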
### Steps to reproduce
Run the script above on a GPU with the current master branch (5d247f13fcf5e55b094c5deb90ede1d3a03cc9ac).
## What have you tried to solve it?
- The code works with the environment variable `MXNET_USE_FUSION=0` set
- The code works when using `CTX = mx.cpu()` instead
- The code works with a smaller depth constant in the Python code, `RESNET_DEPTH=6`
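To compare the fused and unfused runs side by side, the script can be launched in a fresh interpreter for each setting. This is a minimal sketch using only the standard library; `run_with_fusion` is a hypothetical helper name, and it assumes the reproduction script is saved as `test.py` (the name shown in the traceback).

```python
import os
import subprocess
import sys


def run_with_fusion(script_path, fusion_enabled):
    """Run a script in a new Python process with MXNET_USE_FUSION set to 1 or 0.

    Hypothetical helper for illustration; the variable is set in the child's
    environment so it is in place before MXNet is imported there.
    """
    env = dict(os.environ)
    env["MXNET_USE_FUSION"] = "1" if fusion_enabled else "0"
    return subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )


# Example usage (assumes test.py is in the current directory):
#
#     for enabled in (True, False):
#         result = run_with_fusion("test.py", enabled)
#         print("fusion on" if enabled else "fusion off", "exit code", result.returncode)
#         print(result.stderr[-500:])
```

A non-zero exit code only with fusion enabled would be consistent with the first observation in the list above.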
## Environment
- Single GTX 1080 Ti GPU on Ubuntu 20.04
- Anaconda Python 3.8
- MXNet built from source from the current master branch (5d247f13fcf5e55b094c5deb90ede1d3a03cc9ac)
***We recommend using our script for collecting the diagnostic information with the following command***
`curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3`
<details>
<summary>Environment Information</summary>
```
----------Python Info----------
Version : 3.8.3
Compiler : GCC 7.3.0
Build : ('default', 'Jul 2 2020 16:21:59')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 21.0.1
Directory : /home/khaotik/anaconda3/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version : 2.0.0
Directory : /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet
Num GPUs : 1
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-5.11.0-40-generic-x86_64-with-glibc2.10
system : Linux
node : KKST2
release : 5.11.0-40-generic
version : #44~20.04.1-Ubuntu SMP Wed Oct 20 19:04:34 UTC 2021
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
Stepping: 10
CPU MHz: 4116.771
CPU max MHz: 4300.0000
CPU min MHz: 800.0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 1.5 MiB
L3 cache: 9 MiB
NUMA node0 CPU(s): 0-5
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX vulnerable, SMT disabled
Vulnerability Mds: Vulnerable; SMT disabled
Vulnerability Meltdown: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds: Vulnerable
Vulnerability Tsx async abort: Vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0001 sec, LOAD: 1.0473 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0009 sec, LOAD: 0.8119 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0003 sec, LOAD: 3.1800 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0010 sec, LOAD: 0.6069 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0002 sec, LOAD: 0.6927 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0003 sec, LOAD: 1.9676 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0009 sec, LOAD: 1.5478 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0009582042694091797 sec.
```
</details>