khaotik opened a new issue #20702: URL: https://github.com/apache/incubator-mxnet/issues/20702
## Description

I think this is a variant of #8029 and #16736, somehow related to `_FusedOp` on GPU.

### Error Message

```text
[23:52:58] /home/khaotik/WKSP/dev/incubator-mxnet/src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for GPU
[23:52:58] /home/khaotik/WKSP/dev/incubator-mxnet/src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for CPU
Traceback (most recent call last):
  File "test.py", line 91, in <module>
    v_y = fn.forward(v_x)
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/gluon/block.py", line 1821, in forward
    return self._call_cached_op(x, *args)
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/gluon/block.py", line 1267, in _call_cached_op
    out = self._cached_op(*cargs)
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/_ctypes/cached_op.py", line 126, in __call__
    check_call(_LIB.MXInvokeCachedOp(
  File "/media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  [bt] (6) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(MXInvokeCachedOp+0x21b) [0x7fa4638b05bb]
  [bt] (5) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::CachedOp::Forward(std::shared_ptr<mxnet::CachedOp> const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::Context const&)+0x356) [0x7fa463a05f86]
  [bt] (4) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::CachedOp::GetCachedOpState(mxnet::Context const&)+0x142) [0x7fa4639fbca2]
  [bt] (3) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::CachedOp::CachedOpState::CachedOpState(mxnet::Context const&, nnvm::Graph const&, nnvm::Graph const&, bool)+0x1220) [0x7fa463a1f590]
  [bt] (2) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(nnvm::Graph::indexed_graph() const+0x3b) [0x7fa46fb33c8b]
  [bt] (1) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(nnvm::IndexedGraph::IndexedGraph(nnvm::Graph const&)+0x502) [0x7fa46fb31b72]
  [bt] (0) /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet/../../build/libmxnet.so(+0xe485022) [0x7fa46fb31022]
  File "/home/khaotik/WKSP/dev/incubator-mxnet/3rdparty/tvm/nnvm/src/core/graph.cc", line 107
MXNetError: Check failed: it != node2index_.end(): control dep not found in graph
```

## To Reproduce

Here's a simplified version of my original model that raised this error.

```python
import mxnet as mx
import mxnet.symbol as ms

def _tensorAt(s_inp, i:int):
    '''symbolic s[i]'''
    return ms.squeeze(ms.slice(s_inp, i, i+1), axis=0)

def _splineInterpolation(t, cv_list, degree=3):
    '''
    Simplified NURBS interpolation

    Returns:
        a weighted linear combination from a slice of `cv_list`
        up to `degree+1` terms
        weights are positive and sums to 1.0

    Args:
        t: float
            t <= -1 -> gives the first element of cv_list
            t >= len(cv_list) -> gives the last element of cv_list
        cv_list: list
            list of items that can be added or multiplied by a float number
            both symbolic and numeric value would work
        degree: int
            must be positive

    Example:
        >>> _splineInterpolation(t=-1., cv_list=[a,b,c,d,e,f])
        a
        >>> _splineInterpolation(t=-0.4, cv_list=[a,b,c,d,e,f])
        0.964*a + 0.036*b
        >>> _splineInterpolation(t=2.6, cv_list=[a,b,c,d,e,f])
        0.010666*b + 0.41666*c + 0.538666*d + 0.036*e
    '''
    from math import floor
    assert(isinstance(degree,int) and degree > 0)
    num_cv = len(cv_list)
    assert(num_cv > 0)
    mt = t % 1.
    ct = 1. - mt
    coeff_list = [mt, ct]
    for i in range(1,degree):
        wgt = [(mt+j) / (i+1) for j in range(i+1)]
        new_coeff_list = [0.] * (i+2)
        new_coeff_list[0] = coeff_list[0] * wgt[0]
        new_coeff_list[-1] = coeff_list[-1] * (1.-wgt[-1])
        for j in range(1,i+1):
            new_coeff_list[j] = coeff_list[j]*wgt[j] + coeff_list[j-1]*(1.-wgt[j-1])
        coeff_list = new_coeff_list
    coeff_di = dict()
    for i in range(degree+1):
        knot_idx = min(num_cv-1, max(0, floor(t-(degree>>1)+i)))
        if knot_idx in coeff_di:
            coeff_di[knot_idx] += coeff_list[degree-i]
        else:
            coeff = coeff_list[degree-i]
            if coeff > 0.:
                coeff_di[knot_idx] = coeff
    # make sure weight sum is 1, normalize numeric error
    scale = 1./sum(coeff_di.values())
    for k in coeff_di:
        coeff_di[k] *= scale
    if len(coeff_di) > 1:
        return sum((cv_list[i]*c if c!=1.0 else cv_list[i]) for i,c in enumerate(coeff_di))
    else:
        return cv_list[list(coeff_di.keys())[0]]

# constants
BATCH_SIZE=2
NDIM = 16
CTX = mx.gpu()
PARAM_DEPTH=8
RESNET_DEPTH=16

# build/bind symbolic model
s_x = ms.var('x', shape=(BATCH_SIZE,NDIM,), dtype='float32');
s_w0 = ms.var('w0', shape=(PARAM_DEPTH,NDIM,NDIM))
s_w0_li = [_tensorAt(s_w0,i) for i in range(PARAM_DEPTH)]
s_mid = s_x
for i in range(RESNET_DEPTH):
    spline_t = (i*(PARAM_DEPTH+1))/(RESNET_DEPTH-1) - 1.
    s_res = s_mid
    s_res = ms.dot(s_res, _splineInterpolation(spline_t, s_w0_li))
    s_res = ms.relu(s_res)
    s_mid = s_mid + s_res
s_y = s_mid
fn = mx.gluon.SymbolBlock((s_y,),(s_x,))
fn.initialize(ctx=CTX)

# run
scale = NDIM**(-0.5)
v_x = mx.nd.random_uniform(-scale,scale, shape=(BATCH_SIZE,NDIM), ctx=CTX)
v_y = fn.forward(v_x)
```

### Steps to reproduce

Run the above script with current master branch (5d247f13fcf5e55b094c5deb90ede1d3a03cc9ac) with GPU.

## What have you tried to solve it?

- The code works with the environment variable `MXNET_USE_FUSION=0` set.
- The code works when using `CTX = mx.cpu()` instead.
- The code works with a smaller constant in the Python code, `RESNET_DEPTH=6`.

## Environment

- Single GPU GTX 1080 Ti on Ubuntu 20.04
- Anaconda Python 3.8
- MXNet is built from source via current master branch (5d247f13fcf5e55b094c5deb90ede1d3a03cc9ac)

***We recommend using our script for collecting the diagnostic information with the following command***

`curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3`

<details>
<summary>Environment Information</summary>

```
----------Python Info----------
Version : 3.8.3
Compiler : GCC 7.3.0
Build : ('default', 'Jul 2 2020 16:21:59')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 21.0.1
Directory : /home/khaotik/anaconda3/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version : 2.0.0
Directory : /media/LNXDATA/WKSP/dev/incubator-mxnet/python/mxnet
Num GPUs : 1
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-5.11.0-40-generic-x86_64-with-glibc2.10
system : Linux
node : KKST2
release : 5.11.0-40-generic
version : #44~20.04.1-Ubuntu SMP Wed Oct 20 19:04:34 UTC 2021
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
Stepping: 10
CPU MHz: 4116.771
CPU max MHz: 4300.0000
CPU min MHz: 800.0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 1.5 MiB
L3 cache: 9 MiB
NUMA node0 CPU(s): 0-5
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX vulnerable, SMT disabled
Vulnerability Mds: Vulnerable; SMT disabled
Vulnerability Meltdown: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds: Vulnerable
Vulnerability Tsx async abort: Vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec
xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0001 sec, LOAD: 1.0473 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0009 sec, LOAD: 0.8119 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0003 sec, LOAD: 3.1800 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0010 sec, LOAD: 0.6069 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0002 sec, LOAD: 0.6927 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0003 sec, LOAD: 1.9676 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0009 sec, LOAD: 1.5478 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0009582042694091797 sec.
```

</details>
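
For anyone hitting the same check failure, the `MXNET_USE_FUSION=0` workaround listed under "What have you tried to solve it?" can also be applied from inside the script rather than from the shell. This is only a minimal sketch: the helper name `disable_fusion` is hypothetical, and it assumes that setting the variable before `mxnet` is imported is sufficient. It sidesteps GPU pointwise fusion and does not fix the underlying graph error.

```python
import os

def disable_fusion(env=None):
    """Hypothetical helper: switch off MXNet's GPU pointwise fusion via MXNET_USE_FUSION."""
    env = os.environ if env is None else env
    env["MXNET_USE_FUSION"] = "0"
    return env

# Apply the workaround to the current process environment.
disable_fusion()

# Assumption: import MXNet only after the variable is set, then run the
# reproduction script unchanged.
# import mxnet as mx
```

From a shell, the equivalent would be `MXNET_USE_FUSION=0 python test.py`, where `test.py` is the script name shown in the traceback.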