access2rohit opened a new issue #20011:
URL: https://github.com/apache/incubator-mxnet/issues/20011


   ## Description
   unix-gpu has some flaky tests on `Python3:GPU` and `cpp package GPU Makefile`. They fail quite frequently, even on changes that do not touch the code they exercise.
   
   
   ## Occurrences
   `Python3:GPU` failing test:
   ```
   [2021-03-11T18:04:29.187Z] test_operator_gpu.test_kernel_error_checking ... 
[18:04:24] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
   
   [2021-03-11T18:04:32.459Z] Process SpawnProcess-1:
   
   [2021-03-11T18:04:32.460Z] Traceback (most recent call last):
   
   [2021-03-11T18:04:32.460Z]   File 
"/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
   
   [2021-03-11T18:04:32.460Z]     self.run()
   
   [2021-03-11T18:04:32.460Z]   File 
"/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
   
   [2021-03-11T18:04:32.460Z]     self._target(*self._args, **self._kwargs)
   
   [2021-03-11T18:04:32.460Z]   File 
"/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2238, in 
kernel_error_check_imperative
   
   [2021-03-11T18:04:32.460Z]     c = (a / b).asnumpy()
   
   [2021-03-11T18:04:32.460Z]   File 
"/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", 
line 354, in __truediv__
   
   [2021-03-11T18:04:32.460Z]     return divide(self, other)
   
   [2021-03-11T18:04:32.460Z]   File 
"/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", 
line 3820, in divide
   
   [2021-03-11T18:04:32.460Z]     _internal._rdiv_scalar)
   
   [2021-03-11T18:04:32.460Z]   File 
"/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", 
line 3576, in _ufunc_helper
   
   [2021-03-11T18:04:32.460Z]     return fn_array(lhs, rhs)
   
   [2021-03-11T18:04:32.460Z]   File "<string>", line 52, in broadcast_div
   
   [2021-03-11T18:04:32.460Z]   File "mxnet/cython/ndarray.pyx", line 219, in 
mxnet._cy3.ndarray._imperative_invoke
   
   [2021-03-11T18:04:32.460Z]   File "mxnet/cython/./base.pyi", line 58, in 
mxnet._cy3.ndarray.CALL
   
   [2021-03-11T18:04:32.460Z] mxnet.base.MXNetError: Traceback (most recent 
call last):
   
   [2021-03-11T18:04:32.460Z]   [bt] (9) 
/usr/local/bin/python3(_PyEval_EvalFrameDefault+0x44b2) [0x561b1fe37ac2]
   
   [2021-03-11T18:04:32.460Z]   [bt] (8) 
/usr/local/bin/python3(_PyCFunction_FastCallKeywords+0x20) [0x561b1fdc3de0]
   
   [2021-03-11T18:04:32.460Z]   [bt] (7) 
/usr/local/bin/python3(_PyMethodDef_RawFastCallKeywords+0x250) [0x561b1fdc4050]
   
   [2021-03-11T18:04:32.460Z]   [bt] (6) 
/work/mxnet/tests/python/unittest/../../../python/mxnet/_cy3/ndarray.cpython-37m-x86_64-linux-gnu.so(+0x14699)
 [0x7eff14049699]
   
   [2021-03-11T18:04:32.460Z]   [bt] (5) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x8b) 
[0x7eff8be0653b]
   
   [2021-03-11T18:04:32.460Z]   [bt] (4) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, 
int, void**, int*, void***, int, char const**, char const**)+0x543) 
[0x7eff8be04c73]
   
   [2021-03-11T18:04:32.460Z]   [bt] (3) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context
 const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, 
std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, 
std::allocator<mxnet::NDArray*> > const&)+0xe6) [0x7eff8b566836]
   
   [2021-03-11T18:04:32.460Z]   [bt] (2) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context
 const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, 
std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, 
std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x140e) 
[0x7eff8b560b6e]
   
   [2021-03-11T18:04:32.460Z]   [bt] (1) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::BinaryBroadcastShape(nnvm::NodeAttrs
 const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, 
std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x38e) 
[0x7eff86af62ae]
   
   [2021-03-11T18:04:32.460Z]   [bt] (0) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72)
 [0x7eff8682df82]
   
   [2021-03-11T18:04:32.460Z]   File 
"src/operator/numpy/linalg/./../../tensor/elemwise_binary_broadcast_op.h", line 
68
   
   [2021-03-11T18:04:32.460Z] MXNetError: Check failed: l == 1 || r == 1: 
operands could not be broadcast together with shapes [3] [0]
   
   [2021-03-11T18:04:32.460Z] [18:04:28] src/engine/naive_engine.cc:74: Engine 
shutdown
   
   [2021-03-11T18:04:34.985Z] [18:04:30] src/engine/engine.cc:55: MXNet start 
using engine: NaiveEngine
   
   [2021-03-11T18:04:38.257Z] Process SpawnProcess-2:
   
   [2021-03-11T18:04:38.257Z] Traceback (most recent call last):
   
   [2021-03-11T18:04:38.257Z]   File 
"/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
   
   [2021-03-11T18:04:38.257Z]     self.run()
   
   [2021-03-11T18:04:38.257Z]   File 
"/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
   
   [2021-03-11T18:04:38.257Z]     self._target(*self._args, **self._kwargs)
   
   [2021-03-11T18:04:38.257Z]   File 
"/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2247, in 
kernel_error_check_symbolic
   
   [2021-03-11T18:04:38.257Z]     'b':mx.nd.array([],ctx=mx.gpu(0))})
   
   [2021-03-11T18:04:38.257Z]   File 
"/work/mxnet/tests/python/unittest/../../../python/mxnet/symbol/symbol.py", 
line 2119, in bind
   
   [2021-03-11T18:04:38.257Z]     ctypes.byref(handle)))
   
   [2021-03-11T18:04:38.257Z]   File 
"/work/mxnet/tests/python/unittest/../../../python/mxnet/base.py", line 246, in 
check_call
   
   [2021-03-11T18:04:38.257Z]     raise get_last_ffi_error()
   
   [2021-03-11T18:04:38.257Z] mxnet.base.MXNetError: Traceback (most recent 
call last):
   
   [2021-03-11T18:04:38.257Z]   [bt] (8) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0x8f5) 
[0x7f1e070e99f5]
   
   [2021-03-11T18:04:38.257Z]   [bt] (7) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Executor::Bind(nnvm::Symbol,
 mxnet::Context const&, std::map<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, mxnet::Context, 
std::less<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > >, 
std::allocator<std::pair<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > 
const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, 
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, 
std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, 
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, 
mxnet::Executor*)+0x219) [0x7f1e071f1139]
   
   [2021-03-11T18:04:38.257Z]   [bt] (6) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol,
 mxnet::Context const&, std::map<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, mxnet::Context, 
std::less<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > >, 
std::allocator<std::pair<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > 
const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, 
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, 
std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, 
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, 
mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, 
nnvm::NodeEntryHash, nnvm::NodeEntryEqual, 
std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > 
const&)+0x120c) [0x7f1e071e4a0c]
   
   [2021-03-11T18:04:38.257Z]   [bt] (5) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::InferShape(nnvm::Graph&&,
 std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >&&, 
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > 
const&)+0x69) [0x7f1e071c08a9]
   
   [2021-03-11T18:04:38.257Z]   [bt] (4) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f05d99) [0x7f1e071bdd99]
   
   [2021-03-11T18:04:38.257Z]   [bt] (3) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f0242b) [0x7f1e071ba42b]
   
   [2021-03-11T18:04:38.257Z]   [bt] (2) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<2, 
1>(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, 
std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, 
std::allocator<mxnet::TShape> >*)+0x5ab) [0x7f1e0266305b]
   
   [2021-03-11T18:04:38.257Z]   [bt] (1) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::ElemwiseAttrHelper<mxnet::TShape,
 &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, 
&mxnet::op::shape_string[abi:cxx11], -1, -1>(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&, 
std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, 
std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, mxnet::TShape 
const&)::{lambda(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > 
const&, unsigned long, char const*)#1}::operator()(std::vector<mxnet::TShape, 
std::allocator<mxnet::TShape> > const&, unsigned long, char const*) 
const+0x1276) [0x7f1e01bc6126]
   
   [2021-03-11T18:04:38.257Z]   [bt] (0) 
/work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72)
 [0x7f1e01b59f82]
   
   [2021-03-11T18:04:38.257Z] MXNetError: Error in operator _div0: [18:04:33] 
src/operator/numpy/linalg/./../../tensor/../elemwise_op_common.h:135: Check 
failed: assign(&dattr, vec.at(i)): Incompatible attr in node _div0 at 1-th 
input: expected [3], got [0]
   [2021-03-11T18:04:38.257Z] ok (11.0016s)
   ```
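
For readers unfamiliar with the check named in the first traceback (`Check failed: l == 1 || r == 1` for shapes `[3]` and `[0]`), here is a minimal pure-Python sketch of a broadcast-compatibility rule of that form. It is an illustration only, not MXNet's implementation; the function name `broadcast_shape` and the exception type are assumptions made for the example.

```python
def broadcast_shape(lhs, rhs):
    """Return the broadcast result of two shapes, or raise ValueError.

    Hypothetical helper for illustration; not part of MXNet.
    """
    # Right-align the two shapes by padding the shorter one with leading 1s.
    ndim = max(len(lhs), len(rhs))
    lhs = (1,) * (ndim - len(lhs)) + tuple(lhs)
    rhs = (1,) * (ndim - len(rhs)) + tuple(rhs)

    out = []
    for l, r in zip(lhs, rhs):
        # Dimensions are compatible when they match or when one of them is 1,
        # which is the same condition quoted in the log message.
        if l != r and l != 1 and r != 1:
            raise ValueError(
                "operands could not be broadcast together with shapes "
                f"{list(lhs)} {list(rhs)}"
            )
        out.append(r if l == 1 else l)
    return tuple(out)


if __name__ == "__main__":
    print(broadcast_shape((3,), (1,)))
    try:
        broadcast_shape((3,), (0,))
    except ValueError as err:
        print(err)
```

Under this rule, a length-3 array and an empty array (shape `[0]`) are incompatible, which matches the shapes reported in the traceback.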
   
   `cpp package GPU Makefile` failing test:
   
   ```
   [2021-03-11T18:29:20.262Z] [18:29:15] 
cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput 
symbol testing, executor should be able to bind without label.
   
   [2021-03-11T18:29:20.262Z] 
   
   [2021-03-11T18:29:20.262Z] Segmentation fault: 11
   
   [2021-03-11T18:29:20.262Z] 
   
   [2021-03-11T18:29:20.262Z] 
   
   [2021-03-11T18:29:20.262Z] Segmentation fault: 11
   
   [2021-03-11T18:29:20.262Z] 
   
   [2021-03-11T18:29:20.262Z] 
   
   [2021-03-11T18:29:20.262Z] Segmentation fault: 11
   
   ```
   ## Next Steps
   These failures are blocking PRs and making CI unstable. The immediate action is to disable the tests and then investigate.
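
One way a Python test could be disabled temporarily is sketched below, using the standard library's `unittest.skip` decorator with a reason that points back to this issue. This is an assumption about the mechanism, not the project's actual change: the class name `OperatorGpuTests` and the helper `run_suite` are hypothetical, and the real test suite may use a different runner or skip marker.

```python
import unittest

# The reason string links the skip to this issue so it can be found and reverted later.
FLAKY_REASON = (
    "Flaky on unix-gpu CI, see "
    "https://github.com/apache/incubator-mxnet/issues/20011"
)


class OperatorGpuTests(unittest.TestCase):  # hypothetical test class
    @unittest.skip(FLAKY_REASON)
    def test_kernel_error_checking(self):
        # The body is never executed while the skip decorator is in place.
        raise AssertionError("should not run while skipped")


def run_suite():
    """Run the hypothetical test class and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OperatorGpuTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    result = run_suite()
    for test, reason in result.skipped:
        print(f"skipped {test.id()}: {reason}")
```

Recording the issue URL in the skip reason keeps the disabled test visible in test reports until the investigation is done.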



