access2rohit opened a new issue #20011:
URL: https://github.com/apache/incubator-mxnet/issues/20011
## Description
The unix-gpu pipeline has flaky tests in the `Python3:GPU` and `cpp package GPU
Makefile` stages. They fail quite frequently, even on PRs that do not touch the code under test.
## Occurrences
`Python3:GPU` failing test:
```
[2021-03-11T18:04:29.187Z] test_operator_gpu.test_kernel_error_checking ...
[18:04:24] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[2021-03-11T18:04:32.459Z] Process SpawnProcess-1:
[2021-03-11T18:04:32.460Z] Traceback (most recent call last):
[2021-03-11T18:04:32.460Z] File
"/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
[2021-03-11T18:04:32.460Z] self.run()
[2021-03-11T18:04:32.460Z] File
"/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
[2021-03-11T18:04:32.460Z] self._target(*self._args, **self._kwargs)
[2021-03-11T18:04:32.460Z] File
"/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2238, in
kernel_error_check_imperative
[2021-03-11T18:04:32.460Z] c = (a / b).asnumpy()
[2021-03-11T18:04:32.460Z] File
"/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py",
line 354, in __truediv__
[2021-03-11T18:04:32.460Z] return divide(self, other)
[2021-03-11T18:04:32.460Z] File
"/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py",
line 3820, in divide
[2021-03-11T18:04:32.460Z] _internal._rdiv_scalar)
[2021-03-11T18:04:32.460Z] File
"/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py",
line 3576, in _ufunc_helper
[2021-03-11T18:04:32.460Z] return fn_array(lhs, rhs)
[2021-03-11T18:04:32.460Z] File "<string>", line 52, in broadcast_div
[2021-03-11T18:04:32.460Z] File "mxnet/cython/ndarray.pyx", line 219, in
mxnet._cy3.ndarray._imperative_invoke
[2021-03-11T18:04:32.460Z] File "mxnet/cython/./base.pyi", line 58, in
mxnet._cy3.ndarray.CALL
[2021-03-11T18:04:32.460Z] mxnet.base.MXNetError: Traceback (most recent
call last):
[2021-03-11T18:04:32.460Z] [bt] (9)
/usr/local/bin/python3(_PyEval_EvalFrameDefault+0x44b2) [0x561b1fe37ac2]
[2021-03-11T18:04:32.460Z] [bt] (8)
/usr/local/bin/python3(_PyCFunction_FastCallKeywords+0x20) [0x561b1fdc3de0]
[2021-03-11T18:04:32.460Z] [bt] (7)
/usr/local/bin/python3(_PyMethodDef_RawFastCallKeywords+0x250) [0x561b1fdc4050]
[2021-03-11T18:04:32.460Z] [bt] (6)
/work/mxnet/tests/python/unittest/../../../python/mxnet/_cy3/ndarray.cpython-37m-x86_64-linux-gnu.so(+0x14699)
[0x7eff14049699]
[2021-03-11T18:04:32.460Z] [bt] (5)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x8b)
[0x7eff8be0653b]
[2021-03-11T18:04:32.460Z] [bt] (4)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*,
int, void**, int*, void***, int, char const**, char const**)+0x543)
[0x7eff8be04c73]
[2021-03-11T18:04:32.460Z] [bt] (3)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context
const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*,
std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*,
std::allocator<mxnet::NDArray*> > const&)+0xe6) [0x7eff8b566836]
[2021-03-11T18:04:32.460Z] [bt] (2)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context
const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*,
std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*,
std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x140e)
[0x7eff8b560b6e]
[2021-03-11T18:04:32.460Z] [bt] (1)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::BinaryBroadcastShape(nnvm::NodeAttrs
const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*,
std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x38e)
[0x7eff86af62ae]
[2021-03-11T18:04:32.460Z] [bt] (0)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72)
[0x7eff8682df82]
[2021-03-11T18:04:32.460Z] File
"src/operator/numpy/linalg/./../../tensor/elemwise_binary_broadcast_op.h", line
68
[2021-03-11T18:04:32.460Z] MXNetError: Check failed: l == 1 || r == 1:
operands could not be broadcast together with shapes [3] [0]
[2021-03-11T18:04:32.460Z] [18:04:28] src/engine/naive_engine.cc:74: Engine
shutdown
[2021-03-11T18:04:34.985Z] [18:04:30] src/engine/engine.cc:55: MXNet start
using engine: NaiveEngine
[2021-03-11T18:04:38.257Z] Process SpawnProcess-2:
[2021-03-11T18:04:38.257Z] Traceback (most recent call last):
[2021-03-11T18:04:38.257Z] File
"/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
[2021-03-11T18:04:38.257Z] self.run()
[2021-03-11T18:04:38.257Z] File
"/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
[2021-03-11T18:04:38.257Z] self._target(*self._args, **self._kwargs)
[2021-03-11T18:04:38.257Z] File
"/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2247, in
kernel_error_check_symbolic
[2021-03-11T18:04:38.257Z] 'b':mx.nd.array([],ctx=mx.gpu(0))})
[2021-03-11T18:04:38.257Z] File
"/work/mxnet/tests/python/unittest/../../../python/mxnet/symbol/symbol.py",
line 2119, in bind
[2021-03-11T18:04:38.257Z] ctypes.byref(handle)))
[2021-03-11T18:04:38.257Z] File
"/work/mxnet/tests/python/unittest/../../../python/mxnet/base.py", line 246, in
check_call
[2021-03-11T18:04:38.257Z] raise get_last_ffi_error()
[2021-03-11T18:04:38.257Z] mxnet.base.MXNetError: Traceback (most recent
call last):
[2021-03-11T18:04:38.257Z] [bt] (8)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0x8f5)
[0x7f1e070e99f5]
[2021-03-11T18:04:38.257Z] [bt] (7)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Executor::Bind(nnvm::Symbol,
mxnet::Context const&, std::map<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >, mxnet::Context,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > >
const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&,
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&,
std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&,
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&,
mxnet::Executor*)+0x219) [0x7f1e071f1139]
[2021-03-11T18:04:38.257Z] [bt] (6)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol,
mxnet::Context const&, std::map<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >, mxnet::Context,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > >
const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&,
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&,
std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&,
std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&,
mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray,
nnvm::NodeEntryHash, nnvm::NodeEntryEqual,
std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > >
const&)+0x120c) [0x7f1e071e4a0c]
[2021-03-11T18:04:38.257Z] [bt] (5)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::InferShape(nnvm::Graph&&,
std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >&&,
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
const&)+0x69) [0x7f1e071c08a9]
[2021-03-11T18:04:38.257Z] [bt] (4)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f05d99) [0x7f1e071bdd99]
[2021-03-11T18:04:38.257Z] [bt] (3)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f0242b) [0x7f1e071ba42b]
[2021-03-11T18:04:38.257Z] [bt] (2)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<2,
1>(nnvm::NodeAttrs const&, std::vector<mxnet::TShape,
std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape,
std::allocator<mxnet::TShape> >*)+0x5ab) [0x7f1e0266305b]
[2021-03-11T18:04:38.257Z] [bt] (1)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::ElemwiseAttrHelper<mxnet::TShape,
&mxnet::op::shape_is_none, &mxnet::op::shape_assign, true,
&mxnet::op::shape_string[abi:cxx11], -1, -1>(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*,
std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, mxnet::TShape
const&)::{lambda(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >
const&, unsigned long, char const*)#1}::operator()(std::vector<mxnet::TShape,
std::allocator<mxnet::TShape> > const&, unsigned long, char const*)
const+0x1276) [0x7f1e01bc6126]
[2021-03-11T18:04:38.257Z] [bt] (0)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72)
[0x7f1e01b59f82]
[2021-03-11T18:04:38.257Z] MXNetError: Error in operator _div0: [18:04:33]
src/operator/numpy/linalg/./../../tensor/../elemwise_op_common.h:135: Check
failed: assign(&dattr, vec.at(i)): Incompatible attr in node _div0 at 1-th
input: expected [3], got [0]
[2021-03-11T18:04:38.257Z] ok (11.0016s)
```
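For readers unfamiliar with the `Check failed: l == 1 || r == 1` message in the log above, here is a minimal pure-Python sketch of a NumPy-style broadcast shape check. The function name `broadcast_shape` is hypothetical and this is not MXNet's implementation. It only illustrates why shapes `[3]` and `[0]` are rejected: the two dimensions differ and neither is 1.

```python
def broadcast_shape(lhs, rhs):
    """Return the broadcast result shape, or raise ValueError if incompatible.

    Illustrative only: mirrors the usual NumPy-style rule, not MXNet's code.
    """
    ndim = max(len(lhs), len(rhs))
    # Left-pad the shorter shape with 1s so both have the same rank.
    padded_l = (1,) * (ndim - len(lhs)) + tuple(lhs)
    padded_r = (1,) * (ndim - len(rhs)) + tuple(rhs)

    out = []
    for l, r in zip(padded_l, padded_r):
        if l == r:
            out.append(l)
        elif l == 1:
            out.append(r)
        elif r == 1:
            out.append(l)
        else:
            # Same condition as the failed check: neither l nor r is 1.
            raise ValueError(
                "operands could not be broadcast together with shapes "
                f"{list(lhs)} {list(rhs)}"
            )
    return tuple(out)


if __name__ == "__main__":
    print(broadcast_shape((2, 1), (3,)))
    try:
        broadcast_shape((3,), (0,))
    except ValueError as err:
        print(err)
```

In this sketch, a length-3 array divided by an empty array raises at shape inference, before any kernel is launched, which matches where the backtrace above stops (`BinaryBroadcastShape`).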
`cpp package GPU Makefile` failing test:
```
[2021-03-11T18:29:20.262Z] [18:29:15]
cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput
symbol testing, executor should be able to bind without label.
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z] Segmentation fault: 11
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z] Segmentation fault: 11
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z]
[2021-03-11T18:29:20.262Z] Segmentation fault: 11
```
## Next Steps
These failures are blocking PRs and making CI unstable. The immediate action is to
disable the affected tests and then investigate.
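
One way to disable a flaky Python test while keeping it tracked is a skip decorator that points back to this issue. The sketch below uses only the standard-library `unittest` module. The `FlakyGpuTests` class and the `run_suite` helper are hypothetical names used for illustration, and the real test in `tests/python/gpu/test_operator_gpu.py` may be structured differently. This approach does not apply to the `cpp package GPU Makefile` test, which would need to be disabled in the CI scripts instead.

```python
import unittest

# URL of this issue, so the skip reason stays traceable.
ISSUE_URL = "https://github.com/apache/incubator-mxnet/issues/20011"


class FlakyGpuTests(unittest.TestCase):  # hypothetical wrapper class
    @unittest.skip(f"Flaky on unix-gpu CI, tracked in {ISSUE_URL}")
    def test_kernel_error_checking(self):
        # The body is never executed while the test is skipped.
        raise AssertionError("should not run while skipped")


def run_suite():
    """Run the suite and return the unittest.TestResult (hypothetical helper)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FlakyGpuTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    res = run_suite()
    for test, reason in res.skipped:
        print(f"skipped {test.id()}: {reason}")
```

Keeping the issue URL in the skip reason makes it easier to find and re-enable the test once the root cause is fixed.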