barry-jin opened a new issue #20979:
URL: https://github.com/apache/incubator-mxnet/issues/20979
## Description
`test_convolution_large_c` is a flaky test that severely blocks the CD
pipeline.
Here is the error message:
```
[2022-03-24T04:07:35.481Z] =================================== FAILURES ===================================
[2022-03-24T04:07:35.481Z] ___________________________ test_convolution_large_c ___________________________
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z]     @pytest.mark.serial
[2022-03-24T04:07:35.481Z]     def test_convolution_large_c():
[2022-03-24T04:07:35.481Z]         problematic_c = 64 * 1024
[2022-03-24T04:07:35.481Z]         # The convolution accumulates many values, so scale the input magnitude.
[2022-03-24T04:07:35.481Z]         scale = 0.1
[2022-03-24T04:07:35.481Z]         def test_1D_with_width(width, grad_req):
[2022-03-24T04:07:35.481Z]             ctx_list = [{'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, width), 'type_dict': {'conv_data': np.float32}},
[2022-03-24T04:07:35.481Z]                         {'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, width), 'type_dict': {'conv_data': np.float64}}]
[2022-03-24T04:07:35.481Z]             sym = mx.sym.Convolution(layout='NCW', num_filter=8, kernel=(2,), name='conv')
[2022-03-24T04:07:35.481Z]             check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z]         def test_2D_with_width(width, grad_req):
[2022-03-24T04:07:35.481Z]             ctx_list = [{'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, 2, width), 'type_dict': {'conv_data': np.float32}},
[2022-03-24T04:07:35.481Z]                         {'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, 2, width), 'type_dict': {'conv_data': np.float64}}]
[2022-03-24T04:07:35.481Z]             sym = mx.sym.Convolution(layout='NCHW', num_filter=4, kernel=(2,2), name='conv')
[2022-03-24T04:07:35.481Z]             check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z]         # Run with different data tensor shapes to run cudnnFind() multiple times.
[2022-03-24T04:07:35.481Z]         # First, populate algo and op caches with models that always use cudnnFind() (req == 'write').
[2022-03-24T04:07:35.481Z]         # Then run models that must avoid cached cudnnFind() results in some cases (req == 'add').
[2022-03-24T04:07:35.481Z]         widths = [4, 16, 64]
[2022-03-24T04:07:35.481Z]         for req in ['write', 'add']:
[2022-03-24T04:07:35.481Z]             for width in widths:
[2022-03-24T04:07:35.481Z]                 test_1D_with_width(width, req)
[2022-03-24T04:07:35.481Z] >               test_2D_with_width(width, req)
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z] tests/python/gpu/test_operator_gpu.py:688:
[2022-03-24T04:07:35.481Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2022-03-24T04:07:35.481Z] tests/python/gpu/test_operator_gpu.py:679: in test_2D_with_width
[2022-03-24T04:07:35.481Z]     check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
[2022-03-24T04:07:35.481Z] python/mxnet/test_utils.py:1673: in check_consistency
[2022-03-24T04:07:35.481Z]     assert_almost_equal(arr, gtarr, rtol=rt, atol=at, equal_nan=equal_nan)
[2022-03-24T04:07:35.481Z] python/mxnet/test_utils.py:689: in assert_almost_equal
[2022-03-24T04:07:35.481Z]     a = a.asnumpy()
[2022-03-24T04:07:35.481Z] python/mxnet/ndarray/ndarray.py:2640: in asnumpy
[2022-03-24T04:07:35.481Z]     check_call(_LIB.MXNDArraySyncCopyToCPU(
[2022-03-24T04:07:35.481Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z] ret = -1
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z]     def check_call(ret):
[2022-03-24T04:07:35.481Z]         """Check the return value of C API call.
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z]         This function will raise an exception when an error occurs.
[2022-03-24T04:07:35.481Z]         Wrap every API call with this function.
[2022-03-24T04:07:35.481Z]
[2022-03-24T04:07:35.481Z]         Parameters
[2022-03-24T04:07:35.481Z]         ----------
[2022-03-24T04:07:35.481Z]         ret : int
[2022-03-24T04:07:35.481Z]             return value from API calls.
[2022-03-24T04:07:35.481Z]         """
[2022-03-24T04:07:35.481Z]         if ret != 0:
[2022-03-24T04:07:35.481Z] >           raise get_last_ffi_error()
[2022-03-24T04:07:35.481Z] E           mxnet.base.MXNetError: Traceback (most recent call last):
[2022-03-24T04:07:35.481Z] E             [bt] (14) /usr/lib64/libc.so.6(clone+0x6d) [0x7f96c59728dd]
[2022-03-24T04:07:35.481Z] E             [bt] (13) /usr/lib64/libpthread.so.0(+0x7ea5) [0x7f96c6352ea5]
[2022-03-24T04:07:35.481Z] E             [bt] (12) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x130006ef) [0x7f94897ce6ef]
[2022-03-24T04:07:35.481Z] E             [bt] (11) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x32) [0x7f9478ac97e2]
[2022-03-24T04:07:35.481Z] E             [bt] (10) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x36) [0x7f9478ad2086]
[2022-03-24T04:07:35.481Z] E             [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x530) [0x7f9478ad1c80]
[2022-03-24T04:07:35.481Z] E             [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*, mxnet::engine::CallbackOnStart, mxnet::engine::CallbackOnComplete)+0x5bd) [0x7f9478acae3d]
[2022-03-24T04:07:35.481Z] E             [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnStart, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnStart, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnStart&&, mxnet::engine::CallbackOnComplete&&)+0xc1) [0x7f9478ac2ee1]
[2022-03-24T04:07:35.481Z] E             [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f9478b40577]
[2022-03-24T04:07:35.481Z] E             [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x259) [0x7f9478b3fe29]
[2022-03-24T04:07:35.481Z] E             [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::ConvolutionGradCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x577) [0x7f94813a2d93]
[2022-03-24T04:07:35.481Z] E             [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::cudnn::Exec<mxnet::op::cudnn::ConvDgrad, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::TBlob const&>(mxnet::OpContext const&, mxnet::op::cudnn::ConvDgrad::Param const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::TBlob const&)+0x1f7) [0x7f94813b9a75]
[2022-03-24T04:07:35.481Z] E             [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::cudnn::ConvDgrad::Make(mxnet::OpContext const&, mxnet::op::cudnn::ConvParam const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::TBlob const&)+0x27c) [0x7f94792d05cc]
[2022-03-24T04:07:35.481Z] E             [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::cudnn::SelectPlan(mxnet::OpContext const&, mxnet::op::cudnn::ConvParam const&, std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer>, unsigned long, std::function<std::string ()> const&, std::vector<long, std::allocator<long> > const&, std::vector<void*, std::allocator<void*> > const&, long, std::string const&)+0x426) [0x7f94792cf156]
[2022-03-24T04:07:35.481Z] E             [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::cudnn_cxx::FindTopPlans(std::vector<std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer>, std::allocator<std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer> > >&&, unsigned long, cudnnContext*, std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer> const&, std::function<std::optional<float> (float)>)+0x86a) [0x7f9478a0aaea]
[2022-03-24T04:07:35.481Z] E           File "../src/common/cuda/cudnn_cxx.cc", line 232
[2022-03-24T04:07:35.481Z] E           cuDNN: CUDNN_STATUS_INTERNAL_ERROR
```
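For context, the `check_call` wrapper shown in the traceback is the standard FFI pattern of turning a C API status code into a Python exception: every native call returns an `int`, and any nonzero value (here `ret = -1`) is raised rather than silently dropped, which is why a cuDNN failure deep in `FindTopPlans` surfaces as `mxnet.base.MXNetError` at the `asnumpy()` call. A minimal self-contained sketch of that pattern follows; the `MXNetError` class here is a stand-in for illustration, not MXNet's actual `mxnet.base` implementation:

```python
class MXNetError(RuntimeError):
    """Stand-in for mxnet.base.MXNetError (hypothetical, for this sketch only)."""


def check_call(ret):
    """Raise when a C API call reports failure.

    Mirrors the pattern in the traceback: a nonzero status code from the
    native library becomes a Python exception. The real wrapper calls
    get_last_ffi_error() to recover the C-side message and backtrace;
    here we just report the status code.
    """
    if ret != 0:
        raise MXNetError(f"C API call failed with status {ret}")


# A successful call (status 0) passes through silently:
check_call(0)

# A failing call (status -1, as in the log above) raises:
try:
    check_call(-1)
except MXNetError as e:
    print(e)
```

Because every FFI call is wrapped this way, the Python frame that raises (`asnumpy`) is only where the asynchronous GPU error was first observed, not where the convolution actually failed; the real failure site is the cuDNN plan search in the native backtrace.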
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]