barry-jin opened a new issue #20979:
URL: https://github.com/apache/incubator-mxnet/issues/20979


   ## Description
   
   `test_convolution_large_c` is a flaky test that severely blocks the CD pipeline.
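
   For what it's worth, the failing test lives in `tests/python/gpu/test_operator_gpu.py`, so it can be run in isolation with the usual pytest node id, e.g. `pytest tests/python/gpu/test_operator_gpu.py::test_convolution_large_c` (this assumes a GPU/cuDNN-enabled build and an attached GPU).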
   
   Here is the error message: 
   
   ```
   [2022-03-24T04:07:35.481Z] =================================== FAILURES ===================================
   [2022-03-24T04:07:35.481Z] ___________________________ test_convolution_large_c ___________________________
   [2022-03-24T04:07:35.481Z] 
   [2022-03-24T04:07:35.481Z]     @pytest.mark.serial
   [2022-03-24T04:07:35.481Z]     def test_convolution_large_c():
   [2022-03-24T04:07:35.481Z]         problematic_c = 64 * 1024
   [2022-03-24T04:07:35.481Z]         # The convolution accumulates many values, so scale the input magnitude.
   [2022-03-24T04:07:35.481Z]         scale = 0.1
   [2022-03-24T04:07:35.481Z]         def test_1D_with_width(width, grad_req):
   [2022-03-24T04:07:35.481Z]             ctx_list = [{'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, width), 'type_dict': {'conv_data': np.float32}},
   [2022-03-24T04:07:35.481Z]                         {'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, width), 'type_dict': {'conv_data': np.float64}}]
   [2022-03-24T04:07:35.481Z]             sym = mx.sym.Convolution(layout='NCW', num_filter=8, kernel=(2,), name='conv')
   [2022-03-24T04:07:35.481Z]             check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
   [2022-03-24T04:07:35.481Z]     
   [2022-03-24T04:07:35.481Z]         def test_2D_with_width(width, grad_req):
   [2022-03-24T04:07:35.481Z]             ctx_list = [{'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, 2, width), 'type_dict': {'conv_data': np.float32}},
   [2022-03-24T04:07:35.481Z]                         {'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, 2, width), 'type_dict': {'conv_data': np.float64}}]
   [2022-03-24T04:07:35.481Z]             sym = mx.sym.Convolution(layout='NCHW', num_filter=4, kernel=(2,2), name='conv')
   [2022-03-24T04:07:35.481Z]             check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
   [2022-03-24T04:07:35.481Z]     
   [2022-03-24T04:07:35.481Z]         # Run with different data tensor shapes to run cudnnFind() multiple times.
   [2022-03-24T04:07:35.481Z]         # First, populate algo and op caches with models that always use cudnnFind() (req == 'write').
   [2022-03-24T04:07:35.481Z]         # Then run models that must avoid cached cudnnFind() results in some cases (req == 'add').
   [2022-03-24T04:07:35.481Z]         widths = [4, 16, 64]
   [2022-03-24T04:07:35.481Z]         for req in ['write', 'add']:
   [2022-03-24T04:07:35.481Z]             for width in widths:
   [2022-03-24T04:07:35.481Z]                 test_1D_with_width(width, req)
   [2022-03-24T04:07:35.481Z] >               test_2D_with_width(width, req)
   [2022-03-24T04:07:35.481Z] 
   [2022-03-24T04:07:35.481Z] tests/python/gpu/test_operator_gpu.py:688: 
   [2022-03-24T04:07:35.481Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
   [2022-03-24T04:07:35.481Z] tests/python/gpu/test_operator_gpu.py:679: in test_2D_with_width
   [2022-03-24T04:07:35.481Z]     check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
   [2022-03-24T04:07:35.481Z] python/mxnet/test_utils.py:1673: in check_consistency
   [2022-03-24T04:07:35.481Z]     assert_almost_equal(arr, gtarr, rtol=rt, atol=at, equal_nan=equal_nan)
   [2022-03-24T04:07:35.481Z] python/mxnet/test_utils.py:689: in assert_almost_equal
   [2022-03-24T04:07:35.481Z]     a = a.asnumpy()
   [2022-03-24T04:07:35.481Z] python/mxnet/ndarray/ndarray.py:2640: in asnumpy
   [2022-03-24T04:07:35.481Z]     check_call(_LIB.MXNDArraySyncCopyToCPU(
   [2022-03-24T04:07:35.481Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
   [2022-03-24T04:07:35.481Z] 
   [2022-03-24T04:07:35.481Z] ret = -1
   [2022-03-24T04:07:35.481Z] 
   [2022-03-24T04:07:35.481Z]     def check_call(ret):
   [2022-03-24T04:07:35.481Z]         """Check the return value of C API call.
   [2022-03-24T04:07:35.481Z]     
   [2022-03-24T04:07:35.481Z]         This function will raise an exception when an error occurs.
   [2022-03-24T04:07:35.481Z]         Wrap every API call with this function.
   [2022-03-24T04:07:35.481Z]     
   [2022-03-24T04:07:35.481Z]         Parameters
   [2022-03-24T04:07:35.481Z]         ----------
   [2022-03-24T04:07:35.481Z]         ret : int
   [2022-03-24T04:07:35.481Z]             return value from API calls.
   [2022-03-24T04:07:35.481Z]         """
   [2022-03-24T04:07:35.481Z]         if ret != 0:
   [2022-03-24T04:07:35.481Z] >           raise get_last_ffi_error()
   [2022-03-24T04:07:35.481Z] E           mxnet.base.MXNetError: Traceback (most recent call last):
   [2022-03-24T04:07:35.481Z] E             [bt] (14) /usr/lib64/libc.so.6(clone+0x6d) [0x7f96c59728dd]
   [2022-03-24T04:07:35.481Z] E             [bt] (13) /usr/lib64/libpthread.so.0(+0x7ea5) [0x7f96c6352ea5]
   [2022-03-24T04:07:35.481Z] E             [bt] (12) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x130006ef) [0x7f94897ce6ef]
   [2022-03-24T04:07:35.481Z] E             [bt] (11) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x32) [0x7f9478ac97e2]
   [2022-03-24T04:07:35.481Z] E             [bt] (10) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x36) [0x7f9478ad2086]
   [2022-03-24T04:07:35.481Z] E             [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x530) [0x7f9478ad1c80]
   [2022-03-24T04:07:35.481Z] E             [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*, mxnet::engine::CallbackOnStart, mxnet::engine::CallbackOnComplete)+0x5bd) [0x7f9478acae3d]
   [2022-03-24T04:07:35.481Z] E             [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnStart, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnStart, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnStart&&, mxnet::engine::CallbackOnComplete&&)+0xc1) [0x7f9478ac2ee1]
   [2022-03-24T04:07:35.481Z] E             [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f9478b40577]
   [2022-03-24T04:07:35.481Z] E             [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x259) [0x7f9478b3fe29]
   [2022-03-24T04:07:35.481Z] E             [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::ConvolutionGradCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x577) [0x7f94813a2d93]
   [2022-03-24T04:07:35.481Z] E             [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::cudnn::Exec<mxnet::op::cudnn::ConvDgrad, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::TBlob const&>(mxnet::OpContext const&, mxnet::op::cudnn::ConvDgrad::Param const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::TBlob const&)+0x1f7) [0x7f94813b9a75]
   [2022-03-24T04:07:35.481Z] E             [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::cudnn::ConvDgrad::Make(mxnet::OpContext const&, mxnet::op::cudnn::ConvParam const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::TBlob const&)+0x27c) [0x7f94792d05cc]
   [2022-03-24T04:07:35.481Z] E             [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::cudnn::SelectPlan(mxnet::OpContext const&, mxnet::op::cudnn::ConvParam const&, std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer>, unsigned long, std::function<std::string ()> const&, std::vector<long, std::allocator<long> > const&, std::vector<void*, std::allocator<void*> > const&, long, std::string const&)+0x426) [0x7f94792cf156]
   [2022-03-24T04:07:35.481Z] E             [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::cudnn_cxx::FindTopPlans(std::vector<std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer>, std::allocator<std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer> > >&&, unsigned long, cudnnContext*, std::unique_ptr<void*, mxnet::cudnn_cxx::DescriptorDestroyer> const&, std::function<std::optional<float> (float)>)+0x86a) [0x7f9478a0aaea]
   [2022-03-24T04:07:35.481Z] E             File "../src/common/cuda/cudnn_cxx.cc", line 232
   [2022-03-24T04:07:35.481Z] E           cuDNN: CUDNN_STATUS_INTERNAL_ERROR
   ```
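
   For anyone trying to reproduce this outside of CI, here is a minimal standalone sketch of the failing loop, reassembled from the test body in the traceback above. It assumes a GPU build of MXNet with cuDNN and an attached GPU (`mx.gpu(0)`); the shapes, scale, and `grad_req` values are taken verbatim from the test, and the only additions are the imports:
   
   ```
   # Minimal standalone repro sketch, assembled from test_convolution_large_c above.
   import mxnet as mx
   import numpy as np
   from mxnet.test_utils import check_consistency
   
   problematic_c = 64 * 1024
   scale = 0.1  # the convolution accumulates many values, so scale the input magnitude
   
   def test_1D_with_width(width, grad_req):
       # Compare float32 vs. float64 results for the same 1-D conv on the GPU.
       ctx_list = [{'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, width),
                    'type_dict': {'conv_data': np.float32}},
                   {'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, width),
                    'type_dict': {'conv_data': np.float64}}]
       sym = mx.sym.Convolution(layout='NCW', num_filter=8, kernel=(2,), name='conv')
       check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
   
   def test_2D_with_width(width, grad_req):
       # Same comparison for the 2-D conv; this is the call that fails in CI.
       ctx_list = [{'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, 2, width),
                    'type_dict': {'conv_data': np.float32}},
                   {'ctx': mx.gpu(0), 'conv_data': (1, problematic_c, 2, width),
                    'type_dict': {'conv_data': np.float64}}]
       sym = mx.sym.Convolution(layout='NCHW', num_filter=4, kernel=(2, 2), name='conv')
       check_consistency([sym, sym], ctx_list, grad_req=grad_req, scale=scale)
   
   # Vary the data shape so cudnnFind() runs multiple times: 'write' populates the
   # algo/op caches, then 'add' must avoid some cached results (mirrors the test loop).
   for req in ['write', 'add']:
       for width in [4, 16, 64]:
           test_1D_with_width(width, req)
           test_2D_with_width(width, req)
   ```
   
   Note that the Python-side error surfaces at `a.asnumpy()` inside `check_consistency`: MXNet's engine executes operators asynchronously, so the `CUDNN_STATUS_INTERNAL_ERROR` raised during `ConvDgrad` plan selection (`SelectPlan`/`FindTopPlans` in `src/common/cuda/cudnn_cxx.cc`) is only reported once the result is synced back to the CPU, not at the failing convolution call itself.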
   

