josephevans opened a new issue #19877: URL: https://github.com/apache/incubator-mxnet/issues/19877
## Description

On the v1.x pipeline, we are seeing the following test failures consistently in tests/python/unittest/test_gluon_data.py:

- test_multi_worker_dataloader_release_pool
- test_multi_worker_forked_data_loader

## Occurrences

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/7/pipeline/293/#step-776-log-1725
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/4/pipeline/296

Test failure logs:

```
[2021-02-10T01:39:46.205Z] test_gluon_data.test_multi_worker_dataloader_release_pool ... terminate called after throwing an instance of 'dmlc::Error'
[2021-02-10T01:39:46.205Z] what(): [01:39:41] src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) :
[2021-02-10T01:39:46.205Z] Stack trace:
[2021-02-10T01:39:46.205Z] [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f191fc63b61]
[2021-02-10T01:39:46.205Z] [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xd3) [0x7f192522fdf3]
[2021-02-10T01:39:46.205Z] [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f1925237348]
[2021-02-10T01:39:46.205Z] [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f1925232ce9]
[2021-02-10T01:39:46.205Z] [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5ade409) [0x7f1924b21409]
[2021-02-10T01:39:46.205Z] [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x61d3c50) [0x7f1925216c50]
[2021-02-10T01:39:46.205Z] [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f1925210440]
[2021-02-10T01:39:46.205Z] [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f192522c9d9]
[2021-02-10T01:39:46.205Z] [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f1925219f5b]
[2021-02-10T01:39:46.205Z] [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f1925216948]
[2021-02-10T01:39:46.461Z] /work/runtime_functions.sh: line 1008: 6 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_unittest.xml --verbose
```

```
[2021-02-09T22:11:59.574Z] ======================================================================
[2021-02-09T22:11:59.574Z] ERROR: test_gluon_data.test_multi_worker_forked_data_loader
[2021-02-09T22:11:59.574Z] ----------------------------------------------------------------------
[2021-02-09T22:11:59.574Z] Traceback (most recent call last):
[2021-02-09T22:11:59.574Z] File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
[2021-02-09T22:11:59.574Z] self.test(*self.arg)
[2021-02-09T22:11:59.574Z] File "/work/mxnet/tests/python/unittest/common.py", line 226, in test_new
[2021-02-09T22:11:59.574Z] mx.nd.waitall()
[2021-02-09T22:11:59.574Z] File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
[2021-02-09T22:11:59.574Z] check_call(_LIB.MXNDArrayWaitAll())
[2021-02-09T22:11:59.574Z] File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
[2021-02-09T22:11:59.574Z] raise get_last_ffi_error()
[2021-02-09T22:11:59.574Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-02-09T22:11:59.574Z] [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f0df6da1c48]
[2021-02-09T22:11:59.574Z] [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f0df6da525b]
[2021-02-09T22:11:59.574Z] [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f0df6db7e69]
[2021-02-09T22:11:59.574Z] [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f0df6d9b740]
[2021-02-09T22:11:59.574Z] [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x63dbf50) [0x7f0df6da1f50]
[2021-02-09T22:11:59.574Z] [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5cde545) [0x7f0df66a4545]
[2021-02-09T22:11:59.574Z] [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f0df6dbe0b9]
[2021-02-09T22:11:59.574Z] [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f0df6dc2718]
[2021-02-09T22:11:59.574Z] [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xcf) [0x7f0df6dbb27f]
[2021-02-09T22:11:59.574Z] [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f0df16c59e1]
[2021-02-09T22:11:59.574Z] File "src/storage/./cpu_shared_storage_manager.h", line 218
[2021-02-09T22:11:59.574Z] MXNetError: Check failed: count >= 0 (-1 vs. 0) :
```
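Both logs stop at the same check, `count >= 0`, inside `CPUSharedStorageManager::FreeImpl`, with observed values of -2 and -1. As a rough illustration only, and assuming `count` is a reference count on a shared-memory segment (the logs do not confirm this), the toy Python sketch below shows how an unbalanced release trips such an invariant. `SharedBlock`, `acquire`, and `release` are hypothetical names, not MXNet APIs, and this is not a diagnosis of the actual root cause.

```python
class SharedBlock:
    """Toy stand-in for a shared-memory segment with a reference count.

    Hypothetical illustration; not MXNet's implementation.
    """

    def __init__(self):
        self.ref_count = 1  # one owner at creation

    def acquire(self):
        # A second owner (for example, another process) takes a reference.
        self.ref_count += 1

    def release(self):
        # Decrement first, then validate, mimicking a "count >= 0" check.
        self.ref_count -= 1
        if self.ref_count < 0:
            raise RuntimeError(
                f"Check failed: count >= 0 ({self.ref_count} vs. 0)")
        return self.ref_count


block = SharedBlock()
block.release()      # balanced release: count goes 1 -> 0

message = None
try:
    block.release()  # extra release: count goes 0 -> -1
except RuntimeError as err:
    message = str(err)
    print(message)
```

In this sketch the second `release()` has no matching `acquire()`, so the count drops below zero and the check fires. Whether the failing tests hit a double free, a fork-related ownership issue, or something else would need to be confirmed against the MXNet source.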
