Hi

Managed to get a stack trace:

+ nosetests-2.7 --verbose
tests/python/unittest/test_module.py:test_forward_reshape
[WARNING] *** module-level seed is set: all tests running deterministically
***
[INFO] Setting module np/mx/python random seeds, use
MXNET_MODULE_SEED=812478194 to reproduce.
[WARNING] *** test-level seed set: all "@with_seed()" tests run
deterministically ***
test_module.test_forward_reshape ... [INFO] Setting test np/mx/python
random seeds, use MXNET_TEST_SEED=11 to reproduce.
[13:54:40] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 81920 bytes
with malloc directly
[13:54:40] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 576000 bytes
with malloc directly
/work/mxnet/python/mxnet/module/base_module.py:66: UserWarning: Data
provided by label_shapes don't match names specified by label_names ([] vs.
['softmax_label'])
  warnings.warn(msg)

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a)
[0x7f7fed68e8fa]
[bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x309619f)
[0x7f7ff029b19f]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f801aa774b0]
[bt] (3)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::NDArray::GetMKLDNNData()
const+0x637) [0x7f7fefde2a57]
[bt] (4)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::NDArray::GetMKLDNNDataReorder(mkldnn::memory::primitive_desc
const&) const+0x33c) [0x7f7fefde512c]
[bt] (5)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs
const&, mxnet::OpContext const&, std::vector<mxnet::NDArray,
std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType,
std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray,
std::allocator<mxnet::NDArray> > const&)+0x26e0) [0x7f7fed68b150]
[bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x28da1ce)
[0x7f7fefadf1ce]
[bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x29eaed7)
[0x7f7fefbefed7]
[bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x29eafc1)
[0x7f7fefbeffc1]
[bt] (9)
/work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext,
mxnet::engine::OprBlock*)+0xcb5) [0x7f7ff01b1f65]
ok
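
For anyone trying to reproduce this locally: the log above prints the seeds it
used, so (assuming the test harness picks these environment variables up, as
the log messages say) a minimal attempt would be:

    MXNET_MODULE_SEED=812478194 MXNET_TEST_SEED=11 nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape

No guarantee it crashes every run, though; the failure still looks
timing-dependent.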


On Thu, May 3, 2018 at 3:57 PM, Pedro Larroy <pedro.larroy.li...@gmail.com>
wrote:

> @Chris it seems Intel Inspector requires a purchase, right? Maybe one of us
> already owns a license and can run the test that fails intermittently:
>  test_module.py:test_forward_reshape
>
> On Thu, May 3, 2018 at 3:49 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>
>> It's very difficult to reproduce and non-deterministic. We were also running
>> without signal handlers in CI, so unfortunately there are no stack traces.
>>
>> Care to elaborate why valgrind doesn't work with Python?
>>
>>
>>
>> On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
>>
>>> Can we build it in CI? The segfault doesn't happen infrequently.
>>>
>>> On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
>>>
>>> > you can try Intel Inspector, which is like an enhanced version of
>>> valgrind
>>> > with a GUI and whatnot.
>>> >
>>> > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com> wrote:
>>> >
>>> > > Valgrind doesn't work with Python. Also, valgrind doesn't support some
>>> > > CPU instructions used by MXNet (I think some instructions related to
>>> > > the random number generator).
>>> > >
>>> > >
>>> > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
>>> > > > Have you tried running with valgrind to get some clues on the
>>> > > > root cause?
>>> > > >
>>> > > > Bhavin Thaker.
>>> > > >
>>> > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com> wrote:
>>> > > >
>>> > > >> It might also be possible that this isn't an MKLDNN bug.
>>> > > >> I just saw a similar memory error in a build without MKLDNN.
>>> > > >>
>>> > > >>
>>> > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
>>> > > >>
>>> > > >> Best,
>>> > > >> Da
>>> > > >>
>>> > > >> > On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com> wrote:
>>> > > >> > There might be a race condition that causes the memory error.
>>> > > >> > It might be caused by this PR:
>>> > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
>>> > > >> > This PR removes MKLDNN memory from NDArray.
>>> > > >> > However, I don't know why this causes a memory error. If someone
>>> > > >> > is using the memory, it should still be held by the shared pointer.
>>> > > >> > But I do see the memory errors increase after this PR was merged.
>>> > > >> >
>>> > > >> > Best,
>>> > > >> > Da
>>> > > >> >
>>> > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>>> > > >> >
>>> > > >> >     I couldn't reproduce locally with:
>>> > > >> >
>>> > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
>>> > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
>>> > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>>> > > >> >
>>> > > >> >
>>> > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>> > > >> >
>>> > > >> >     > Hi
>>> > > >> >     >
>>> > > >> >     > It seems master is not running anymore; there's a
>>> > > >> >     > segmentation fault using MKLDNN-CPU.
>>> > > >> >     >
>>> > > >> >     >
>>> > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
>>> > > >> >     >
>>> > > >> >     >
>>> > > >> >     > I see my PRs failing with a similar error.
>>> > > >> >     >
>>> > > >> >     > Pedro
>>> > > >> >     >
>>> > > >> >
>>> > > >> >
>>> > > >>
>>> > >
>>> >
>>>
>>
>>
>
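
To illustrate the shared-pointer point in Da's quoted message above: a
std::shared_ptr copy does keep the buffer alive, but a raw pointer obtained
from the managed object does not share ownership. A minimal C++ sketch of how
that can turn into a use-after-free under a race (hypothetical names, not the
actual NDArray/MKLDNN code):

    #include <memory>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for an array that owns an MKLDNN-style buffer.
    struct Buffer {
        std::vector<float> data;
    };

    struct Array {
        std::shared_ptr<Buffer> buf = std::make_shared<Buffer>();

        // Raw pointer into the managed buffer: valid only while buf is alive.
        float* GetRawData() { return buf->data.data(); }

        // Simulates dropping/replacing the buffer (e.g. on a reshape).
        void DropBuffer() { buf.reset(); }
    };

    int main() {
        Array arr;
        arr.buf->data.resize(1024);

        float* p = arr.GetRawData();                    // no ownership taken
        std::thread worker([&] { arr.DropBuffer(); });  // another thread frees it

        // If the worker wins the race, this is a use-after-free (segfault or
        // silent corruption). Holding a shared_ptr copy instead of the raw
        // pointer would have kept the buffer alive.
        p[0] = 1.0f;

        worker.join();
        return 0;
    }

So the shared pointer only protects callers that actually hold a copy of it;
any path that hands out raw pointers still has to guarantee the owner outlives
the use.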
