Re: segmentation fault in master using mkdlnn

Marco de Abreu Thu, 03 May 2018 11:49:48 -0700

Da, it seems like you have a problem with your internet connection, which is
causing the request to the keyserver to time out.
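
If the timeout is intermittent, retrying the key fetch a few times may get
past it. A minimal sketch, assuming a POSIX shell; the retry helper, the
attempt count and the delay are hypothetical and are not part of ci/build.py
or ubuntu_r.sh:

```shell
#!/bin/sh
# retry: hypothetical helper, not part of the MXNet CI scripts.
# Runs a command up to $1 times, sleeping $2 seconds between failed attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        echo "attempt $n failed, retrying in ${delay}s" >&2
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Example call, reusing the gpg invocation from the quoted log (left
# commented out so that sourcing this snippet does not touch the network):
# retry 5 10 gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
```

This only helps with transient failures; if the host cannot reach the
keyserver at all, the loop will simply give up after the last attempt.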


-Marco

On Thu, May 3, 2018 at 8:20 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Pedro and Da,
>
> I am not sure how to install mkldnn with cmake, but to reproduce with make
> you can do the following:
>
> make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 \
>   USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> export MXNET_TEST_SEED=11
> export MXNET_MODULE_SEED=812478194
> export MXNET_TEST_COUNT=10000
> nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
>
> I was able to reproduce on master, now trying on 1.2 branch.
>
> Anirudh
>
>
> On Thu, May 3, 2018 at 10:17 AM, Zheng, Da <dzz...@amazon.com> wrote:
>
> > Hello Pedro,
> >
> > I tried your instructions. It seems I can't run the Docker build on EC2
> > instances.
> > Where did you reproduce the error?
> >
> > Thanks,
> > Da
> >
> > + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> > gpg: directory `/root/.gnupg' created
> > gpg: new configuration file `/root/.gnupg/gpg.conf' created
> > gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
> > gpg: keyring `/root/.gnupg/secring.gpg' created
> > gpg: keyring `/root/.gnupg/pubring.gpg' created
> > gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> > gpg: keyserver timed out
> > gpg: keyserver receive failed: keyserver error
> > The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> > Traceback (most recent call last):
> >   File "ci/build.py", line 263, in <module>
> >     sys.exit(main())
> >   File "ci/build.py", line 197, in main
> >     build_docker(platform, docker_binary)
> >   File "ci/build.py", line 73, in build_docker
> >     check_call(cmd)
> >   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
> >     raise CalledProcessError(retcode, cmd)
> > subprocess.CalledProcessError: Command '['docker', 'build', '-f',
> > 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
> > '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
> >
> >
> > On 5/3/18, 8:01 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> >
> >     Hi Da
> >
> >     Reproduction instructions:
> >
> >     On the host:
> >
> >     Adjust core pattern:
> >
> >     $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
> >
> >
> >     Use the following patch:
> >
> >     ===============
> >
> >     diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> >     --- a/3rdparty/mkldnn
> >     +++ b/3rdparty/mkldnn
> >     @@ -1 +1 @@
> >     -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> >     +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> >     diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
> >     index 027e287..62649c9 100755
> >     --- a/ci/docker/runtime_functions.sh
> >     +++ b/ci/docker/runtime_functions.sh
> >     @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
> >          # https://github.com/apache/incubator-mxnet/issues/10026
> >          #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
> >          export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> >     -    nosetests-2.7 --verbose tests/python/unittest
> >     -    nosetests-2.7 --verbose tests/python/train
> >     -    nosetests-2.7 --verbose tests/python/quantization
> >     +    export MXNET_TEST_SEED=11
> >     +    export MXNET_MODULE_SEED=812478194
> >     +    pwd
> >     +    export MXNET_TEST_COUNT=10000
> >     +    ulimit -c unlimited
> >     +    ulimit -c
> >     +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
> >     +    #nosetests-2.7 --verbose tests/python/train
> >     +    #nosetests-2.7 --verbose tests/python/quantization
> >      }
> >
> >      unittest_ubuntu_python3_cpu() {
> >
> >
> >
> >     ==============
> >
> >     Make sure the repo is clean, then build and execute the test:
> >
> >     $ ci/docker/runtime_functions.sh clean_repo
> >
> >     $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh \
> >       build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu \
> >       /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> >
> >
> >     Once it crashes it will stop.
> >
> >     Then go into the container:
> >
> >
> >     $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
> >
> >     A core should be there.
> >
> >     You might need to install gdb as root by executing the previous command
> >     without uid so you can use apt-get.
> >
> >
> >
> >
> >     Good luck.
> >
> >
> >
> >
> >
> >
> >
> >     On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzz...@amazon.com> wrote:
> >
> >     > Thanks a lot for locating the error.
> >     > Could you tell me how you reproduced the error?
> >     >
> >     > On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> >     >
> >     >     Looks like a problem in mkl's same_shape.
> >     >
> >     >     The mkldnn::memory::desc &desc reference looks invalid.
> >     >
> >     >     (More stack frames follow...)
> >     >     (gdb) p desc
> >     >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
> >     >     (gdb) p dtype
> >     >     $2 = 0
> >     >     (gdb) p shape
> >     >     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
> >     >     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
> >     >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
> >     >     fields>}
> >     >     (gdb)
> >     >
> >     >
> >     >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzz...@amazon.com> wrote:
> >     >
> >     >     > There are a few problems with valgrind that make it not an ideal
> >     >     > tool for mxnet with the Python interface.
> >     >     >
> >     >     > First, valgrind generates a huge number of irrelevant messages, most
> >     >     > of them from Python itself.
> >     >     >
> >     >     > Second, valgrind can't emulate all CPU instructions. I remember that
> >     >     > when I ran valgrind with mxnet, valgrind exited with a strange error.
> >     >     > I later found that it was caused by an unsupported CPU instruction.
> >     >     >
> >     >     > Third, valgrind doesn't support multithreading well. As far as I know,
> >     >     > valgrind runs everything in a single thread even if the program uses
> >     >     > multi-threading. An error like this, which is likely caused by a race
> >     >     > condition, can't be caught by valgrind.
> >     >     >
> >     >     > I used to use Address Sanitizer for memory errors. This tool is much
> >     >     > faster and can work with multiple threads. However, it doesn't work
> >     >     > with Python for some reason.
> >     >     >
> >     >     > One thing we potentially can do is to use a memory checker for C++
> >     >     > unit tests. Not sure it'll cover all the memory errors we want.
> >     >     >
> >     >     > Best,
> >     >     > Da
> >     >     >
> >     >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> >     >     >
> >     >     >     It's very difficult to reproduce because it is non-deterministic.
> >     >     >     We were also running without signal handlers in CI, so
> >     >     >     unfortunately there are no stack traces.
> >     >     >
> >     >     >     Care to elaborate why valgrind doesn't work with Python?
> >     >     >
> >     >     >
> >     >     >
> >     >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
> >     >     >
> >     >     >     > Can we build it in CI? The segfault doesn't happen infrequently.
> >     >     >     >
> >     >     >     > On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
> >     >     >     >
> >     >     >     > > You can try Intel Inspector, which is like an enhanced version
> >     >     >     > > of valgrind with a GUI and whatnot.
> >     >     >     > >
> >     >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com> wrote:
> >     >     >     > >
> >     >     >     > > > valgrind doesn't work with Python. Also, valgrind doesn't
> >     >     >     > > > support some CPU instructions used by MXNet (I think some
> >     >     >     > > > instructions related to the random generator).
> >     >     >     > > >
> >     >     >     > > >
> >     >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
> >     >     >     > > > > Have you tried running with valgrind to get some clues on the
> >     >     >     > > > > root cause?
> >     >     >     > > > >
> >     >     >     > > > > Bhavin Thaker.
> >     >     >     > > > >
> >     >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com> wrote:
> >     >     >     > > > >
> >     >     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
> >     >     >     > > > >> I just saw a similar memory error without an MKLDNN build.
> >     >     >     > > > >>
> >     >     >     > > > >>
> >     >     >     > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> >     >     >     > > > >>
> >     >     >     > > > >> Best,
> >     >     >     > > > >> Da
> >     >     >     > > > >>
> >     >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com> wrote:
> >     >     >     > > > >> > There might be a race condition that causes the memory error.
> >     >     >     > > > >> > It might be caused by this PR:
> >     >     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
> >     >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
> >     >     >     > > > >> > However, I don't know why this causes a memory error. If
> >     >     >     > > > >> > someone is using the memory, it should still hold the memory
> >     >     >     > > > >> > with a shared pointer.
> >     >     >     > > > >> > But I do see the memory errors increase after this PR was merged.
> >     >     >     > > > >> >
> >     >     >     > > > >> > Best,
> >     >     >     > > > >> > Da
> >     >     >     > > > >> >
> >     >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> >     >     >     > > > >> >
> >     >     >     > > > >> >     I couldn't reproduce locally with:
> >     >     >     > > > >> >
> >     >     >     > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh \
> >     >     >     > > > >> >       build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu \
> >     >     >     > > > >> >       /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> >     >     >     > > > >> >
> >     >     >     > > > >> >
> >     >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> >     >     >     > > > >> >
> >     >     >     > > > >> >     > Hi
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     > Seems master is not running anymore; there's a segmentation
> >     >     >     > > > >> >     > fault using MKLDNN-CPU:
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     > I see my PRs failing with a similar error.
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >     > Pedro
> >     >     >     > > > >> >     >
> >     >     >     > > > >> >
> >     >     >     > > > >> >
> >     >     >     > > > >>
> >     >     >     > > >
> >     >     >     > >
> >     >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >
> >     >
> >
> >
> >
>
