Re: segmentation fault in master using mkdlnn

Anirudh Thu, 03 May 2018 11:20:32 -0700

Hi Pedro and Da,

I am not sure how to build with MKLDNN using cmake, but with make you can
reproduce it as follows:


make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 \
    USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export MXNET_MODULE_SEED=812478194
export MXNET_TEST_COUNT=10000
nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape

I was able to reproduce it on master and am now trying the 1.2 branch.
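
Because the crash is intermittent, a single run of the test may pass. A
minimal sketch of a retry helper, in the spirit of the `while ...; do echo
round; done` loop in Pedro's patch quoted below. The name `run_until_fail`
and its round limit are hypothetical, not part of the MXNet tree:

```shell
# Hypothetical helper: rerun a command until it fails or the round limit is hit.
# Usage: run_until_fail MAX_ROUNDS COMMAND [ARGS...]
# Returns 0 if a failure was observed, 1 if every round passed.
run_until_fail() {
    max_rounds=$1
    shift
    round=1
    while [ "$round" -le "$max_rounds" ]; do
        # Run the command exactly as given; stop at the first non-zero exit.
        if ! "$@"; then
            echo "failed on round $round"
            return 0
        fi
        round=$((round + 1))
    done
    echo "no failure in $max_rounds rounds"
    return 1
}
```

With the environment variables above exported, it could be called as
`run_until_fail 100 nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape`.
The round limit keeps the loop from running forever on a build that does
not crash.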

Anirudh


On Thu, May 3, 2018 at 10:17 AM, Zheng, Da <dzz...@amazon.com> wrote:

> Hello Pedro,
>
> I tried your instructions. It seems I can't run the docker in EC2
> instances.
> Where did you reproduce the error?
>
> Thanks,
> Da
>
> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> gpg: directory `/root/.gnupg' created
> gpg: new configuration file `/root/.gnupg/gpg.conf' created
> gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
> gpg: keyring `/root/.gnupg/secring.gpg' created
> gpg: keyring `/root/.gnupg/pubring.gpg' created
> gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> gpg: keyserver timed out
> gpg: keyserver receive failed: keyserver error
> The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> Traceback (most recent call last):
>   File "ci/build.py", line 263, in <module>
>     sys.exit(main())
>   File "ci/build.py", line 197, in main
>     build_docker(platform, docker_binary)
>   File "ci/build.py", line 73, in build_docker
>     check_call(cmd)
>   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['docker', 'build', '-f', 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000', '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
>
>
> On 5/3/18, 8:01 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>
>     Hi Da
>
>     Reproduction instructions:
>
>     On the host:
>
>     Adjust core pattern:
>
>     $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>
>
>     Use the following patch:
>
>     ===============
>
>     diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
>     --- a/3rdparty/mkldnn
>     +++ b/3rdparty/mkldnn
>     @@ -1 +1 @@
>     -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
>     +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
>     diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
>     index 027e287..62649c9 100755
>     --- a/ci/docker/runtime_functions.sh
>     +++ b/ci/docker/runtime_functions.sh
>     @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>          # https://github.com/apache/incubator-mxnet/issues/10026
>          #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>          export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
>     -    nosetests-2.7 --verbose tests/python/unittest
>     -    nosetests-2.7 --verbose tests/python/train
>     -    nosetests-2.7 --verbose tests/python/quantization
>     +    export MXNET_TEST_SEED=11
>     +    export MXNET_MODULE_SEED=812478194
>     +    pwd
>     +    export MXNET_TEST_COUNT=10000
>     +    ulimit -c unlimited
>     +    ulimit -c
>     +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
>     +    #nosetests-2.7 --verbose tests/python/train
>     +    #nosetests-2.7 --verbose tests/python/quantization
>      }
>
>      unittest_ubuntu_python3_cpu() {
>
>
>
>     ==============
>
>     Build and execute the test, make sure the repo is clean
>
>     $ ci/docker/runtime_functions.sh clean_repo
>
>     $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
>       ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
>
>     Once it crashes it will stop.
>
>     Then go in the container:
>
>
>     $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
>
>     A core should be there.
>
>     You might need to install gdb as root by executing the previous command
>     without uid so you can use apt-get.
>
>
>
>
>     Good luck.
>
>
>
>
>
>
>
>     On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzz...@amazon.com> wrote:
>
>     > Thanks a lot for locating the error.
>     > Could you tell me how you reproduced the error?
>     >
>     > On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>     >
>     >     Looks like a problem in mkl's same_shape
>     >
>     >     The reference mkldnn::memory::desc &desc looks invalid (it refers to address 0x10).
>     >
>     >     (More stack frames follow...)
>     >     (gdb) p desc
>     >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
>     >     (gdb) p dtype
>     >     $2 = 0
>     >     (gdb) p shape
>     >     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
>     >     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
>     >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data fields>}
>     >     (gdb)
>     >
>     >
>     >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzz...@amazon.com> wrote:
>     >
>     >     > There are a few problems with valgrind, which make it a poor
>     >     > tool for mxnet with the Python interface.
>     >     >
>     >     > First, valgrind generates a huge number of irrelevant messages,
>     >     > most of them from Python itself.
>     >     >
>     >     > Second, valgrind can't emulate all CPU instructions. I remember
>     >     > that when I ran valgrind with mxnet, it exited with a strange
>     >     > error. I later found that it was caused by an unsupported CPU
>     >     > instruction.
>     >     >
>     >     > Third, valgrind doesn't support multithreading well. As far as I
>     >     > know, valgrind runs everything in a single thread even if the
>     >     > program uses multi-threading. An error like this, which is likely
>     >     > caused by a race condition, can't be caught by valgrind.
>     >     >
>     >     > I used to use Address Sanitizer for memory errors. This tool is
>     >     > much faster and can work with multiple threads. However, it
>     >     > doesn't work with Python for some reason.
>     >     >
>     >     > One thing we can potentially do is use a memory checker for the
>     >     > C++ unit tests. I'm not sure it'll cover all the memory errors we
>     >     > want.
>     >     >
>     >     > Best,
>     >     > Da
>     >     >
>     >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>     >     >
>     >     >     It's very difficult to reproduce; it is non-deterministic. We
>     >     >     were also running without signal handlers in CI, so
>     >     >     unfortunately there are no stack traces.
>     >     >
>     >     >     Care to elaborate why valgrind doesn't work with Python?
>     >     >
>     >     >
>     >     >
>     >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
>     >     >
>     >     >     > Can we build it in CI? The segfault doesn't happen infrequently.
>     >     >     >
>     >     >     > On May 2, 2018, 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
>     >     >     >
>     >     >     > > You can try Intel Inspector, which is like an enhanced version
>     >     >     > > of valgrind with a GUI and whatnot.
>     >     >     > >
>     >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com> wrote:
>     >     >     > >
>     >     >     > > > valgrind doesn't work with Python. Also, valgrind doesn't
>     >     >     > > > support some CPU instructions used by MXNet (I think some
>     >     >     > > > instructions related to the random generator).
>     >     >     > > >
>     >     >     > > >
>     >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
>     >     >     > > > > Have you tried running with valgrind to get some clues on the
>     >     >     > > > > root cause?
>     >     >     > > > >
>     >     >     > > > > Bhavin Thaker.
>     >     >     > > > >
>     >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com> wrote:
>     >     >     > > > >
>     >     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
>     >     >     > > > >> I just saw a similar memory error without an MKLDNN build.
>     >     >     > > > >>
>     >     >     > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
>     >     >     > > > >>
>     >     >     > > > >> Best,
>     >     >     > > > >> Da
>     >     >     > > > >>
>     >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com> wrote:
>     >     >     > > > >> > There might be a race condition that causes the memory error.
>     >     >     > > > >> > It might be caused by this PR:
>     >     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
>     >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
>     >     >     > > > >> > However, I don't know why this causes a memory error. If someone
>     >     >     > > > >> > is using the memory, it should still hold the memory with a
>     >     >     > > > >> > shared pointer.
>     >     >     > > > >> > But I do see the memory errors increase after this PR was merged.
>     >     >     > > > >> >
>     >     >     > > > >> > Best,
>     >     >     > > > >> > Da
>     >     >     > > > >> >
>     >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>     >     >     > > > >> >
>     >     >     > > > >> >     I couldn't reproduce locally with:
>     >     >     > > > >> >
>     >     >     > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
>     >     >     > > > >> >       ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>     >     >     > > > >> >
>     >     >     > > > >> >
>     >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>     >     >     > > > >> >
>     >     >     > > > >> >     > Hi
>     >     >     > > > >> >     >
>     >     >     > > > >> >     > Seems master is not running anymore; there's a segmentation
>     >     >     > > > >> >     > fault using MKLDNN-CPU:
>     >     >     > > > >> >     >
>     >     >     > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
>     >     >     > > > >> >     >
>     >     >     > > > >> >     >
>     >     >     > > > >> >     > I see my PRs failing with a similar error.
>     >     >     > > > >> >     >
>     >     >     > > > >> >     > Pedro
>     >     >     > > > >> >     >
>     >     >     > > > >> >
>     >     >     > > > >> >
>     >     >     > > > >>
>     >     >     > > >
>     >     >     > >
>     >     >     >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>
>
