Hi Pedro and Da, I am not sure how to build with MKLDNN using cmake, but with make you can reproduce the failure as follows:
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export MXNET_MODULE_SEED=812478194
export MXNET_TEST_COUNT=10000
nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape

I was able to reproduce on master, now trying on 1.2 branch.

Anirudh

On Thu, May 3, 2018 at 10:17 AM, Zheng, Da <dzz...@amazon.com> wrote:

> Hello Pedro,
>
> I tried your instructions. It seems I can't run the docker in EC2 instances.
> Where did you reproduce the error?
>
> Thanks,
> Da
>
> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> gpg: directory `/root/.gnupg' created
> gpg: new configuration file `/root/.gnupg/gpg.conf' created
> gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
> gpg: keyring `/root/.gnupg/secring.gpg' created
> gpg: keyring `/root/.gnupg/pubring.gpg' created
> gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> gpg: keyserver timed out
> gpg: keyserver receive failed: keyserver error
> The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> Traceback (most recent call last):
>   File "ci/build.py", line 263, in <module>
>     sys.exit(main())
>   File "ci/build.py", line 197, in main
>     build_docker(platform, docker_binary)
>   File "ci/build.py", line 73, in build_docker
>     check_call(cmd)
>   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['docker', 'build', '-f', 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000', '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
>
> On 5/3/18, 8:01 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>
> > Hi Da
> >
> > Reproduction instructions:
> >
> > On the host:
> >
> > Adjust core pattern:
> >
> > $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
> >
> > Use the following patch:
> >
> > ===============
> >
> > diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> > --- a/3rdparty/mkldnn
> > +++ b/3rdparty/mkldnn
> > @@ -1 +1 @@
> > -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> > +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> > diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
> > index 027e287..62649c9 100755
> > --- a/ci/docker/runtime_functions.sh
> > +++ b/ci/docker/runtime_functions.sh
> > @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
> >      # https://github.com/apache/incubator-mxnet/issues/10026
> >      #export MXNET_MKLDNN_DEBUG=1 # Ignored if not present
> >      export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > -    nosetests-2.7 --verbose tests/python/unittest
> > -    nosetests-2.7 --verbose tests/python/train
> > -    nosetests-2.7 --verbose tests/python/quantization
> > +    export MXNET_TEST_SEED=11
> > +    export MXNET_MODULE_SEED=812478194
> > +    pwd
> > +    export MXNET_TEST_COUNT=10000
> > +    ulimit -c unlimited
> > +    ulimit -c
> > +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
> > +    #nosetests-2.7 --verbose tests/python/train
> > +    #nosetests-2.7 --verbose tests/python/quantization
> >  }
> >
> >  unittest_ubuntu_python3_cpu() {
> >
> > ==============
> >
> > Build and execute the test, make sure the repo is clean
> >
> > $ ci/docker/runtime_functions.sh clean_repo
> >
> > $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> >
> > Once it crashes it will stop.
> >
> > Then go in the container:
> >
> > $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
> >
> > A core should be there.
> >
> > you might need to install gdb as root by executing the previous command without uid so you can use apt-get.
> >
> > Good luck.
> >
> > On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzz...@amazon.com> wrote:
> >
> > > Thanks a lot for locating the error.
> > > Could you tell me how you reproduce the error?
> > >
> > > On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > >
> > > > Looks like a problem in mkl's same_shape
> > > >
> > > > the pointer to mkldnn::memory::desc &desc looks invalid.
> > > >
> > > > (More stack frames follow...)
> > > > (gdb) p desc
> > > > $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
> > > > (gdb) p dtype
> > > > $2 = 0
> > > > (gdb) p shape
> > > > $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
> > > >     data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data fields>}
> > > > (gdb)
> > > >
> > > > On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzz...@amazon.com> wrote:
> > > >
> > > > > There are a few problems with valgrind, which make it not an ideal tool for mxnet with the python interface.
> > > > >
> > > > > First, valgrind generates a huge number of irrelevant messages, most of them from Python itself.
> > > > >
> > > > > Second, valgrind can't emulate all CPU instructions. I remember that when I ran valgrind with mxnet, valgrind exited with a strange error. I later found that it was caused by an unsupported CPU instruction.
> > > > >
> > > > > Third, valgrind doesn't support multithreading well. As far as I know, valgrind runs everything in a single thread even if the program uses multi-threading. An error like this, which is likely caused by a race condition, can't be caught by valgrind.
> > > > >
> > > > > I used to use Address Sanitizer for memory errors. This tool is much faster and can work with multiple threads. However, it doesn't work with Python for some reason.
> > > > >
> > > > > One thing we potentially can do is to use a memory checker for C++ unit tests. Not sure it'll cover all memory errors we want.
> > > > >
> > > > > Best,
> > > > > Da
> > > > >
> > > > > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > > > >
> > > > > > It's very difficult to reproduce, non-deterministic. We were also running without signal handlers in CI so there are no stack traces unfortunately.
> > > > > >
> > > > > > Care to elaborate why valgrind doesn't work with Python?
> > > > > >
> > > > > > On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
> > > > > >
> > > > > > > can we build it in CI? segfault doesn't happen infrequently.
> > > > > > >
> > > > > > > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
> > > > > > >
> > > > > > > > you can try Intel Inspector, which is like an enhanced version of valgrind with a GUI and whatnot.
> > > > > > > >
> > > > > > > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > valgrind doesn't work with Python. also, valgrind doesn't support some CPU instructions used by MXNet (I think some instructions related to the random generator).
> > > > > > > > >
> > > > > > > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Have you tried running with valgrind to get some clues on the root-cause?
> > > > > > > > > >
> > > > > > > > > > Bhavin Thaker.
> > > > > > > > > >
> > > > > > > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > It might also be possible that this isn't an MKLDNN bug.
> > > > > > > > > > > I just saw a similar memory error without MKLDNN build.
> > > > > > > > > > >
> > > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Da
> > > > > > > > > > >
> > > > > > > > > > > On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > There might be a race condition that causes the memory error.
> > > > > > > > > > > > It might be caused by this PR:
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > > > > > > > > > This PR removes MKLDNN memory from NDArray.
> > > > > > > > > > > > However, I don't know why this causes a memory error. If someone is using the memory, it should still hold the memory with a shared pointer.
> > > > > > > > > > > > But I do see memory errors increase after this PR is merged.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Da
> > > > > > > > > > > >
> > > > > > > > > > > > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I couldn't reproduce locally with:
> > > > > > > > > > > > >
> > > > > > > > > > > > > ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Seems master is not running anymore, there's a segmentation fault using MKLDNN-CPU
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I see my PRs failing with a similar error.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Pedro
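
Note on the retry loop: the reproduction patch quoted in this thread keeps re-running one test with `while nosetests-2.7 ...; do echo round; done` until it fails. Below is a minimal Python sketch of the same idea, in case a scripted version is handy. The function name `run_until_failure` and the `max_rounds` parameter are hypothetical and not part of MXNet or its CI scripts; the sketch only assumes that the command exits non-zero (or is killed by a signal) when the test crashes.

```python
import subprocess


def run_until_failure(cmd, max_rounds=10000):
    """Re-run cmd until it exits non-zero.

    Returns (round_number, returncode) for the first failing round, or
    None if every round succeeded. On POSIX, subprocess reports a child
    killed by signal N as returncode -N (for example -11 for SIGSEGV).
    This helper is illustrative only; it is not part of MXNet.
    """
    for round_no in range(1, max_rounds + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return round_no, result.returncode
    return None
```

It could be called with the test command from the thread, for example run_until_failure(["nosetests-2.7", "--verbose", "tests/python/unittest/test_module.py:test_forward_reshape"]), after exporting the same environment variables as in the patch. A negative return code would then suggest a crash by signal rather than an ordinary test failure.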