Da, it looks like a problem with your internet connection: the gpg request to keyserver.ubuntu.com timed out, and that is what failed the Docker build.
-Marco

On Thu, May 3, 2018 at 8:20 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Pedro and Da,
>
> I am not sure how to build with mkldnn using cmake, but to reproduce with
> make you can do the following:
>
> make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> export MXNET_TEST_SEED=11
> export MXNET_MODULE_SEED=812478194
> export MXNET_TEST_COUNT=10000
> nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
>
> I was able to reproduce on master, now trying on 1.2 branch.
>
> Anirudh
>
> On Thu, May 3, 2018 at 10:17 AM, Zheng, Da <dzz...@amazon.com> wrote:
>
> > Hello Pedro,
> >
> > I tried your instructions. It seems I can't run the docker in EC2
> > instances.
> > Where did you reproduce the error?
> >
> > Thanks,
> > Da
> >
> > + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> > gpg: directory `/root/.gnupg' created
> > gpg: new configuration file `/root/.gnupg/gpg.conf' created
> > gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
> > gpg: keyring `/root/.gnupg/secring.gpg' created
> > gpg: keyring `/root/.gnupg/pubring.gpg' created
> > gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> > gpg: keyserver timed out
> > gpg: keyserver receive failed: keyserver error
> > The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> > Traceback (most recent call last):
> >   File "ci/build.py", line 263, in <module>
> >     sys.exit(main())
> >   File "ci/build.py", line 197, in main
> >     build_docker(platform, docker_binary)
> >   File "ci/build.py", line 73, in build_docker
> >     check_call(cmd)
> >   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
> >     raise CalledProcessError(retcode, cmd)
> > subprocess.CalledProcessError: Command '['docker', 'build', '-f',
> > 'docker/Dockerfile.build.ubuntu_cpu',
> > '--build-arg', 'USER_ID=1000', '-t', 'mxnet/build.ubuntu_cpu', 'docker']'
> > returned non-zero exit status 2
> >
> > On 5/3/18, 8:01 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> >
> > Hi Da
> >
> > Reproduction instructions:
> >
> > On the host:
> >
> > Adjust core pattern:
> >
> > $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
> >
> > Use the following patch:
> >
> > ===============
> >
> > diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> > --- a/3rdparty/mkldnn
> > +++ b/3rdparty/mkldnn
> > @@ -1 +1 @@
> > -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> > +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> > diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
> > index 027e287..62649c9 100755
> > --- a/ci/docker/runtime_functions.sh
> > +++ b/ci/docker/runtime_functions.sh
> > @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
> >      # https://github.com/apache/incubator-mxnet/issues/10026
> >      #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
> >      export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > -    nosetests-2.7 --verbose tests/python/unittest
> > -    nosetests-2.7 --verbose tests/python/train
> > -    nosetests-2.7 --verbose tests/python/quantization
> > +    export MXNET_TEST_SEED=11
> > +    export MXNET_MODULE_SEED=812478194
> > +    pwd
> > +    export MXNET_TEST_COUNT=10000
> > +    ulimit -c unlimited
> > +    ulimit -c
> > +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
> > +    #nosetests-2.7 --verbose tests/python/train
> > +    #nosetests-2.7 --verbose tests/python/quantization
> >  }
> >
> >  unittest_ubuntu_python3_cpu() {
> >
> > ==============
> >
> > Build and execute the test, make sure the repo is clean
> >
> > $ ci/docker/runtime_functions.sh clean_repo
> >
> > $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> > build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
> > /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> >
> > Once it crashes it will stop.
> >
> > Then go in the container:
> >
> > $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
> >
> > A core should be there.
> >
> > You might need to install gdb as root by executing the previous command
> > without uid so you can use apt-get.
> >
> > Good luck.
> >
> > On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzz...@amazon.com> wrote:
> >
> > > Thanks a lot for locating the error.
> > > Could you tell me how you reproduced the error?
> > >
> > > On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > >
> > > Looks like a problem in mkl's same_shape
> > >
> > > the pointer to mkldnn::memory::desc &desc looks invalid.
> > >
> > > (More stack frames follow...)
> > > (gdb) p desc
> > > $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
> > > (gdb) p dtype
> > > $2 = 0
> > > (gdb) p shape
> > > $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
> > > {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
> > > data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
> > > fields>}
> > > (gdb)
> > >
> > > On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzz...@amazon.com> wrote:
> > >
> > > > There are a few problems with valgrind, which make it not an ideal tool
> > > > for mxnet with python interface.
> > > >
> > > > First, valgrind generates a huge number of irrelevant messages, most of
> > > > them from Python itself.
> > > >
> > > > Second, valgrind can't emulate all CPU instructions. I remember that when
> > > > I ran valgrind with mxnet, valgrind exited with a strange error. I later
> > > > found that it was caused by an unsupported CPU instruction.
> > > >
> > > > Third, valgrind doesn't support multithreading well.
> > > > As far as I know, valgrind runs everything in a single thread even if
> > > > the program uses multi-threading. An error like this, which is likely
> > > > caused by a race condition, can't be caught by valgrind.
> > > >
> > > > I used to use Address Sanitizer for memory errors. This tool is much
> > > > faster and can work with multiple threads. However, it doesn't work with
> > > > Python for some reason.
> > > >
> > > > One thing we potentially can do is to use a memory checker for C++ unit
> > > > tests. Not sure it'll cover all memory errors we want.
> > > >
> > > > Best,
> > > > Da
> > > >
> > > > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > > >
> > > > It's very difficult to reproduce, non-deterministic. We were also running
> > > > without signal handlers in CI so there are no stack traces unfortunately.
> > > >
> > > > Care to elaborate why valgrind doesn't work with Python?
> > > >
> > > > On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
> > > >
> > > > > can we build it in CI? segfault doesn't happen infrequently.
> > > > >
> > > > > On May 2, 2018, 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
> > > > >
> > > > > > you can try Intel Inspector, which is like an enhanced version of
> > > > > > valgrind with a GUI and whatnot.
> > > > > >
> > > > > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com> wrote:
> > > > > >
> > > > > > > valgrind doesn't work with Python. also, valgrind doesn't support some
> > > > > > > CPU instructions used by MXNet (I think some instructions related to
> > > > > > > random generator).
> > > > > > >
> > > > > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
> > > > > > > > Have you tried running with valgrind to get some clues on the
> > > > > > > > root-cause?
> > > > > > > >
> > > > > > > > Bhavin Thaker.
> > > > > > > >
> > > > > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> It might also be possible that this isn't an MKLDNN bug.
> > > > > > > >> I just saw a similar memory error without MKLDNN build.
> > > > > > > >>
> > > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Da
> > > > > > > >>
> > > > > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com> wrote:
> > > > > > > >> > There might be a race condition that causes the memory error.
> > > > > > > >> > It might be caused by this PR:
> > > > > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > > > > >> > This PR removes MKLDNN memory from NDArray.
> > > > > > > >> > However, I don't know why this causes memory error. If someone is
> > > > > > > >> > using the memory, it should still hold the memory with shared pointer.
> > > > > > > >> > But I do see the memory error increase after this PR is merged.
> > > > > > > >> >
> > > > > > > >> > Best,
> > > > > > > >> > Da
> > > > > > > >> >
> > > > > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > I couldn't reproduce locally with:
> > > > > > > >> >
> > > > > > > >> > ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> > > > > > > >> > build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
> > > > > > > >> > /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> > > > > > > >> >
> > > > > > > >> > On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi
> > > > > > > >> > >
> > > > > > > >> > > Seems master is not running anymore, there's a segmentation fault
> > > > > > > >> > > using MKLDNN-CPU
> > > > > > > >> > >
> > > > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > > > > >> > >
> > > > > > > >> > > I see my PRs failing with a similar error.
> > > > > > > >> > >
> > > > > > > >> > > Pedro
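
For reference, the reproduction approach used in this thread (rerun test_forward_reshape with fixed seeds until it fails) could be sketched in Python roughly as below. This is only a sketch under stated assumptions: the function name run_until_failure and the max_rounds cap are illustrative and are not part of the MXNet CI scripts. The seed values and the nosetests command line are the ones quoted in the thread.

```python
import os
import subprocess

# Environment and command taken from the reproduction steps in the thread.
REPRO_ENV = {
    "MXNET_STORAGE_FALLBACK_LOG_VERBOSE": "0",
    "MXNET_TEST_SEED": "11",
    "MXNET_MODULE_SEED": "812478194",
    "MXNET_TEST_COUNT": "10000",
}
REPRO_CMD = [
    "nosetests-2.7",
    "--verbose",
    "tests/python/unittest/test_module.py:test_forward_reshape",
]


def run_until_failure(cmd, env_overrides=None, max_rounds=100):
    """Run cmd repeatedly until it exits non-zero (hypothetical helper).

    Returns (round_number, returncode) for the first failing round, or
    (max_rounds, 0) if every round passed. On POSIX, a process killed by a
    signal (for example a segfault) reports a negative return code.
    """
    env = dict(os.environ)
    env.update(env_overrides or {})
    for round_number in range(1, max_rounds + 1):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            return round_number, result.returncode
    return max_rounds, 0
```

Calling run_until_failure(REPRO_CMD, REPRO_ENV) from the MXNet source root would stop at the first failing round, much like the shell while loop in the patch, except that it is capped at max_rounds instead of looping forever.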