Hi Bhavin,

Good suggestion.
I tried 1), but I couldn't get a core dump inside the container, even with
"ulimit -c unlimited". It turns out that /proc/sys/kernel/core_pattern on
Ubuntu defaults to a pipe into /usr/share/apport/apport, which doesn't exist
inside the container. Changing it from outside the container with

    echo 'core.%h.%e.%t' > /proc/sys/kernel/core_pattern

solves the mystery. I now have a core dump, which I added to the ticket.
Trying to get to the bottom of the issue :-)

On Thu, May 3, 2018 at 4:02 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:

> Hi Pedro, All,
>
> 1) I would suggest that we run "ulimit -c unlimited" in every CI slave
> machine at startup to enable core dumps and get stack traces.
>
> 2) Valgrind on Python generates so much noise that extracting signal from
> it is painful, but it is still worth trying; look at the messages towards
> the end, when the crash happens. Valgrind on a one-liner Python program
> already generates noise, which demonstrates that Python itself is not
> Valgrind-clean.
>
> 3) If there are C++ APIs that trigger the same functionality as the
> current problematic use case, one could write a small program to
> reproduce the crash and then use Valgrind to find the culprit portion of
> the code quickly.
>
> Bhavin Thaker.
>
> On Thu, May 3, 2018 at 6:49 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
> wrote:
>
> > It's very difficult to reproduce, non-deterministic. We were also
> > running without signal handlers in CI, so there are no stack traces,
> > unfortunately.
> >
> > Care to elaborate on why Valgrind doesn't work with Python?
> >
> > On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
> >
> > > Can we build it in CI? The segfault doesn't happen infrequently.
> > >
> > > On May 2, 2018, 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com>
> > > wrote:
> > >
> > > > You can try Intel Inspector, which is like an enhanced version of
> > > > Valgrind, with a GUI and whatnot.
> > > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com>
> > > > wrote:
> > > >
> > > > > Valgrind doesn't work with Python. Also, Valgrind doesn't support
> > > > > some CPU instructions used by MXNet (I think some instructions
> > > > > related to the random generator).
> > > > >
> > > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker
> > > > > <bhavintha...@gmail.com> wrote:
> > > > >
> > > > > > Have you tried running with Valgrind to get some clues on the
> > > > > > root cause?
> > > > > >
> > > > > > Bhavin Thaker.
> > > > > >
> > > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > It might also be possible that this isn't an MKLDNN bug.
> > > > > > > I just saw a similar memory error without the MKLDNN build.
> > > > > > >
> > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > > > >
> > > > > > > Best,
> > > > > > > Da
> > > > > > >
> > > > > > > On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > There might be a race condition that causes the memory
> > > > > > > > error. It might be caused by this PR:
> > > > > > > > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > > > > > This PR removes MKLDNN memory from NDArray.
> > > > > > > > However, I don't know why this causes a memory error. If
> > > > > > > > someone is using the memory, it should still hold the memory
> > > > > > > > with a shared pointer.
> > > > > > > > But I do see the memory errors increase after this PR was
> > > > > > > > merged.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Da
> > > > > > > >
> > > > > > > > On 5/2/18, 12:26 PM, "Pedro Larroy"
> > > > > > > > <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I couldn't reproduce locally with:
> > > > > > > > >
> > > > > > > > >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> > > > > > > > >     build_ubuntu_cpu_mkldnn && ci/build.py --platform
> > > > > > > > >     ubuntu_cpu /work/runtime_functions.sh
> > > > > > > > >     unittest_ubuntu_python2_cpu
> > > > > > > > >
> > > > > > > > > On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy
> > > > > > > > > <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > Seems master is not running anymore; there's a
> > > > > > > > > > segmentation fault using MKLDNN-CPU:
> > > > > > > > > >
> > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > > > > > > >
> > > > > > > > > > I see my PRs failing with a similar error.
> > > > > > > > > >
> > > > > > > > > > Pedro
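P.S. For anyone hitting the same problem, here is a sketch of the core-dump
workaround described above. It assumes a Linux host running the tests inside
a Docker container; note that /proc/sys/kernel/core_pattern is a kernel-global
setting shared with containers, so the write must be done as root on the host:

```shell
#!/bin/sh
# Allow core files in the shell that will run the crashing tests.
ulimit -c unlimited

# Show the current limit; it should now print "unlimited".
ulimit -c

# On Ubuntu the default core_pattern pipes cores to apport, which does
# not exist inside most containers, so no core file is ever written.
cat /proc/sys/kernel/core_pattern

# Fix: write cores to plain files instead. Run this as root on the HOST
# (%h = hostname, %e = executable name, %t = timestamp):
#
#   echo 'core.%h.%e.%t' | sudo tee /proc/sys/kernel/core_pattern
```

After the crash, the core file lands in the crashing process's working
directory and can be opened with gdb against the binary to get a backtrace.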