OK, I see you answered below; it was not obvious at first.

You can upgrade the LLVM OpenMP.

On Sun, Dec 8, 2019 at 6:51 PM Lausen, Leonard <lau...@amazon.com.invalid>
wrote:

> Thanks Chris for the elaboration.
>
> > What the assert in question is trying to say is that mxnet code is
> calling
> > into omp library after a fork, but before the omp library’s atfork()
> > handler is called, so the omp library has not yet initialized a new team of
> > threads. This looks to be the case in one of the call stacks on that
> > issue. This is problematic for any openmp library which supports omp
> after
> > a fork, and may not be deterministic from build to build, since the order
> > of static init calls for a given module is undefined (i think mxnet is
> > initializing omp during static init, but this may not matter).
>
> I disagree. The assert fails without any forking. You can trigger it by
> running
> `python3 -c 'import mxnet'` a few times.
>
> As described in https://github.com/apache/incubator-mxnet/pull/17012 the
> previously failing assertion is still part of LLVM OpenMP 9.0 codebase. In
> particular you can compare line
> https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616
> to
>
> https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481
> where the latter is the line that currently fails in MXNet.
>
> I would like to reiterate, we are currently using a random and UNRELEASED
> LLVM
> OpenMP version from 2017. There is no evidence that the assertion failure
> is due
> to a bug in MXNet, but rather there is strong evidence that it is due to a
> bug
> in the previously used LLVM version. (See the latter part of this email
> regarding your suggestion about bug in MXNet fork handling).
>
> > It stands to reason that linking this or that library may affect the
> assert
> > occurring because it’s not known at what time one of the dependent
> > libraries initializes omp (thus causing it to hook its atfork handler),
> so
> > it is not surprising that mucking with dependencies may cause the assert
> to
> > occur or not occur.
>
> With respect to the Assert failure, there is no difference in "linking
> this or
> that library". We are only updating the version of LLVM OpenMP codebase.
> See
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Thus your veto on updating the shipped LLVM OpenMP lacks technical
> justification
> and is void. I'm requesting 3 committers to approve the PR to go ahead with
> the
> update.
>
> > So if mxnet is doing that, it is a bug and remains a problem regardless
> of
> > the omp library and probably should be fixed.  llvm omp happens to be
> nice
> > enough to tell you you’re doing something wrong, at least when built in
> > debug mode.
>
> Yes, we need to investigate if there is a bug. Let's do this based on
> https://github.com/apache/incubator-mxnet/issues/14979 which is a "real"
> crash
> that happens both with the old unreleased and the new 9.0 LLVM OpenMP
> versions.
>
> > Once this issue is resolved, we can discuss the library inclusion itself.
> > My objection is “fixing” what appears to be a bug by effectively
> > “commenting out the assert” which is what i stated in the very beginning.
>
> This is a reasonable approach, as long as you are willing to help fix it.
> As
> made evident by your comments, you have more experience with advanced
> OpenMP
> libraries. Thus let's work together on fixing the root cause, assuming it's
> indeed a bug in MXNet.
>
> If we run into any roadblock with finding a root cause in MXNet, the only
> alternative I see is to remove the 3rdparty/openmp library, as we can not
> rule
> out the possibility of an LLVM OpenMP bug either.
>
> > Is there another explanation for the call stack with the assert?  Can
> this
> > bug be ruled out?
>
> Yes. The other explanation is that it's due to a bug in the old LLVM OpenMP
> version (see above).
> The stack trace you refer to,
> https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-505162890,
> relies on MKL and is hard to reproduce.
>
> With respect to establishing and fixing the bug in MXNet, let's focus on
> the
> crash described in https://github.com/apache/incubator-mxnet/issues/14979
> which
> is happening with Intel OpenMP, old unreleased LLVM OpenMP and the current
> LLVM
> OpenMP 9.0 release. It involves forking and thus seems closely related to the
> first paragraph of your email, unlike the assertion failure, which happens
> without any forking.
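
For illustration, a minimal sketch (plain C++, not the actual #14979 reproducer and not MXNet code) of the general pattern at issue: the parent process initializes the OpenMP runtime, forks, and the child then re-enters an OpenMP parallel region. Whether this is safe depends entirely on the runtime's fork handling, which is why gomp, the old LLVM OpenMP snapshot and LLVM OpenMP 9.0 can behave differently on the same code path.

    // Hypothetical stand-alone sketch: initialize OpenMP in the parent,
    // fork, then use OpenMP again in the child.
    #include <omp.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <cstdio>

    int main() {
      #pragma omp parallel   // parent creates the runtime's thread team here
      { }

      pid_t pid = fork();    // only the forking thread exists in the child
      if (pid == 0) {
        // The child re-enters OpenMP; this is where the runtimes differ in
        // their fork support and where crashes like #14979 would surface.
        #pragma omp parallel
        { std::printf("child omp thread %d\n", omp_get_thread_num()); }
        _exit(0);
      }
      waitpid(pid, nullptr, 0);
      return 0;
    }

Building this once with g++ -fopenmp (libgomp) and once with clang++ -fopenmp (typically libomp) is one way to compare the runtimes' fork behavior in isolation from MXNet.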
>
> Best regards
> Leonard
>
> On Sun, 2019-12-08 at 07:58 -0800, Chris Olivier wrote:
> > btw the call stack I am referring to below is the one where I explained
> > this problem before and after I got a hostile response, I locked the
> issue.
> >
> > On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier <cjolivie...@gmail.com>
> wrote:
> >
> > > Again, here is what I suspect the bug is in mxnet:
> > >
> > > The way that advanced openmp libraries handle a fork is that they hook
> an
> > > atfork() callback in which, in the new process, it creates a new
> “team” of
> > > threads to use for its thread pool (since all of the thread handles in
> its
> > > data structure belong to the previous process). atfork() callback
> order is
> > > the order at which the callbacks are registered, which will tend to be
> the
> > > first call to the openmp library.  For this reason, the fork order will
> > > vary depending upon what other libraries might be linked in and whether
> > > they make omp calls before mxnet starts its static init.
> > >
> > > What the assert in question is trying to say is that mxnet code is
> calling
> > > into omp library after a fork, but before the omp library’s atfork()
> > > handler is called, so the omp library has not yet initialized a new
> > > team of threads. This looks to be the case in one of the call stacks on that
> > > threads.  This looks to be the case in one of the call stacks on that
> > > issue. This is problematic for any openmp library which supports omp
> after
> > > a fork, and may not be deterministic from build to build, since the
> order
> > > of static init calls for a given module is undefined (i think mxnet is
> > > initializing omp during static init, but this may not matter).
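
For illustration only (not MXNet or OpenMP runtime code), a minimal C++ sketch of the ordering hazard described above: child-side pthread_atfork() handlers run in registration order, so a handler registered before the OpenMP runtime's own handler must not call back into OpenMP, because the new thread team has not been rebuilt yet. The team_ready flag and handler names below are invented for the sketch.

    #include <pthread.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <cstdio>

    // Stand-in for the OpenMP runtime's internal thread-team state.
    static bool team_ready = true;

    // Handler a framework might register first (e.g. during static init).
    // If this called into OpenMP, it would run before runtime_child() below,
    // i.e. while team_ready is still false, which is the situation the assert reports.
    static void framework_child() {
      std::printf("framework child handler, team_ready=%d\n", static_cast<int>(team_ready));
    }

    // Handlers the OpenMP runtime registers when it is first initialized.
    static void runtime_prepare() { team_ready = false; }  // team invalidated by fork
    static void runtime_child()   { team_ready = true; }   // child rebuilds its team

    int main() {
      pthread_atfork(nullptr, nullptr, framework_child);        // registered first
      pthread_atfork(runtime_prepare, nullptr, runtime_child);  // registered later
      pid_t pid = fork();
      if (pid == 0) _exit(0);  // in the child, framework_child() ran before runtime_child()
      waitpid(pid, nullptr, 0);
      return 0;
    }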
> > >
> > > So if mxnet is doing that, it is a bug and remains a problem
> regardless of
> > > the omp library and probably should be fixed.  llvm omp happens to be
> nice
> > > enough to tell you you’re doing something wrong, at least when built in
> > > debug mode.
> > >
> > > Once this issue is resolved, we can discuss the library inclusion
> itself.
> > > My objection is “fixing” what appears to be a bug by effectively
> > > “commenting out the assert” which is what i stated in the very
> beginning.
> > >
> > > It stands to reason that linking this or that library may affect the
> > > assert occurring because it’s not known at what time one of the
> dependent
> > > libraries initializes omp (thus causing it to hook its atfork
> handler), so
> > > it is not surprising that mucking with dependencies may cause the
> assert to
> > > occur or not occur.
> > >
> > > Is there another explanation for the call stack with the assert?  Can
> this
> > > bug be ruled out?
> > >
> > >
> > > Here is an example of the atfork team concept with libgomp as well.
> > > Probably you can check the current libgomp code itself but this
> explains
> > > the code:
> > > https://patchwork.ozlabs.org/patch/319827/
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard
> <lau...@amazon.com.invalid>
> > > wrote:
> > >
> > > > Thanks Pedro and Chris for your responses.
> > > >
> > > > After further investigation I find:
> > > >
> > > > 1) I don't think
> https://github.com/apache/incubator-mxnet/issues/14979
> > > > is
> > > > caused by any incompatibility between gomp and llvm / intel omp.
> Rather
> > > > it's
> > > > simply a problem of llvm / intel omp. See my comment to the issue
> for the
> > > > methodology to arrive at this claim.
> > > >
> > > > 2) Regarding the assertion failure when compiling with (llvm)
> > > > 3rdparty/openmp,
> > > > it can be fixed by updating the now 2-year-old llvm openmp code
> to the
> > > > newest released version. I went ahead and opened a PR
> > > > https://github.com/apache/incubator-mxnet/pull/17012
> > > >
> > > > Based on the investigation described in 1), I think Chris is right
> that
> > > > the
> > > > assertion failure is not due to some interaction between gomp and
> llvm
> > > > omp.
> > > > However, I'm not sure about Chris's suggestion that the assertion
> failure
> > > > is due
> > > > to a bug in MXNet. In fact, the failure goes away when updating the
> llvm
> > > > openmp
> > > > code. So I think it's just due to a bug in the 2-year-old code.
> > > >
> > > > @Chris, I think updating 3rdparty/openmp to fix the assertion issue
> is not
> > > > contentious. Thus let's do it via lazy consensus (72 hours) or just
> > > > approve the
> > > > PR and merge it.
> > > >
> > > > Please also take a look at my comment at #14979 and let everyone
> know if
> > > > you see
> > > > any option to fix the bug while keeping 3rdparty/openmp. As this bug
> > > > affects an
> > > > important use-case, I believe we need to remove 3rdparty/openmp from
> the
> > > > CMake
> > > > build as long as we don't find a solution for making #14979 work with
> > > > 3rdparty/openmp.
> > > >
> > > > In fact, removing 3rdparty/openmp will then match the current
> Makefile
> > > > setup
> > > > that according to my understanding is used to build the nightly
> releases
> > > > used by
> > > > the majority of developers. I.e., most users actually don't use the
> CMake
> > > > build
> > > > with 3rdparty/openmp. You can consider rescinding your veto on
> removing
> > > > 3rdparty/openmp after reading through the evidence in that issue. If
> you
> > > > don't
> > > > provide any evidence for why the methodology/conclusion in #14979 is
> > > > flawed, I
> > > > will assume your previous veto is void based on the Apache voting rules
> as it
> > > > lacks
> > > > technical justification and in any case was motivated by the
> assertion
> > > > issue,
> > > > which, I agree with you, is likely not due to gomp / omp interaction.
> > > >
> > > > Thank you
> > > > Leonard
> > > >
> > > >
> > > > On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > > > > Stop disseminating false information:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/14979
> > > > >
> > > > >
> > > > > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier <
> cjolivie...@gmail.com>
> > > > wrote:
> > > > > > -1
> > > > > >
> > > > > > mkldnn removed omp5 for licensing issues
> > > > > > no bugs have actually been traced to the use of llvm openmp.
> only an
> > > > assert
> > > > > > caused by an actual bug in mxnet code. there are suitable
> workarounds.
> > > > > >
> > > > > > over time llvm omp has simply been used as a “catch all” for
> random
> > > > > > problems that aren’t related at all (such as getenv race
> condition in
> > > > an
> > > > > > atfork call that isn’t even part of an omp parallel region).
> > > > > >
> > > > > > proposal is now and has always been roughly equivalent to the
> idea of
> > > > > > “comment out an assert rather than fix the bug it’s reporting”.
> > > > > >
> > > > > > Up until very recently, Makefile version of mxnet used libomp5
> for
> > > > YEARS
> > > > > > and not libgomp, with no issue reported (omp not built in debug
> > > > mode), so
> > > > > > the equivalent configuration from CMake mysteriously causing
> > > > > > myriads of problems has questionable merit and smells more like a hubris
> > > > situation.
> > > > > > I use tensorflow as well and it links to libomp5 rather than
> libgomp.
> > > > > >
> > > > > > if the assert problem is really a problem, the bug being reported
> > > > would be
> > > > > > prioritized and fixed. it should be fixed regardless. all the
> time
> > > > spent by
> > > > > > some CI people trying to remove this could have simply fixed the
> > > > actual bug
> > > > > > in a small fraction of the time.
> > > > > >
> > > > > >
> > > > > > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard
> > > > <lau...@amazon.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > I think it's reasonable to assume that the Intel MKLDNN team
> is an
> > > > > > > "authorative"
> > > > > > > source about the issue of compilation with OpenMP and the
> OpenMP
> > > > runtime
> > > > > > > library
> > > > > > > related issues. Thus I suggest we follow the recommendation of
> Intel
> > > > > > > MKLDNN team
> > > > > > > within the MXNet project.
> > > > > > >
> > > > > > > Looking through the Intel MKLDNN documentation, I find [1]:
> > > > > > >
> > > > > > > > DNNL uses OpenMP runtime library provided by the compiler.
> > > > > > >
> > > > > > > as well as
> > > > > > >
> > > > > > > > it's important to ensure that only one OpenMP runtime is used
> > > > > > throughout
> > > > > > > the
> > > > > > > > application. Having more than one OpenMP runtime linked to an
> > > > > > executable
> > > > > > > may
> > > > > > > > lead to undefined behavior including incorrect results or
> crashes.
> > > > > > >
> > > > > > > To keep our project maintainable and error free, I thus
> suggest we
> > > > follow
> > > > > > > DNNL
> > > > > > > and use the OpenMP runtime library provided by the compiler.
> > > > > > > We have limited resources, and finding the root cause for any bugs
> > > > > > > resulting from linking multiple OpenMP libraries as currently done is,
> > > > > > > in my opinion, not a good use of time. We know it's due to undefined
> > > > > > > behavior, and we know it's best practice to use the OpenMP runtime
> > > > > > > library provided by the compiler. So let's just do that.
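
As a complement to the ldd checks further down in this thread, a minimal Linux/glibc-only sketch (not MXNet code; dl_iterate_phdr is a glibc extension) that lists every OpenMP runtime loaded into the running process. Unlike ldd, it also catches a second runtime pulled in indirectly at load time, which is the "more than one OpenMP runtime" situation the DNNL documentation warns about.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE 1   // needed for dl_iterate_phdr on some setups
    #endif
    #include <link.h>
    #include <cstdio>
    #include <cstring>

    // Called once per loaded shared object; print it if it looks like an
    // OpenMP runtime (GNU, LLVM or Intel).
    static int report_omp(struct dl_phdr_info *info, size_t, void *) {
      const char *name = info->dlpi_name;
      if (name && (std::strstr(name, "libgomp") ||
                   std::strstr(name, "libomp")  ||
                   std::strstr(name, "libiomp"))) {
        std::printf("OpenMP runtime loaded: %s\n", name);
      }
      return 0;  // keep iterating over all loaded objects
    }

    int main() {
      dl_iterate_phdr(report_omp, nullptr);
      return 0;
    }

Built with -fopenmp and linked against the libraries under test (e.g. libmxnet.so), two or more lines of output means two OpenMP runtimes are in the same process.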
> > > > > > >
> > > > > > > I think given that MKL-DNN has also adopted the "OpenMP runtime
> > > > library
> > > > > > > provided
> > > > > > > by the compiler" approach, this issue is not contentious
> anymore and
> > > > > > > qualifies
> > > > > > > for lazy consensus.
> > > > > > >
> > > > > > > Thus if there is no objection within 72 hours (lazy consensus),
> > > > let's
> > > > > > drop
> > > > > > > bundled LLVM OpenMP from master [2]. If we find any issues due
> to
> > > > > > > dropping the
> > > > > > > bundled LLVM OpenMP, we can always add it back prior to the
> next
> > > > release.
> > > > > > > Best regards
> > > > > > > Leonard
> > > > > > >
> > > > > > > [1]:
> > > > > > >
> > > > > > >
> > > >
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > > > > > > (This is the updated reference from Anton's previous comment,
> based
> > > > on
> > > > > > the
> > > > > > > changes in MKLDNN done in the meantime
> > > > > > >
> > > >
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > > > > > > )
> > > > > > > [2]: Similar to https://github.com/apache/incubator-mxnet/pull/12160
> > > > > > >
> > > > > > >
> > > > > > > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > > > > > > I will try to stay on the sidelines for now since previous
> > > > > > conversations
> > > > > > > > about OMP have not been productive here and I have spent way
> too
> > > > much
> > > > > > > time
> > > > > > > > on this already; I'm not the first one giving up on trying to
> > > > help with
> > > > > > > > this topic.
> > > > > > > >
> > > > > > > > I would be glad if you guys can work together and find a
> > > > solution. I
> > > > > > will
> > > > > > > > just put my understanding of the big picture hoping that it
> helps
> > > > move
> > > > > > it
> > > > > > > > forward.
> > > > > > > >
> > > > > > > >
> > > > > > > > Recently the intel omp library which seemed to have the best
> > > > > > performance
> > > > > > > of
> > > > > > > > the 3 was removed from MKL.
> > > > > > > >
> > > > > > > > - There's 3 libraries in play, GNU Omp which is shipped with
> gcc
> > > > > > (gomp),
> > > > > > > > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL,
> > > > which is
> > > > > > > > recently removed (iomp)
> > > > > > > >
> > > > > > > > - IOMP seems to have the best performance, there's stability
> > > > issues
> > > > > > > > producing crashes sometimes but the impact seems relatively
> small
> > > > for
> > > > > > > users
> > > > > > > > and developers. In general, linking with a different OMP version
> > > > > > > > than the one shipped with the compiler is known to cause stability
> > > > > > > > issues, but it's done anyway.
> > > > > > > >
> > > > > > > > - LLVM-OMP is used when building with CMake, not used in the PIP
> > > > releases
> > > > > > or
> > > > > > > > when building with Make. Has stability issues, hangs when
> running
> > > > in
> > > > > > > debug
> > > > > > > > mode during test execution and produces tons of assertions in
> > > > debug
> > > > > > mode.
> > > > > > > > Might have some small performance gains but there is no
> clear cut
> > > > data
> > > > > > > that
> > > > > > > > showcases significant performance gains.
> > > > > > > >
> > > > > > > > - GOMP is the version shipped with GCC and the PIP wheels
> without
> > > > MKL,
> > > > > > > has
> > > > > > > > no stability problems.
> > > > > > > >
> > > > > > > > As a ballpark, IOMP might give 10% performance improvement
> in some
> > > > > > cases.
> > > > > > > > We need to document well how users should tune and configure
> > > > MXNet when
> > > > > > > > using OMP.
> > > > > > > >
> > > > > > > > As a developer, the safest bet is to use GOMP to be able to
> debug
> > > > and
> > > > > > > > develop without issues. As a user of CPU inference /
> training you
> > > > want
> > > > > > to
> > > > > > > > run MKL so depends on how the Intel guys want to do things.
> My
> > > > > > preference
> > > > > > > > as an engineer is always stability > speed.
> > > > > > > >
> > > > > > > > Related tickets:
> > > > > > > >
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/16891
> > > > > > > >
> > > > > > > >
> > > >
> https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/11417
> > > > > > > >
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/15690
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard
> > > > > > > <lau...@amazon.com.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Is this related to
> > > > > > > https://github.com/apache/incubator-mxnet/issues/10856?
> > > > > > > > > I unlocked that Github issue based on the Apache Code of
> Conduct
> > > > > > > > >
> > > >
> https://www.apache.org/foundation/policies/conduct#specific-guidelines
> > > > > > > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6
> > > > (upstream_master)+$
> > > > > > ldd
> > > > > > > > > > build/libmxnet.so| grep -i openmp
> > > > > > > > > >         libomp.so =>
> > > > > > > > > >
> > > > /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > > > > > > (0x00007fde0991d000)
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6
> > > > (upstream_master)+$
> > > > > > > python
> > > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > > > --use-rec
> > > > > > > > > > --batch-size 256 --dtype float32 --num-data-workers 40
> --mode
> > > > > > hybrid
> > > > > > > > > > --model resnet50_v2 --use-pretrained --kvstore local
> > > > > > --log-interval 1
> > > > > > > > > > --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > > > > > > ~/data/val-passthrough.idx
> > > > > > > > > > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > > > > > > > > > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32,
> > > > > > > dtype='float32',
> > > > > > > > > > kvstore='local', last_gamma=False, log_interval=1,
> > > > > > > logging_dir='logs',
> > > > > > > > > > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60',
> lr_mode='step',
> > > > > > > > > > lr_poly_power=2, mode='hybrid', model='resnet50_v2',
> > > > momentum=0.9,
> > > > > > > > > > num_epochs=3, num_gpus=0, num_workers=40,
> > > > > > > > > > rec_val='/home/piotr/data/val-passthrough.rec',
> > > > > > > > > > rec_val_idx='/home/piotr/data/val-passthrough.idx',
> > > > > > > save_dir='params',
> > > > > > > > > > save_frequency=0, top_k=0, use_pretrained=True,
> use_rec=True,
> > > > > > > > > use_se=False,
> > > > > > > > > > warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > > > > > > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178:
> > > > > > > ImageRecordIOParser2:
> > > > > > > > > > /home/piotr/data/val-passthrough.rec, use 36 threads for
> > > > decoding..
> > > > > > > > > > INFO:root:Batch [0]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time
> > > > 43.150922
> > > > > > > > > > INFO:root:Batch [1]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time
> > > > 37.971927
> > > > > > > > > > INFO:root:Batch [2]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time
> > > > 35.755363
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp
> > > > > > > > > (upstream_master)+$
> > > > > > > > > > git st
> > > > > > > > > > On branch upstream_master
> > > > > > > > > > Your branch is up to date with 'origin/upstream_master'.
> > > > > > > > > >
> > > > > > > > > > Changes not staged for commit:
> > > > > > > > > >   (use "git add/rm <file>..." to update what will be
> > > > committed)
> > > > > > > > > >   (use "git checkout -- <file>..." to discard changes in
> > > > working
> > > > > > > > > directory)
> > > > > > > > > >         deleted:    3rdparty/openmp
> > > > > > > > > >
> > > > > > > > > > no changes added to commit (use "git add" and/or "git
> commit
> > > > -a")
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp
> > > > > > > > > (upstream_master)+$
> > > > > > > > > > ldd build/libmxnet.so | grep -i omp
> > > > > > > > > >         libgomp.so.1 =>
> /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > > > > > > (0x00007f941241c000)
> > > > > > > > > >
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp
> > > > > > > > > (upstream_master)+$
> > > > > > > > > > python
> > > > > > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > > > > > > --use-rec --batch-size 256 --dtype float32
> --num-data-workers
> > > > 40
> > > > > > > --mode
> > > > > > > > > > hybrid --model resnet50_v2 --use-pretrained --kvstore
> local
> > > > > > > > > --log-interval
> > > > > > > > > > 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > > > > > > ~/data/val-passthrough.idx
> > > > > > > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time
> > > > 1.735117
> > > > > > > > > > INFO:root:Batch [16]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time
> > > > 1.785760
> > > > > > > > > > INFO:root:Batch [17]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time
> > > > 1.729033
> > > >
>
