Hi Leonard.

Are you saying that you have updated this library and the problems desribed
in the related tickets are no longer present?

P.

On Sunday, December 8, 2019, Lausen, Leonard <lau...@amazon.com.invalid>
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier <cjolivie...@gmail.com>
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard
<lau...@amazon.com.invalid>
>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > > source about the issue of compilation with OpenMP and the OpenMP
runtime
>> > > library
>> > > related issues. Thus I suggest we follow the recommendation of Intel
>> > > MKLDNN team
>> > > within the MXNet project.
>> > >
>> > > Looking through the Intel MKLDNN documentation, I find [1]:
>> > >
>> > > > DNNL uses OpenMP runtime library provided by the compiler.
>> > >
>> > > as well as
>> > >
>> > > > it's important to ensure that only one OpenMP runtime is used
>> > throughout
>> > > the
>> > > > application. Having more than one OpenMP runtime linked to an
>> > executable
>> > > may
>> > > > lead to undefined behavior including incorrect results or crashes.
>> > >
>> > > To keep our project maintainable and error free, I thus suggest we
follow
>> > > DNNL
>> > > and use the OpenMP runtime library provided by the compiler.
>> > > We have limited ressources and finding the root cause for any bugs
>> > > resulting
>> > > from linking multiple OpenMP libraries as currently done is, in my
>> > > opinion. not
>> > > a good use of time. We know it's due to undefined behavior and we
know
>> > > it's best
>> > > practice to use OpenMP runtime library provided by the compiler. So
let's
>> > > just
>> > > do that.
>> > >
>> > > I think given that MKL-DNN has also adopted the "OpenMP runtime
library
>> > > provided
>> > > by the compiler" approach, this issue is not contentious anymore and
>> > > qualifies
>> > > for lazy consensus.
>> > >
>> > > Thus if there is no objection within 72 hours (lazy consensus), let's
>> > drop
>> > > bundled LLVM OpenMP from master [2]. If we find any issues due to
>> > > droppeing the
>> > > bundled LLVM OpenMP, we can always add it back prior to the next
release.
>> > >
>> > > Best regards
>> > > Leonard
>> > >
>> > > [1]:
>> > >
>> > >
>> >
https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
>> > > (This is the updated reference from Anton's previous comment, based
on
>> > the
>> > > changes in MKLDNN done in the meantime
>> > >
>> >
https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
>> > > )
>> > > [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
>> > >
>> > >
>> > > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
>> > > > I will try to stay on the sidelines for now since previous
>> > conversations
>> > > > about OMP have not been productive here and I have spent way too
much
>> > > time
>> > > > on this already, I'm not the first one giving up on trying to help
with
>> > > > this topic.
>> > > >
>> > > > I would be glad if you guys can work together and find a solution.
I
>> > will
>> > > > just put my understanding of the big picture hoping that it helps
move
>> > it
>> > > > forward.
>> > > >
>> > > >
>> > > > Recently the intel omp library which seemed to have the best
>> > performance
>> > > of
>> > > > the 3 was removed from MKL.
>> > > >
>> > > > - There's 3 libraries in play, GNU Omp which is shipped with gcc
>> > (gomp),
>> > > > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL,
which is
>> > > > recently removed (iomp)
>> > > >
>> > > > - IOMP seems to have the best performance, there's stability issues
>> > > > producing crashes sometimes but the impact seems relatively small
for
>> > > users
>> > > > and developers. In general seems linking with a different OMP
version
>> > > that
>> > > > the one shipped with the compiler is known to cause stability
issues
>> > but
>> > > > it's done anyway.
>> > > >
>> > > > - LLVM-OMP used when building with CMake, not used in the PIP
releases
>> > or
>> > > > when building with Make. Has stability issues, hangs when running
in
>> > > debug
>> > > > mode during test execution and produces tons of assertions in debug
>> > mode.
>> > > > Might have some small performance gains but there is no clear cut
data
>> > > that
>> > > > showcases significant performance gains.
>> > > >
>> > > > - GOMP is the version shipped with GCC and the PIP wheels without
MKL,
>> > > has
>> > > > no stability problems.
>> > > >
>> > > > As a ballpark, IOMP might give 10% performance improvement in some
>> > cases.
>> > > > We need to document well how users should tune and configure MXNet
when
>> > > > using OMP.
>> > > >
>> > > > As a developer, the safest bet is to use GOMP to be able to debug
and
>> > > > develop without issues. As a user of CPU inference / training you
want
>> > to
>> > > > run MKL so depends on how the Intel guys want to do things. My
>> > preference
>> > > > as an engineer is always stability > speed.
>> > > >
>> > > > Related tickets:
>> > > >
>> > > > https://github.com/apache/incubator-mxnet/issues/16891
>> > > >
>> > > >
>> >
https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
>> > > >
>> > > > https://github.com/apache/incubator-mxnet/issues/11417
>> > > >
>> > > > https://github.com/apache/incubator-mxnet/issues/15690
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard
>> > > <lau...@amazon.com.invalid>
>> > > > wrote:
>> > > >
>> > > > > Is this related to
>> > > https://github.com/apache/incubator-mxnet/issues/10856?
>> > > > > I unlocked that Github issue based on the Apache Code of Conduct
>> > > > >
>> > https://www.apache.org/foundation/policies/conduct#specific-guidelines
>> > > > >
>> > > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
>> > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6
(upstream_master)+$
>> > ldd
>> > > > > > build/libmxnet.so| grep -i openmp
>> > > > > >         libomp.so =>
>> > > > > >
/home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
>> > > > > > (0x00007fde0991d000)
>> > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6
(upstream_master)+$
>> > > python
>> > > > > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
>> > > --use-rec
>> > > > > > --batch-size 256 --dtype float32 --num-data-workers 40 --mode
>> > hybrid
>> > > > > > --model resnet50_v2 --use-pretrained --kvstore local
>> > --log-interval 1
>> > > > > > --rec-val ~/data/val-passthrough.rec --rec-val-idx
>> > > > > > ~/data/val-passthrough.idx
>> > > > > > INFO:root:Namespace(batch_norm=False, batch_size=256,
>> > > > > > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32,
>> > > dtype='float32',
>> > > > > > kvstore='local', last_gamma=False, log_interval=1,
>> > > logging_dir='logs',
>> > > > > > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
>> > > > > > lr_poly_power=2, mode='hybrid', model='resnet50_v2',
momentum=0.9,
>> > > > > > num_epochs=3, num_gpus=0, num_workers=40,
>> > > > > > rec_val='/home/piotr/data/val-passthrough.rec',
>> > > > > > rec_val_idx='/home/piotr/data/val-passthrough.idx',
>> > > save_dir='params',
>> > > > > > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
>> > > > > use_se=False,
>> > > > > > warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
>> > > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178:
>> > > ImageRecordIOParser2:
>> > > > > > /home/piotr/data/val-passthrough.rec, use 36 threads for
decoding..
>> > > > > > INFO:root:Batch [0]
>> > > > > > INFO:root:Top 1 accuracy: 0
>> > > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time
43.150922
>> > > > > > INFO:root:Batch [1]
>> > > > > > INFO:root:Top 1 accuracy: 0
>> > > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time
37.971927
>> > > > > > INFO:root:Batch [2]
>> > > > > > INFO:root:Top 1 accuracy: 0
>> > > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time
35.755363
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp
>> > > > > (upstream_master)+$
>> > > > > > git st
>> > > > > > On branch upstream_master
>> > > > > > Your branch is up to date with 'origin/upstream_master'.
>> > > > > >
>> > > > > > Changes not staged for commit:
>> > > > > >   (use "git add/rm <file>..." to update what will be committed)
>> > > > > >   (use "git checkout -- <file>..." to discard changes in
working
>> > > > > directory)
>> > > > > >         deleted:    3rdparty/openmp
>> > > > > >
>> > > > > > no changes added to commit (use "git add" and/or "git commit
-a")
>> > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp
>> > > > > (upstream_master)+$
>> > > > > > ldd build/libmxnet.so | grep -i omp
>> > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
>> > > > > > (0x00007f941241c000)
>> > > > > >
>> > > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp
>> > > > > (upstream_master)+$
>> > > > > > python
>> > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
>> > > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers
40
>> > > --mode
>> > > > > > hybrid --model resnet50_v2 --use-pretrained --kvstore local
>> > > > > --log-interval
>> > > > > > 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
>> > > > > > ~/data/val-passthrough.idx
>> > > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time
1.735117
>> > > > > > INFO:root:Batch [16]
>> > > > > > INFO:root:Top 1 accuracy: 0
>> > > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time
1.785760
>> > > > > > INFO:root:Batch [17]
>> > > > > > INFO:root:Top 1 accuracy: 0
>> > > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time
1.729033
>

Reply via email to