ok. i see you answered below. was not obvious at first. you can upgrade the llvm.
On Sun, Dec 8, 2019 at 6:51 PM Lausen, Leonard <lau...@amazon.com.invalid> wrote:

> Thanks Chris for the elaboration.

> > What the assert in question is trying to say is that mxnet code is calling into the omp library after a fork, but before the omp library's atfork() handler is called, so the omp library has not yet initialized a new team of threads. This looks to be the case in one of the call stacks on that issue. This is problematic for any openmp library which supports omp after a fork, and may not be deterministic from build to build, since the order of static init calls for a given module is undefined (i think mxnet is initializing omp during static init, but this may not matter).

> I disagree. The assert fails without any forking. You can trigger it by running `python3 -c 'import mxnet'` a few times.

> As described in https://github.com/apache/incubator-mxnet/pull/17012 the previously failing assertion is still part of the LLVM OpenMP 9.0 codebase. In particular, you can compare line https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616 to https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481 where the latter is the line that currently fails in MXNet.

> I would like to reiterate: we are currently using a random and UNRELEASED LLVM OpenMP version from 2017. There is no evidence that the assertion failure is due to a bug in MXNet; rather, there is strong evidence that it is due to a bug in the previously used LLVM version. (See the latter part of this email regarding your suggestion about a bug in MXNet's fork handling.)

> > It stands to reason that linking this or that library may affect the assert occurring because it's not known at what time one of the dependent libraries initializes omp (thus causing it to hook its atfork handler), so it is not surprising that mucking with dependencies may cause the assert to occur or not occur.

> With respect to the assert failure, there is no difference in "linking this or that library". We are only updating the version of the LLVM OpenMP codebase. See https://github.com/apache/incubator-mxnet/pull/17012

> Thus your veto on updating the shipped LLVM OpenMP lacks technical justification and is void. I'm requesting 3 committers to approve the PR to go ahead with the update.

> > So if mxnet is doing that, it is a bug and remains a problem regardless of the omp library and probably should be fixed. llvm omp happens to be nice enough to tell you you're doing something wrong, at least when built in debug mode.

> Yes, we need to investigate if there is a bug. Let's do this based on https://github.com/apache/incubator-mxnet/issues/14979 which is a "real" crash that happens both with the old unreleased and the new 9.0 LLVM OpenMP versions.

> > Once this issue is resolved, we can discuss the library inclusion itself. My objection is "fixing" what appears to be a bug by effectively "commenting out the assert" which is what i stated in the very beginning.

> This is a reasonable approach, as long as you are willing to help fix it. As made evident by your comments, you have more experience with advanced OpenMP libraries. Thus let's work together on fixing the root cause, assuming it's indeed a bug in MXNet.
> If we run into any roadblock with finding a root cause in MXNet, the only alternative I see is to remove the 3rdparty/openmp library, as we can not rule out the possibility of an LLVM OpenMP bug either.

> > Is there another explanation for the call stack with the assert? Can this bug be ruled out?

> Yes. The other explanation is that it's due to a bug in the old LLVM OpenMP version (see above). The stack trace you refer to, https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-505162890, relies on mkl and is hard to reproduce.

> With respect to establishing and fixing the bug in MXNet, let's focus on the crash described in https://github.com/apache/incubator-mxnet/issues/14979 which happens with Intel OpenMP, the old unreleased LLVM OpenMP, and the current LLVM OpenMP 9.0 release. It involves forking and thus seems quite related to the first paragraph of your email, unlike the assertion failure, which happens without any forking.

> Best regards
> Leonard

> On Sun, 2019-12-08 at 07:58 -0800, Chris Olivier wrote:
> > btw the call stack I am referring to below is the one where I explained this problem before and, after I got a hostile response, I locked the issue.

> > On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier <cjolivie...@gmail.com> wrote:
> > > Again, here is what I suspect the bug is in mxnet:

> > > The way that advanced openmp libraries handle a fork is that they hook an atfork() callback in which, in the new process, the library creates a new "team" of threads to use for its thread pool (since all of the thread handles in its data structure belong to the previous process). atfork() callback order is the order in which the callbacks are registered, which will tend to be the first call to the openmp library. For this reason, the handler order will vary depending upon what other libraries might be linked in and whether they make omp calls before mxnet starts its static init.

> > > What the assert in question is trying to say is that mxnet code is calling into the omp library after a fork, but before the omp library's atfork() handler is called, so the omp library has not yet initialized a new team of threads. This looks to be the case in one of the call stacks on that issue. This is problematic for any openmp library which supports omp after a fork, and may not be deterministic from build to build, since the order of static init calls for a given module is undefined (i think mxnet is initializing omp during static init, but this may not matter).

> > > So if mxnet is doing that, it is a bug and remains a problem regardless of the omp library and probably should be fixed. llvm omp happens to be nice enough to tell you you're doing something wrong, at least when built in debug mode.

> > > Once this issue is resolved, we can discuss the library inclusion itself. My objection is "fixing" what appears to be a bug by effectively "commenting out the assert", which is what i stated in the very beginning.

> > > It stands to reason that linking this or that library may affect the assert occurring, because it's not known at what time one of the dependent libraries initializes omp (thus causing it to hook its atfork handler), so it is not surprising that mucking with dependencies may cause the assert to occur or not occur.
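To make the scenario described above concrete, here is a minimal, self-contained sketch of the atfork ordering problem. It is not MXNet or LLVM OpenMP source; `omp_runtime` and `library_a` are made-up names, and the thread "team" is modeled simply by remembering which process created it (worker threads do not survive fork(), so a team created in the parent is unusable in the child until it is rebuilt):

    // Illustrative sketch only (not MXNet or LLVM OpenMP source): why atfork
    // registration order matters for an OpenMP-style runtime.
    // Build (assumed): g++ -pthread atfork_order.cc -o atfork_order
    #include <pthread.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    namespace omp_runtime {
    // Worker threads do not survive fork(), so a team is only usable in the
    // process that created it. Model that by remembering the creating pid.
    pid_t team_owner = 0;
    void create_team() { team_owner = getpid(); }
    void atfork_child() { create_team(); }  // rebuild the team in the child
    void parallel_region() {
      if (team_owner != getpid())
        std::fprintf(stderr, "debug assert would fire: team not re-initialized after fork\n");
      else
        std::puts("parallel region runs on a valid team");
    }
    }  // namespace omp_runtime

    namespace library_a {
    // A dependent library whose own child handler happens to call into OpenMP.
    void atfork_child() { omp_runtime::parallel_region(); }
    }  // namespace library_a

    int main() {
      // Child handlers run in the order they were registered, i.e. roughly the
      // order in which the libraries were first initialized. library_a comes
      // first here, so in the child it calls into the runtime before the
      // runtime's own handler has rebuilt the team.
      pthread_atfork(nullptr, nullptr, library_a::atfork_child);
      pthread_atfork(nullptr, nullptr, omp_runtime::atfork_child);
      omp_runtime::create_team();

      if (fork() == 0) _exit(0);       // both child handlers run inside fork()
      wait(nullptr);
      omp_runtime::parallel_region();  // the parent's team is still valid
      return 0;
    }

In this toy version, swapping the two pthread_atfork() calls makes the diagnostic message go away, which mirrors the point that merely changing which dependency initializes omp first can make the assert appear or disappear.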
> > > Is there another explanation for the call stack with the assert? Can this bug be ruled out?

> > > Here is an example of the atfork team concept with libgomp as well. Probably you can check the current libgomp code itself, but this explains the code: https://patchwork.ozlabs.org/patch/319827/

> > > On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
> > > > Thanks Pedro and Chris for your responses.

> > > > After further investigation I find:

> > > > 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is caused by any incompatibility between gomp and llvm / intel omp. Rather, it's simply a problem of llvm / intel omp. See my comment on the issue for the methodology used to arrive at this claim.

> > > > 2) Regarding the assertion failure when compiling with (llvm) 3rdparty/openmp, it can be fixed by updating the by now 2 years old llvm openmp code to the newest released version. I went ahead and opened a PR: https://github.com/apache/incubator-mxnet/pull/17012

> > > > Based on the investigation described in 1), I think Chris is right that the assertion failure is not due to some interaction between gomp and llvm omp. However, I'm not sure about Chris's suggestion that the assertion failure is due to a bug in MXNet. In fact, the failure goes away when updating the llvm openmp code. So I think it's just due to a bug in the 2 years old code.

> > > > @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not contentious. Thus let's do it via lazy consensus (72 hours) or just approve the PR and merge it.

> > > > Please also take a look at my comment at #14979 and let everyone know if you see any option to fix the bug while keeping 3rdparty/openmp. As this bug affects an important use-case, I believe we need to remove 3rdparty/openmp from the CMake build as long as we don't find a solution for making #14979 work with 3rdparty/openmp.

> > > > In fact, removing 3rdparty/openmp will then match the current Makefile setup that, according to my understanding, is used to build the nightly releases used by the majority of developers. I.e. most users actually don't use the CMake build with 3rdparty/openmp. You can consider rescinding your veto on removing 3rdparty/openmp after reading through the evidence in that issue. If you don't provide any evidence for why the methodology/conclusion in #14979 is flawed, I will assume your previous veto is void based on the Apache Voting rules, as it lacks technical justification and in any case was motivated by the assertion issue, which, I agree with you, is likely not due to gomp / omp interaction.
> > > > Thank you
> > > > Leonard

> > > > On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > > > > Stop disseminating false information:

> > > > > https://github.com/apache/incubator-mxnet/issues/14979

> > > > > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier <cjolivie...@gmail.com> wrote:
> > > > > > -1

> > > > > > mkldnn removed omp5 for licensing issues. no bugs have actually been traced to the use of llvm openmp, only an assert caused by an actual bug in mxnet code. there are suitable workarounds.

> > > > > > over time llvm omp has simply been used as a "catch all" for random problems that aren't related at all (such as a getenv race condition in an atfork call that isn't even part of an omp parallel region).

> > > > > > the proposal is now, and has always been, roughly equivalent to the idea of "comment out an assert rather than fix the bug it's reporting".

> > > > > > Up until very recently, the Makefile version of mxnet used libomp5 for YEARS and not libgomp, with no issue reported (omp not built in debug mode), so the equivalent configuration from CMake mysteriously causing myriads of problems has questionable merit and smells more like a hubris situation. I use tensorflow as well, and it links to libomp5 rather than libgomp.

> > > > > > if the assert problem is really a problem, the bug being reported would be prioritized and fixed. it should be fixed regardless. all the time spent by some CI people trying to remove this could have simply fixed the actual bug in a small fraction of the time.

> > > > > > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
> > > > > > > I think it's reasonable to assume that the Intel MKLDNN team is an "authoritative" source on the issue of compilation with OpenMP and on OpenMP runtime library related issues. Thus I suggest we follow the recommendation of the Intel MKLDNN team within the MXNet project.

> > > > > > > Looking through the Intel MKLDNN documentation, I find [1]:

> > > > > > > > DNNL uses OpenMP runtime library provided by the compiler.

> > > > > > > as well as

> > > > > > > > it's important to ensure that only one OpenMP runtime is used throughout the application. Having more than one OpenMP runtime linked to an executable may lead to undefined behavior including incorrect results or crashes.

> > > > > > > To keep our project maintainable and error free, I thus suggest we follow DNNL and use the OpenMP runtime library provided by the compiler. We have limited resources, and finding the root cause for any bugs resulting from linking multiple OpenMP libraries, as currently done, is, in my opinion, not a good use of time.
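As a concrete illustration of "the OpenMP runtime provided by the compiler": the trivial program below, built with nothing more than `-fopenmp`, picks up libgomp under g++ and libomp under clang++, so no separate runtime needs to be bundled. This is only a toolchain-check sketch, not MXNet code, and the file name is made up:

    // hello_omp.cc -- minimal check of which OpenMP runtime the compiler links.
    //   g++     -fopenmp hello_omp.cc -o hello_omp   (links libgomp)
    //   clang++ -fopenmp hello_omp.cc -o hello_omp   (links libomp)
    // Then inspect the result with: ldd hello_omp | grep -i omp
    #include <omp.h>
    #include <cstdio>

    int main() {
    #pragma omp parallel
      {
    #pragma omp single  // print once, from inside the parallel region
        std::printf("OpenMP runtime reports %d threads\n", omp_get_num_threads());
      }
      return 0;
    }

The same `ldd` check applied to libmxnet.so (as Pedro does further down in this thread) shows which runtime an MXNet build actually ends up with.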
> > > > > > > We know it's due to undefined behavior, and we know it's best practice to use the OpenMP runtime library provided by the compiler. So let's just do that.

> > > > > > > I think, given that MKL-DNN has also adopted the "OpenMP runtime library provided by the compiler" approach, this issue is not contentious anymore and qualifies for lazy consensus.

> > > > > > > Thus if there is no objection within 72 hours (lazy consensus), let's drop the bundled LLVM OpenMP from master [2]. If we find any issues due to dropping the bundled LLVM OpenMP, we can always add it back prior to the next release.

> > > > > > > Best regards
> > > > > > > Leonard

> > > > > > > [1]: https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > > > > > > (This is the updated reference from Anton's previous comment, based on the changes done in MKLDNN in the meantime: https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066 )
> > > > > > > [2]: Similar to https://github.com/apache/incubator-mxnet/pull/12160

> > > > > > > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > > > > > > I will try to stay on the sidelines for now, since previous conversations about OMP have not been productive here and I have spent way too much time on this already. I'm not the first one giving up on trying to help with this topic.

> > > > > > > > I would be glad if you guys can work together and find a solution. I will just put down my understanding of the big picture, hoping that it helps move it forward.

> > > > > > > > Recently the intel omp library, which seemed to have the best performance of the 3, was removed from MKL.

> > > > > > > > - There are 3 libraries in play: GNU OpenMP, which is shipped with gcc (gomp); LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP when using MKL, which was recently removed (iomp).

> > > > > > > > - IOMP seems to have the best performance. There are stability issues producing crashes sometimes, but the impact seems relatively small for users and developers. In general, linking with a different OMP version than the one shipped with the compiler is known to cause stability issues, but it's done anyway.

> > > > > > > > - LLVM-OMP is used when building with CMake, and is not used in the PIP releases or when building with Make. It has stability issues, hangs when running in debug mode during test execution, and produces tons of assertions in debug mode.
> > > > > > > > It might have some small performance gains, but there is no clear-cut data that showcases significant performance gains.

> > > > > > > > - GOMP is the version shipped with GCC and used in the PIP wheels without MKL; it has no stability problems.

> > > > > > > > As a ballpark, IOMP might give a 10% performance improvement in some cases. We need to document well how users should tune and configure MXNet when using OMP.

> > > > > > > > As a developer, the safest bet is to use GOMP to be able to debug and develop without issues. As a user of CPU inference / training you want to run MKL, so it depends on how the Intel guys want to do things. My preference as an engineer is always stability > speed.

> > > > > > > > Related tickets:
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/16891
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/11417
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/15690

> > > > > > > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
> > > > > > > > > Is this related to https://github.com/apache/incubator-mxnet/issues/10856?
> > > > > > > > > I unlocked that Github issue based on the Apache Code of Conduct: https://www.apache.org/foundation/policies/conduct#specific-guidelines

> > > > > > > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd build/libmxnet.so| grep -i openmp
> > > > > > > > > >         libomp.so => /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so (0x00007fde0991d000)
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx ~/data/val-passthrough.idx
> > > > > > > > > > INFO:root:Namespace(batch_norm=False, batch_size=256, data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32', kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs', lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step', lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9, num_epochs=3, num_gpus=0, num_workers=40, rec_val='/home/piotr/data/val-passthrough.rec', rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params', save_frequency=0,
> > > > > > > > > > top_k=0, use_pretrained=True, use_rec=True, use_se=False, warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > > > > > > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > > > > > > > > > INFO:root:Batch [0]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > > > > > > > > > INFO:root:Batch [1]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > > > > > > > > > INFO:root:Batch [2]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363

> > > > > > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$ git st
> > > > > > > > > > On branch upstream_master
> > > > > > > > > > Your branch is up to date with 'origin/upstream_master'.
> > > > > > > > > >
> > > > > > > > > > Changes not staged for commit:
> > > > > > > > > >   (use "git add/rm <file>..." to update what will be committed)
> > > > > > > > > >   (use "git checkout -- <file>..." to discard changes in working directory)
> > > > > > > > > >         deleted:    3rdparty/openmp
> > > > > > > > > >
> > > > > > > > > > no changes added to commit (use "git add" and/or "git commit -a")
> > > > > > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$ ldd build/libmxnet.so | grep -i omp
> > > > > > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f941241c000)

> > > > > > > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$ python ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx ~/data/val-passthrough.idx
> > > > > > > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
> > > > > > > > > > INFO:root:Batch [16]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
> > > > > > > > > > INFO:root:Batch [17]
> > > > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033
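Building on the ldd checks above: a small standalone diagnostic can also confirm at run time that only one OpenMP runtime ends up mapped into the process, which is the property the DNNL documentation asks for. This is only a sketch, not part of MXNet; the tool name, file name and the libmxnet.so path are illustrative:

    // check_omp.cc -- illustrative diagnostic (not part of MXNet): load a
    // libmxnet.so and list every OpenMP runtime mapped into the process.
    // Build (assumed): g++ check_omp.cc -o check_omp -ldl
    #include <dlfcn.h>
    #include <link.h>
    #include <cstdio>
    #include <cstring>

    static int report_omp(struct dl_phdr_info *info, size_t, void *count_ptr) {
      const char *name = info->dlpi_name;
      if (name && (std::strstr(name, "libgomp") || std::strstr(name, "libomp") ||
                   std::strstr(name, "libiomp5"))) {
        std::printf("OpenMP runtime mapped: %s\n", name);
        ++*static_cast<int *>(count_ptr);
      }
      return 0;  // keep iterating over all loaded objects
    }

    int main() {
      // Example path only; point this at the libmxnet.so build under test.
      if (!dlopen("./build/libmxnet.so", RTLD_NOW | RTLD_GLOBAL))
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
      int count = 0;
      dl_iterate_phdr(report_omp, &count);
      std::printf("%d OpenMP runtime(s) loaded; more than one risks the undefined behavior DNNL warns about\n", count);
      return 0;
    }

Against the two builds Pedro compares above, it should report the 3rdparty libomp.so in the first case and the system libgomp.so.1 in the second.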