btw, the call stack I am referring to below is on the issue where I explained this problem before; after I got a hostile response, I locked the issue.
On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier <cjolivie...@gmail.com> wrote:

> Again, here is what I suspect the bug is in mxnet:
>
> The way that advanced openmp libraries handle a fork is that they hook an atfork() callback in which, in the new process, the library creates a new “team” of threads to use for its thread pool (since all of the thread handles in its data structure belong to the previous process). atfork() callback order is the order in which the callbacks are registered, which will tend to be the first call to the openmp library. For this reason, the callback order will vary depending upon what other libraries might be linked in and whether they make omp calls before mxnet starts its static init.
>
> What the assert in question is trying to say is that mxnet code is calling into the omp library after a fork, but before the omp library’s atfork() handler is called, so the omp library has not yet initialized a new team of threads. This looks to be the case in one of the call stacks on that issue. This is problematic for any openmp library which supports omp after a fork, and may not be deterministic from build to build, since the order of static init calls for a given module is undefined (I think mxnet is initializing omp during static init, but this may not matter).
>
> So if mxnet is doing that, it is a bug, remains a problem regardless of the omp library, and probably should be fixed. llvm omp happens to be nice enough to tell you you’re doing something wrong, at least when built in debug mode.
>
> Once this issue is resolved, we can discuss the library inclusion itself. My objection is to “fixing” what appears to be a bug by effectively “commenting out the assert”, which is what I stated in the very beginning.
>
> It stands to reason that linking this or that library may affect whether the assert occurs, because it is not known at what time one of the dependent libraries initializes omp (thus causing it to hook its atfork handler), so it is not surprising that mucking with dependencies may cause the assert to occur or not occur.
>
> Is there another explanation for the call stack with the assert? Can this bug be ruled out?
>
> Here is an example of the atfork team concept with libgomp as well. You can probably check the current libgomp code itself, but this explains the concept: https://patchwork.ozlabs.org/patch/319827/
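For illustration, here is a minimal sketch of the atfork "team" mechanism described above, for a hypothetical runtime. This is not the actual LLVM OpenMP or libgomp code; the names (fake_omp_runtime, worker) and the team size of 4 are made up:

    // Sketch of an OpenMP-style runtime that rebuilds its thread "team"
    // in the child process after fork(). Hypothetical code for illustration.
    #include <pthread.h>
    #include <vector>

    namespace fake_omp_runtime {

    std::vector<pthread_t> team;  // worker thread pool (the "team")

    void* worker(void*) {
        // A real runtime would park here, waiting for parallel-region work.
        return nullptr;
    }

    void create_team(int n) {
        team.resize(n);
        for (auto& t : team)
            pthread_create(&t, nullptr, worker, nullptr);
    }

    // Runs in the child during fork(): the pthread_t handles inherited from
    // the parent refer to threads that do not exist in this process, so the
    // runtime discards them and spawns a fresh team.
    void atfork_child() {
        team.clear();
        create_team(4);
    }

    void init() {
        // Registered on first use of the runtime. The order in which
        // different libraries register their handlers determines the order
        // in which the handlers fire on fork(), which is why link order and
        // static init order matter here.
        pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr, atfork_child);
        create_team(4);
    }

    }  // namespace fake_omp_runtime

If code calls into the runtime in the new process before atfork_child has rebuilt the team, it operates on the parent's stale thread handles, which is the condition the assert reports.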
> On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
>
>> Thanks Pedro and Chris for your responses.
>>
>> After further investigation I find:
>>
>> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is caused by any incompatibility between gomp and llvm / intel omp. Rather, it's simply a problem of llvm / intel omp. See my comment on the issue for the methodology behind this claim.
>>
>> 2) The assertion failure when compiling with (llvm) 3rdparty/openmp can be fixed by updating the now two-year-old llvm openmp code to the newest released version. I went ahead and opened a PR: https://github.com/apache/incubator-mxnet/pull/17012
>>
>> Based on the investigation described in 1), I think Chris is right that the assertion failure is not due to some interaction between gomp and llvm omp. However, I'm not sure about Chris's suggestion that the assertion failure is due to a bug in MXNet. In fact, the failure goes away when updating the llvm openmp code. So I think it's just due to a bug in the two-year-old code.
>>
>> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not contentious. Thus let's do it via lazy consensus (72 hours), or just approve the PR and merge it.
>>
>> Please also take a look at my comment at #14979 and let everyone know if you see any option to fix the bug while keeping 3rdparty/openmp. As this bug affects an important use-case, I believe we need to remove 3rdparty/openmp from the CMake build as long as we don't find a solution for making #14979 work with 3rdparty/openmp.
>>
>> In fact, removing 3rdparty/openmp will then match the current Makefile setup, which according to my understanding is used to build the nightly releases used by the majority of developers. I.e., most users don't actually use the CMake build with 3rdparty/openmp. You can consider rescinding your veto on removing 3rdparty/openmp after reading through the evidence in that issue. If you don't provide any evidence for why the methodology/conclusion in #14979 is flawed, I will assume your previous veto is void under the Apache voting rules, as it lacks technical justification and in any case was motivated by the assertion issue, which, I agree with you, is likely not due to gomp / omp interaction.
>>
>> Thank you
>> Leonard
>>
>> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>>> Stop disseminating false information:
>>>
>>> https://github.com/apache/incubator-mxnet/issues/14979
>>>
>>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier <cjolivie...@gmail.com> wrote:
>>>
>>>> -1
>>>>
>>>> mkldnn removed omp5 for licensing issues. no bugs have actually been traced to the use of llvm openmp, only an assert caused by an actual bug in mxnet code. there are suitable workarounds.
>>>>
>>>> over time llvm omp has simply been used as a “catch all” for random problems that aren’t related at all (such as a getenv race condition in an atfork call that isn’t even part of an omp parallel region).
>>>>
>>>> the proposal is now and has always been roughly equivalent to the idea of “comment out an assert rather than fix the bug it’s reporting”.
>>>>
>>>> Up until very recently, the Makefile version of mxnet used libomp5 for YEARS and not libgomp, with no issue reported (omp not built in debug mode), so the equivalent configuration from CMake mysteriously causing myriads of problems has questionable merit and smells more like a hubris situation.
>>>>
>>>> I use tensorflow as well, and it links to libomp5 rather than libgomp.
>>>>
>>>> if the assert problem were really a problem, the bug being reported would be prioritized and fixed. it should be fixed regardless. all the time spent by some CI people trying to remove this could have fixed the actual bug in a small fraction of the time.
>>>>
>>>> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
>>>>
>>>>> I think it's reasonable to assume that the Intel MKLDNN team is an "authoritative" source on the issue of compilation with OpenMP and OpenMP runtime library related issues. Thus I suggest we follow the recommendation of the Intel MKLDNN team within the MXNet project.
>>>>> Looking through the Intel MKLDNN documentation, I find [1]:
>>>>>
>>>>>> DNNL uses the OpenMP runtime library provided by the compiler.
>>>>>
>>>>> as well as
>>>>>
>>>>>> it's important to ensure that only one OpenMP runtime is used throughout the application. Having more than one OpenMP runtime linked to an executable may lead to undefined behavior including incorrect results or crashes.
>>>>>
>>>>> To keep our project maintainable and error free, I thus suggest we follow DNNL and use the OpenMP runtime library provided by the compiler. We have limited resources, and finding the root cause of any bugs resulting from linking multiple OpenMP libraries, as currently done, is in my opinion not a good use of time. We know it's due to undefined behavior, and we know it's best practice to use the OpenMP runtime library provided by the compiler. So let's just do that.
>>>>>
>>>>> I think given that MKL-DNN has also adopted the "OpenMP runtime library provided by the compiler" approach, this issue is not contentious anymore and qualifies for lazy consensus.
>>>>>
>>>>> Thus if there is no objection within 72 hours (lazy consensus), let's drop the bundled LLVM OpenMP from master [2]. If we find any issues due to dropping the bundled LLVM OpenMP, we can always add it back prior to the next release.
>>>>>
>>>>> Best regards
>>>>> Leonard
>>>>>
>>>>> [1]: https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
>>>>> (This is the updated reference for Anton's previous comment, based on the changes made in MKLDNN in the meantime: https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066)
>>>>> [2]: Similar to https://github.com/apache/incubator-mxnet/pull/12160
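As a concrete sketch of what "use the OpenMP runtime library provided by the compiler" means for a CMake build (illustrative only; example_target is a placeholder, and this is not the actual MXNet build change):

    # Let the toolchain supply its own OpenMP runtime: gcc links libgomp and
    # clang links its libomp, so exactly one runtime ends up in the process.
    find_package(OpenMP REQUIRED)
    target_link_libraries(example_target PRIVATE OpenMP::OpenMP_CXX)

The OpenMP::OpenMP_CXX imported target is available in CMake 3.9 and later.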
>>>>> On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
>>>>>
>>>>>> I will try to stay on the sidelines for now, since previous conversations about OMP have not been productive here and I have spent way too much time on this already; I'm not the first one giving up on trying to help with this topic.
>>>>>>
>>>>>> I would be glad if you guys can work together and find a solution. I will just put down my understanding of the big picture, hoping that it helps move things forward.
>>>>>>
>>>>>> Recently the intel omp library, which seemed to have the best performance of the three, was removed from MKL.
>>>>>>
>>>>>> - There are three libraries in play: GNU OpenMP, shipped with gcc (gomp); LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP when using MKL, which was recently removed (iomp).
>>>>>>
>>>>>> - IOMP seems to have the best performance. There are stability issues producing crashes sometimes, but the impact seems relatively small for users and developers. In general, linking a different OMP version than the one shipped with the compiler is known to cause stability issues, but it's done anyway.
>>>>>>
>>>>>> - LLVM-OMP is used when building with CMake; it is not used in the PIP releases or when building with Make. It has stability issues: it hangs during test execution when running in debug mode and produces tons of assertions in debug mode. It might have some small performance gains, but there is no clear-cut data that showcases significant gains.
>>>>>>
>>>>>> - GOMP is the version shipped with GCC and used in the PIP wheels without MKL; it has no stability problems.
>>>>>>
>>>>>> As a ballpark, IOMP might give a 10% performance improvement in some cases. We need to document well how users should tune and configure MXNet when using OMP.
>>>>>>
>>>>>> As a developer, the safest bet is to use GOMP, to be able to debug and develop without issues. As a user of CPU inference / training you want to run MKL, so it depends on how the Intel guys want to do things. My preference as an engineer is always stability > speed.
>>>>>>
>>>>>> Related tickets:
>>>>>>
>>>>>> https://github.com/apache/incubator-mxnet/issues/16891
>>>>>> https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
>>>>>> https://github.com/apache/incubator-mxnet/issues/11417
>>>>>> https://github.com/apache/incubator-mxnet/issues/15690
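On the point about documenting how users should tune and configure MXNet with OMP, these are the kind of knobs involved; the values below are placeholders for illustration, not recommendations from this thread:

    export OMP_NUM_THREADS=8                      # threads used by the OpenMP runtime
    export KMP_AFFINITY=granularity=fine,compact  # thread pinning, Intel/LLVM OpenMP only
    export GOMP_CPU_AFFINITY=0-7                  # thread pinning, GNU libgomp only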
>>>>>> On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
>>>>>>
>>>>>>> Is this related to https://github.com/apache/incubator-mxnet/issues/10856? I unlocked that Github issue based on the Apache Code of Conduct: https://www.apache.org/foundation/policies/conduct#specific-guidelines
>>>>>>>
>>>>>>> On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
>>>>>>>
>>>>>>>> (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd build/libmxnet.so | grep -i openmp
>>>>>>>> libomp.so => /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so (0x00007fde0991d000)
>>>>>>>> (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx ~/data/val-passthrough.idx
>>>>>>>> INFO:root:Namespace(batch_norm=False, batch_size=256, data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32', kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs', lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step', lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9, num_epochs=3, num_gpus=0, num_workers=40, rec_val='/home/piotr/data/val-passthrough.rec', rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params', save_frequency=0, top_k=0, use_pretrained=True, use_rec=True, use_se=False, warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
>>>>>>>> [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
>>>>>>>> INFO:root:Batch [0]
>>>>>>>> INFO:root:Top 1 accuracy: 0
>>>>>>>> INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
>>>>>>>> INFO:root:Batch [1]
>>>>>>>> INFO:root:Top 1 accuracy: 0
>>>>>>>> INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
>>>>>>>> INFO:root:Batch [2]
>>>>>>>> INFO:root:Top 1 accuracy: 0
>>>>>>>> INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
>>>>>>>>
>>>>>>>> (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$ git st
>>>>>>>> On branch upstream_master
>>>>>>>> Your branch is up to date with 'origin/upstream_master'.
>>>>>>>>
>>>>>>>> Changes not staged for commit:
>>>>>>>>   (use "git add/rm <file>..." to update what will be committed)
>>>>>>>>   (use "git checkout -- <file>..." to discard changes in working directory)
>>>>>>>>         deleted:    3rdparty/openmp
>>>>>>>>
>>>>>>>> no changes added to commit (use "git add" and/or "git commit -a")
>>>>>>>> (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$ ldd build/libmxnet.so | grep -i omp
>>>>>>>> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f941241c000)
>>>>>>>>
>>>>>>>> (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$ python ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx ~/data/val-passthrough.idx
>>>>>>>> INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
>>>>>>>> INFO:root:Batch [16]
>>>>>>>> INFO:root:Top 1 accuracy: 0
>>>>>>>> INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
>>>>>>>> INFO:root:Batch [17]
>>>>>>>> INFO:root:Top 1 accuracy: 0
>>>>>>>> INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033