As already proposed, I think the easiest way to reach a common understanding is to start with a few Docker containers. Pedro, would it be possible for you to wrap your benchmarks into a few containers that reproduce the results you have shown? That way we can avoid possible misunderstandings and also pinpoint the exact parts where people disagree or misunderstand each other.
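As a strawman, the sanity check inside each container could be as simple as parsing `ldd` output for the known OpenMP runtimes, so every benchmark states up front which (and how many) runtimes it actually linked. This is a sketch I am making up here, not existing MXNet tooling; the `build/libmxnet.so` path is an assumption:

```python
import re
import subprocess

# Known OpenMP runtime sonames seen in this thread:
# libomp (LLVM), libiomp5 (Intel, ships with MKL), libgomp (GNU, ships with gcc).
OMP_PATTERN = re.compile(r"\b(libomp|libiomp5|libgomp)\.so")

def omp_runtimes(ldd_output: str):
    """Return the sorted set of OpenMP runtimes named in `ldd` output text."""
    return sorted(set(OMP_PATTERN.findall(ldd_output)))

def check_library(path="build/libmxnet.so"):
    """Run ldd on the shared library and warn if more than one runtime is linked."""
    out = subprocess.run(["ldd", path], capture_output=True, text=True).stdout
    runtimes = omp_runtimes(out)
    if len(runtimes) > 1:
        print("WARNING: multiple OpenMP runtimes linked:", runtimes)
    return runtimes
```

Running something like this right after each container build would make the "two versions linked" condition visible immediately, instead of being discovered later from benchmark anomalies.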
-Marco

Pedro Larroy <pedro.larroy.li...@gmail.com> wrote on Thu, 20 Jun 2019, 21:47:

> I can confirm that we are linking with two versions of omp. I'm
> gaining more clarity on this topic, but I still have questions. The
> facts I have gathered so far are the following:
>
> * #1: We are linking with two versions of omp, Intel's OMP and LLVM
>   OpenMP, when building with MKL enabled.
> * #2: We have 3 different possible OMP versions: Intel OMP (comes
>   with MKL), LLVM OpenMP (3rdparty/openmp), and libgomp (comes with
>   gcc; this one is used in the PR proposed by Anton).
>
> Questions:
>
> * #1 Is it OK to have two versions of OpenMP linked at the same time?
> * #2 Which implementation of OMP gives the best performance? (See the
>   total training time of my measurements for a partial answer.)
> * #3 Should we have a build flag so we can choose the OMP version at
>   build time?
> * #4 Which compiler and build flags did Chris use to get the 10x
>   slowdown?
> * #5 @Stas: is there a script to replicate your benchmarks easily? If
>   so, could you provide a link? I think we would need to reproduce
>   your benchmarks and verify which versions are being linked. It's
>   possible that while compiling with MKL, Intel's OMP was pulled in
>   instead of GNU OpenMP.
> * #6 @Chris: how should we maintain the copy of LLVM's OpenMP? Should
>   we update the subrepo regularly?
>
> My conclusions so far:
>
> * #1 We should avoid linking two versions of OpenMP if possible and
>   allow users to choose one in the build, as we do for BLAS.
> * #2 For performance reasons, and for more control across different
>   compiler versions, it indeed makes sense to keep the LLVM OpenMP
>   version in 3rdparty for now. So unless more data is gathered, it
>   makes sense not to remove it as of now.
> * #3 We should provide build options to choose which OpenMP library
>   is to be used from the three options available, including libgomp.
> * #4 By refining the build we could also enable OpenMP on macOS
>   without additional contortions (it doesn't work as of today):
>   https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> * #5 We should add the different OMP versions to our benchmarks and
>   track their performance, so this data is available for prescribing
>   the best build options and for binary releases.
>
> This is also an interesting related GitHub issue posted in the
> mkl-dnn repository: https://github.com/intel/mkl-dnn/issues/230
>
> I don't observe the order-of-magnitude divergence reported by Chris
> on vanilla Ubuntu 18.04 in samples/s, but the full training does
> finish faster with the OMP from 3rdparty (LLVM OpenMP) than with
> libgomp.
>
> There are also differences in training time when using MKL: it's
> actually a bit slower. I don't know if that is related to OMP.
>
> gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
>
> Anton's branch: g...@github.com:lebeg/incubator-mxnet.git, branch 'omp'
>
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd build/libmxnet.so | grep -i omp
>         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fd99a51d000)
>
> time python train_mnist.py
>
> INFO:root:Epoch[18] Validation-accuracy=0.984176
> INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec  accuracy=1.000000
> INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec  accuracy=0.999531
> INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec  accuracy=1.000000
> INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec  accuracy=0.999375
> INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec  accuracy=0.999531
> INFO:root:Epoch[19] Train-accuracy=0.999717
> INFO:root:Epoch[19] Time cost=1.219
> INFO:root:Epoch[19] Validation-accuracy=0.983977
> 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata 1146052maxresident)k
> 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
>
> Master, MKL ON:
>
> (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd ../../build/libmxnet.so | grep -i omp
>         libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x00007f05ba38f000)
>         libiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x00007f05b09f4000)
>
> INFO:root:Epoch[18] Validation-accuracy=0.982484
> INFO:root:Epoch[19] Batch [0-100]   Speed: 36651.63 samples/sec  accuracy=0.999691
> INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 samples/sec  accuracy=0.999531
> INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [700-800] Speed: 44962.78 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [800-900] Speed: 44945.47 samples/sec  accuracy=0.999375
> INFO:root:Epoch[19] Train-accuracy=0.999717
> INFO:root:Epoch[19] Time cost=1.367
> INFO:root:Epoch[19] Validation-accuracy=0.982783
> 854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata 1154348maxresident)k
> 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
>
> MKL OFF:
>
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL cmake_options.yml
> USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd build/libmxnet.so | grep -i omp
>         libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x00007fb720c54000)
>
> INFO:root:Epoch[18] Validation-accuracy=0.983479
> INFO:root:Epoch[19] Batch [0-100]   Speed: 46784.02 samples/sec  accuracy=1.000000
> INFO:root:Epoch[19] Batch [100-200] Speed: 48824.29 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [200-300] Speed: 49190.31 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46 samples/sec  accuracy=0.999375
> INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Train-accuracy=0.999767
> INFO:root:Epoch[19] Time cost=1.259
> INFO:root:Epoch[19] Validation-accuracy=0.983181
> 755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata 1147008maxresident)k
> 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
>
> Let me know what you think.
>
> Link to the original PR:
> https://github.com/apache/incubator-mxnet/pull/12160
>
> Thanks.
>
> On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> <kellen.sunderl...@gmail.com> wrote:
> >
> > "if you're linking in two then you're doing something wrong." Correct,
> > that's one thing I believe we've got consensus on. So let's call that
> > out as a bug to be fixed.
> > Let's move forward with some reproducible numbers and then discuss
> > the pros / cons of which particular OMP implementation we should use.
> >
> > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy
> > <pedro.larroy.li...@gmail.com> wrote:
> >
> > > Hi Chris
> > >
> > > I would ask you to have a bit of patience and help us with your
> > > experience in this matter. Nobody is ignoring anything. I think we
> > > are individually gathering feedback and trying to understand the
> > > multiple contributions made to this topic, including yours; then we
> > > can go step by step, understand what is going on, run experiments,
> > > and report back to the list or the corresponding GitHub item. It
> > > was suggested by Kellen to prepare some containers; this takes
> > > effort.
> > >
> > > Regarding your final comment, most of us also have many other
> > > things to do and responsibilities, even if our daytime jobs might
> > > involve MXNet in some form or another. I think that's part of the
> > > privilege and responsibility of working closely with an open source
> > > project, and the magic of collaboration across organizations. Let's
> > > all be patient and take some time to understand and reason about
> > > this topic, which is not simple. Since we decided to step back and
> > > gather more data, let's take the time and do it properly.
> > >
> > > Personally I hope to find time to look again into this issue before
> > > the end of the week.
> > >
> > > Thanks.
> > >
> > > Pedro.
> > >
> > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier
> > > <cjolivie...@apache.org> wrote:
> > > >
> > > > if you're linking in two then you're doing something wrong. You
> > > > can see from my email yesterday that only one is linked in. This
> > > > is also the case with the mkl version built by the Makefile --
> > > > only the Intel OMP library is used (no libgomp).
> > > > That being said, do you have clear evidence that using Intel OMP
> > > > is both problematic and that the situation isn't fixable? The
> > > > burden of proof is on the ones requesting the change; it is not
> > > > my responsibility to justify the current state. There must be
> > > > something "terrible" and unfixable to justify a change, and I
> > > > have seen no proof of this in all this time.
> > > >
> > > > On a side note, I mentioned a couple of things in my email
> > > > yesterday that still are not being responded to (they were also
> > > > ignored in the last incarnation of this "discussion" -- I have
> > > > enough experience in this matter to assume "discussion" is a
> > > > waste of my time, seeing as I am not paid to "work on" mxnet like
> > > > y'all are).
> > > >
> > > > -C
> > > >
> > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland
> > > > <kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > I've also quite often seen two versions of OpenMP linked. I
> > > > > think we can all agree we probably want to avoid linking in two
> > > > > libraries that do effectively the same thing.
> > > > >
> > > > > The performance questions should be fairly straightforward to
> > > > > demonstrate, right? Could we just collaborate on a few minimal
> > > > > Dockerfiles that show (or don't show) Intel OpenMP performance
> > > > > speedups with the workloads Chris is referencing?
> > > > >
> > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav
> > > > > <stanislav.tsuk...@gmail.com> wrote:
> > > > >
> > > > > > Hi, Chris!
> > > > > >
> > > > > > Stas here - I've gathered that performance data.
> > > > > > Sure thing, I can be wrong, but please elaborate a bit on
> > > > > > what we are missing. Be assured, intentional misdirection was
> > > > > > never the case.
> > > > > >
> > > > > > Thanks a lot for being constructive.
> > > > > > > Turning Intel OMP on and off (and MKL as well, since it
> > > > > > > tends to pull in omp, depending which one is linked in).
> > > > > >
> > > > > > We never considered turning MKL off. We are on the same page
> > > > > > here - MKL is crucial for the performance. Why should we?
> > > > > > There's a GOMP-linked version of MKL that we can use.
> > > > > >
> > > > > > What we did was measure whether using the compiler's default
> > > > > > OpenMP implementation, instead of the referenced source-code
> > > > > > distribution of OpenMP, makes anything slower.
> > > > > > We found the impact to be hardly measurable: the difference
> > > > > > between GOMP and iOMP is <5% on our benchmarks, most of the
> > > > > > time less than that.
> > > > > >
> > > > > > We just suggest simplifying the build of mxnet by removing
> > > > > > the unnecessary dependency.
> > > > > >
> > > > > > During that work we discovered, for example, the following
> > > > > > amazing issue:
> > > > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > > > >
> > > > > > Best Regards
> > > > > >
> > > > > > Stas
> > > > > >
> > > > > > On 18.06.19, 18:24, "Chris Olivier" <cjolivie...@gmail.com> wrote:
> > > > > >
> > > > > > I am very reluctant to feed the trolls again, and this will
> > > > > > be the last time I address Pedro or Anton on the subject, but
> > > > > > since I think the numbers being presented are incorrect
> > > > > > (either by the builders not really understanding what they
> > > > > > are building, or possibly intentional misdirection):
> > > > > >
> > > > > > Turning Intel OMP on and off (and MKL as well, since it tends
> > > > > > to pull in omp, depending which one is linked in), there is a
> > > > > > HUGE difference. This is consistent with my experience
> > > > > > before, when it was added.
> > > > > > default mnist:
> > > > > >
> > > > > > python ../example/image-classification/train_mnist.py
> > > > > > INFO:root:start with arguments Namespace(add_stn=False,
> > > > > > batch_size=64, disp_batches=100, dtype='float32',
> > > > > > gc_threshold=0.5, gc_type='none', gpus=None,
> > > > > > image_shape='1, 28, 28', initializer='default',
> > > > > > kv_store='device', load_epoch=None, loss='', lr=0.05,
> > > > > > lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0,
> > > > > > model_prefix=None, mom=0.9, monitor=0, network='mlp',
> > > > > > num_classes=10, num_epochs=20, num_examples=60000,
> > > > > > num_layers=None, optimizer='sgd', profile_server_suffix='',
> > > > > > profile_worker_suffix='', save_period=1, test_io=0, top_k=0,
> > > > > > warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
> > > > > >
> > > > > > INTEL OMP:
> > > > > >
> > > > > > ldd libmxnet.so | grep omp
> > > > > >         libomp.so => /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so (0x00007f978fde7000)
> > > > > >
> > > > > > INFO:root:Epoch[0] Batch [0-100]   Speed: 31548.09 samples/sec  accuracy=0.780012
> > > > > > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec  accuracy=0.920469
> > > > > > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec  accuracy=0.928281
> > > > > > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec  accuracy=0.942813
> > > > > > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec  accuracy=0.938750
> > > > > > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec  accuracy=0.946562
> > > > > > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec  accuracy=0.953281
> > > > > > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec  accuracy=0.951562
> > > > > > INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec  accuracy=0.957500
> > > > > > INFO:root:Epoch[0] Train-accuracy=0.925423
> > > > > > INFO:root:Epoch[0] Time cost=3.806
> > > > > > INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > > > > INFO:root:Epoch[1] Batch [0-100]   Speed: 24560.21 samples/sec  accuracy=0.968131
> > > > > > INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec  accuracy=0.966250
> > > > > >
> > > > > > LIBGOMP:
> > > > > >
> > > > > > ldd libmxnet.so | grep omp
> > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f25c25dd000)
> > > > > >
> > > > > > INFO:root:Epoch[0] Batch [0-100]   Speed: 1731.01 samples/sec  accuracy=0.782488
> > > > > > INFO:root:Epoch[0] Batch [100-200] Speed: 3551.32 samples/sec  accuracy=0.907813
> > > > > > INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec  accuracy=0.927188
> > > > > > INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec  accuracy=0.937969
> > > > > > INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec  accuracy=0.942187
> > > > > > INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec  accuracy=0.950156
> > > > > > INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec  accuracy=0.947969
> > > > > > INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec  accuracy=0.953750
> > > > > > INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec  accuracy=0.953125
> > > > > >
> > > > > > That being said, there are other issues beyond speed. The
> > > > > > DEFAULT build from the Makefile (not CMake) uses Intel OMP
> > > > > > with MKL (I showed this before) and mysteriously it has no
> > > > > > issues? This seems highly suspicious.
> > > > > > All I see is a lot of hand-waving and conjecture, and
> > > > > > pointing to StackOverflow posts made by people who may be of
> > > > > > questionable pedigree to begin with. This smells of a
> > > > > > Pedro-ego-fight rather than one of purely technical merit.
> > > > > > Also, if one knows how OMP works, they would be very
> > > > > > suspicious of the "intermittent hangs" claim -- that's
> > > > > > probably just broken race conditions elsewhere until proven
> > > > > > differently. It'd tend to freeze on the first use if
> > > > > > something is wrong (try using libgomp after a fork and see),
> > > > > > since worker threads wouldn't be assigned/joined properly.
> > > > > > Intel OMP is faster, but it also has other advantages, such
> > > > > > as allowing OMP after a fork.
> > > > > >
> > > > > > I actually addressed a lot of issues and asked for
> > > > > > clarification in the original PRs way back when, but they
> > > > > > were all just ignored.
> > > > > >
> > > > > > -Chris
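PS: when we compare runs like the ones quoted above, we could also summarize the per-batch throughput mechanically instead of eyeballing it, so the containers can print one comparable number per configuration. A sketch (the log format is taken from the outputs quoted in this thread; the helper name is invented):

```python
import re

# Matches the per-batch throughput lines in the training logs quoted above,
# e.g. "INFO:root:Epoch[0] Batch [0-100] Speed: 31548.09 samples/sec".
SPEED_RE = re.compile(r"Speed:\s*([0-9.]+)\s*samples/sec")

def mean_speed(log_text: str) -> float:
    """Average all 'Speed: N samples/sec' figures found in the log text."""
    speeds = [float(s) for s in SPEED_RE.findall(log_text)]
    if not speeds:
        raise ValueError("no throughput lines found in log text")
    return sum(speeds) / len(speeds)
```

Feeding each configuration's full training log through this would turn the "which OMP is faster" question into a single averaged samples/sec figure per build, alongside the wall-clock totals from `time`.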