I can confirm that we are linking with two versions of OMP. I'm gaining
more clarity on this topic, but I still have questions. The facts I have
gathered so far are the following:

* #1: We are linking with two versions of OMP, Intel OMP and LLVM
OpenMP, when building with MKL enabled.
* #2: We have three possible OMP implementations: Intel OMP (comes with
MKL), LLVM OpenMP (3rdparty/openmp), and libgomp (comes with gcc; this
is the one used in the PR proposed by Anton).

Questions:

 * #1 Is it OK to have two versions of OpenMP linked at the same time?
 * #2 Which OMP implementation gives the best performance? (See the
total training times in my measurements below for a partial answer.)
 * #3 Should we have a build flag so we can choose which OMP version is used?
 * #4 Which compiler and build flags did Chris use to get the 10x slowdown?
 * #5 @Stas: is there a script to replicate your benchmarks easily? If
so, could you provide a link? I think we need to reproduce your
benchmarks and verify which versions are being linked (see the sketch
after this list). It's possible that when compiling with MKL, Intel OMP
was pulled in instead of GNU OpenMP.
 * #6 @Chris: how should we maintain the copy of LLVM OpenMP? Should we
update the subrepo regularly?
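
To help with question #5: below is a minimal sketch (mine, not an
existing script in the repo) that prints which OpenMP runtimes a Python
process has actually loaded after importing mxnet, by reading
/proc/self/maps (Linux only). The function name is just illustrative.

    import mxnet  # noqa: F401  - import so libmxnet.so and its OMP deps get loaded

    def loaded_omp_runtimes():
        """Return the OpenMP runtime libraries mapped into this process."""
        libs = set()
        with open('/proc/self/maps') as maps:
            for line in maps:
                parts = line.split()
                # Only mapping lines with a path have a 6th field.
                if len(parts) >= 6 and any(
                        name in parts[5] for name in ('libomp', 'libgomp', 'libiomp')):
                    libs.add(parts[5])
        return sorted(libs)

    if __name__ == '__main__':
        for lib in loaded_omp_runtimes():
            print(lib)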

My conclusions so far:

 * #1 We should avoid linking two versions of OMP if possible and
allow users to choose one at build time, as we do for BLAS.
 * #2 For performance reasons, and for more control across different
compiler versions, it does seem to make sense to keep the LLVM OpenMP
version in 3rdparty for now. Unless more data is gathered, we should
not remove it at this point.
 * #3 We should provide build options to choose which OpenMP library is
used, out of the three options available, including libgomp.
 * #4 By refining the build we could also enable OpenMP on Mac without
additional contortions (this doesn't work as of today):
https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
 * #5 We should add the different OMP versions to our benchmarks and
track their performance, so this data is available for prescribing the
best build options and for binary releases (a minimal micro-benchmark
sketch follows this list).
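
Regarding #5, something along these lines could be a starting point: a
minimal, OMP-sensitive micro-benchmark sketch (mine, not part of any
existing suite) that times a large CPU matrix multiply. The idea would
be to run the same script against builds linked with libomp, libiomp5
and libgomp and compare the averages; the function name and sizes are
just illustrative.

    import time
    import mxnet as mx

    def bench_dot(n=2048, repeat=20):
        """Average seconds per n x n matrix multiply on the CPU context."""
        a = mx.nd.random.uniform(shape=(n, n))
        b = mx.nd.random.uniform(shape=(n, n))
        mx.nd.dot(a, b).wait_to_read()      # warm-up
        start = time.time()
        for _ in range(repeat):
            mx.nd.dot(a, b).wait_to_read()  # block so every multiply is timed
        return (time.time() - start) / repeat

    if __name__ == '__main__':
        print('avg seconds per 2048x2048 dot: %.4f' % bench_dot())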

There is also an interesting related GitHub issue in the mkl-dnn
repository: https://github.com/intel/mkl-dnn/issues/230


On vanilla Ubuntu 18.04 I don't observe the order-of-magnitude
divergence in samples/s reported by Chris, but the full training does
indeed finish faster with the OMP from 3rdparty (LLVM OpenMP) than with
libgomp.

There are also differences in training time when using MKL: it's
actually a bit slower, and I don't know whether that is related to OMP.

gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

Anton's branch:  g...@github.com:lebeg/incubator-mxnet.git   branch 'omp'
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
build/libmxnet.so |grep -i omp
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x00007fd99a51d000)

time python train_mnist.py

INFO:root:Epoch[18] Validation-accuracy=0.984176
INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00 samples/sec
 accuracy=1.000000
INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69 samples/sec
 accuracy=0.999531
INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53 samples/sec
 accuracy=1.000000
INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98 samples/sec
 accuracy=0.999531
INFO:root:Epoch[19] Train-accuracy=0.999717
INFO:root:Epoch[19] Time cost=1.219
INFO:root:Epoch[19] Validation-accuracy=0.983977
1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata
1146052maxresident)k
0inputs+0outputs (0major+3496364minor)pagefaults 0swaps

Master, MKL ON:

(py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd
../../build/libmxnet.so | grep -i omp
        libomp.so =>
/home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
(0x00007f05ba38f000)
        libiomp5.so =>
/home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
(0x00007f05b09f4000)

INFO:root:Epoch[18] Validation-accuracy=0.982484
INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63 samples/sec
 accuracy=0.999691
INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96 samples/sec
 accuracy=0.999531
INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Train-accuracy=0.999717
INFO:root:Epoch[19] Time cost=1.367
INFO:root:Epoch[19] Validation-accuracy=0.982783
854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata
1154348maxresident)k
0inputs+0outputs (0major+3624361minor)pagefaults 0swaps


MKL OFF:
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL
cmake_options.yml
USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
USE_MKL_IF_AVAILABLE AND (NOT APPLE)
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
build/libmxnet.so |grep -i omp
        libomp.so =>
/home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
(0x00007fb720c54000)

INFO:root:Epoch[18] Validation-accuracy=0.983479
INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02 samples/sec
 accuracy=1.000000
INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Train-accuracy=0.999767
INFO:root:Epoch[19] Time cost=1.259
INFO:root:Epoch[19] Validation-accuracy=0.983181
755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
1147008maxresident)k
0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps

Let me know what you think.

Link to the original PR: https://github.com/apache/incubator-mxnet/pull/12160

Thanks.

On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
<kellen.sunderl...@gmail.com> wrote:
>
> "if you’re linking in two then you’re doing something wrong." Correct,
> that's one thing I believe we've got consensus on.  So let's call that out
> as a bug to be fixed.
>
> Let's move forward with some reproducible numbers and then discuss the pros
> / cons of which particular OMP implementation we should use.
>
> On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
> wrote:
>
> > Hi Chris
> >
> > I would ask you to have a bit of patience and help us with your
> > experience in this matter. Nobody is ignoring anything, I think we are
> > individually gathering feedback and trying to understand the multiple
> > contributions done to this topic including yours, then go step by
> > step, understand what is going on and run experiments and report back
> > to the list or the corresponding github item. It was suggested by
> > Kellen to prepare some containers; this takes effort.
> >
> > Regarding your final comment, most of us also have many other things
> > to do and responsibilities even if our daytime jobs might involve
> > MXNet in some form or another. I think that's part of the privilege
> > and responsibility of working closely with an open source project and
> > the magic of collaboration across organizations. Let's all be patient
> > and take some time to understand and reason about this topic which is
> > not simple. Since we decided to step back and gather more data let's
> > take time and do it properly.
> >
> > Personally I hope to find time to look again into this issue before
> > the end of the week.
> >
> > Thanks.
> >
> > Pedro.
> >
> > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <cjolivie...@apache.org>
> > wrote:
> > >
> > > if you’re linking in two then you’re doing something wrong. You can
> > > see by my email yesterday that only one is linked in. This is also
> > > the case with the mkl version built by the Makefile — only the Intel
> > > OMP library is used (no libgomp).
> > >
> > > That being said, Do you have clear evidence that using Intel OMP is
> > > both problematic and the situation isn’t fixable?  The burden of
> > > proof is on the ones requesting the change — it is not my
> > > responsibility to justify the current state.  There must be something
> > > “terrible” and unfixable to justify a change.  I have seen no proof
> > > of this in all this time.
> > >
> > > On a side note, I mentioned a couple of things in my email yesterday
> > > that still are not being responded to (they were also ignored in the
> > > last incarnation of this “discussion” — I have much experience in this
> > > matter to assume “discussion” is a waste of my time, seeing as I am
> > > not paid to “work on” mxnet like y’all are).
> > >
> > > -C
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > I've also quite often seen two versions of OpenMP linked.  I think
> > > > we can all agree we probably want to avoid linking in two libraries
> > > > that do effectively the same thing.
> > > >
> > > > The performance questions should be fairly straight forward to
> > > > demonstrate right?  Could we just collaborate on a few minimal
> > > > Dockerfiles that show (or don't show) Intel OpenMP performance
> > > > speedups with the workloads Chris is referencing?
> > > >
> > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > > stanislav.tsuk...@gmail.com> wrote:
> > > >
> > > > > Hi, Chris!
> > > > >
> > > > > Stas here - I've gathered that performance data.
> > > > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > > > missing.
> > > > > Be assured, intentional misdirection was never a case.
> > > > >
> > > > > Thanks a lot for being constructive.
> > > > >
> > > > > > Turning Intel OMP on and off (and MKL as well, since it tends
> > > > > > to pull in omp, depending which one is linked in).
> > > > >
> > > > > We never ever considered turning MKL off. We are on the same page
> > > > > here - MKL is crucial for the performance.
> > > > > Why should we? There's a GOMP-linked version of MKL, that we can use.
> > > > >
> > > > > What we did - we measured if using the compiler's default OpenMP
> > > > > implementation instead of the referenced source code distribution
> > > > > of OpenMP makes anything slower.
> > > > > We have found the impact to be hardly measurable.
> > > > > The difference between GOMP and iOMP is <5% on our benchmarks,
> > > > > most of the time less than that.
> > > > >
> > > > > We just suggest to simplify the build of mxnet, by removing the
> > > > > unnecessary dependency.
> > > > >
> > > > > During that we discovered for example the following amazing issue:
> > > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Stas
> > > > >
> > > > > On 18.06.19, 18:24, "Chris Olivier" <cjolivie...@gmail.com> wrote:
> > > > >
> > > > >     I am very reluctant to feed the trolls again, and this will
> > > > >     be the last time I address Pedro or Anton on the subject, but
> > > > >     since I think the numbers being presented are incorrect
> > > > >     (either by the builders not really understanding what they
> > > > >     are building, or possibly intentional misdirection):
> > > > >
> > > > >     Turning Intel OMP on and off (and MKL as well, since it tends
> > > > >     to pull in omp, depending which one is linked in).
> > > > >     There is a HUGE difference.  This is consistent with my
> > > > >     experience before when it was added.
> > > > >
> > > > >
> > > > >     default mnist:
> > > > >
> > > > >     python ../example/image-classification/train_mnist.py
> > > > >     INFO:root:start with arguments Namespace(add_stn=False,
> > > > >     batch_size=64, disp_batches=100, dtype='float32',
> > > > >     gc_threshold=0.5, gc_type='none', gpus=None,
> > > > >     image_shape='1, 28, 28', initializer='default',
> > > > >     kv_store='device', load_epoch=None, loss='', lr=0.05,
> > > > >     lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0,
> > > > >     model_prefix=None, mom=0.9, monitor=0, network='mlp',
> > > > >     num_classes=10, num_epochs=20, num_examples=60000,
> > > > >     num_layers=None, optimizer='sgd', profile_server_suffix='',
> > > > >     profile_worker_suffix='', save_period=1, test_io=0, top_k=0,
> > > > >     warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
> > > > >
> > > > >     INTEL OMP:
> > > > >
> > > > >     ldd libmxnet.so | grep omp
> > > > >             libomp.so =>
> > > > >     /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > > > >     (0x00007f978fde7000)
> > > > >
> > > > >     :root:Epoch[0] Batch [0-100]        Speed: 31548.09 samples/sec
> > > > >     accuracy=0.780012
> > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21 samples/sec
> > > > >     accuracy=0.920469
> > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91 samples/sec
> > > > >     accuracy=0.928281
> > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36 samples/sec
> > > > >     accuracy=0.942813
> > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79 samples/sec
> > > > >     accuracy=0.938750
> > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52 samples/sec
> > > > >     accuracy=0.946562
> > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41 samples/sec
> > > > >     accuracy=0.953281
> > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73 samples/sec
> > > > >     accuracy=0.951562
> > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88 samples/sec
> > > > >     accuracy=0.957500
> > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > > > >     INFO:root:Epoch[0] Time cost=3.806
> > > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > > >     INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21 samples/sec
> > > > >     accuracy=0.968131
> > > > >     INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03 samples/sec
> > > > >     accuracy=0.966250
> > > > >
> > > > >
> > > > >     LIBGOMP:
> > > > >
> > > > >     ldd libmxnet.so | grep omp
> > > > >             libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > >     (0x00007f25c25dd000)
> > > > >
> > > > >     INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01 samples/sec
> > > > >     accuracy=0.782488
> > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32 samples/sec
> > > > >     accuracy=0.907813
> > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00 samples/sec
> > > > >     accuracy=0.927188
> > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45 samples/sec
> > > > >     accuracy=0.937969
> > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95 samples/sec
> > > > >     accuracy=0.942187
> > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58 samples/sec
> > > > >     accuracy=0.950156
> > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16 samples/sec
> > > > >     accuracy=0.947969
> > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07 samples/sec
> > > > >     accuracy=0.953750
> > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58 samples/sec
> > > > >     accuracy=0.953125
> > > > >
> > > > >     That being said, there's other issues beyond speed.  The
> > > > >     DEFAULT build from makefile (not CMake) uses Intel OMP mkl (I
> > > > >     showed before) and mysteriously it has no issues?  This seems
> > > > >     highly suspicious.  All I see is a lot of hand-waving and
> > > > >     conjecture and pointing to StackOverflow posts made by people
> > > > >     who may be of questionable pedigree to begin with.  This
> > > > >     smells of a Pedro-ego-fight rather than one of purely
> > > > >     technical merit.  Also, if one knows how OMP works, they
> > > > >     would be very suspicious of the "intermittent hangs" claim --
> > > > >     that's probably just broken race conditions elsewhere until
> > > > >     proven differently.  It'd tend to freeze on the first use if
> > > > >     something is wrong (try using libgomp after a fork and see),
> > > > >     since worker threads wouldn't be assigned/joined properly.
> > > > >     IntelOMP is faster, but also has other advantages, such as
> > > > >     allowing OMP after a fork.
> > > > >
> > > > >     I actually addressed a lot of issues and asked for
> > > > >     clarification in the original PRs way back when, but they're
> > > > >     all just ignored.
> > > > >
> > > > >     -Chris
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> >
