As already proposed, I think the easiest way to reach a common understanding is to start with a few Docker containers. Pedro, would it be possible for you to wrap your benchmarks into a few containers that reproduce the results you have shown? That way we can avoid possible misunderstandings and also pinpoint the exact parts where people disagree or misunderstand each other.
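As a strawman, the sanity check inside each container could be as simple as parsing `ldd` output for the known OpenMP runtimes, so every benchmark states up front which (and how many) runtimes it actually linked. This is a sketch I am making up here, not existing MXNet tooling; the `build/libmxnet.so` path is an assumption:

```python
import re
import subprocess

# Known OpenMP runtime sonames seen in this thread:
# libomp (LLVM), libiomp5 (Intel, ships with MKL), libgomp (GNU, ships with gcc).
OMP_PATTERN = re.compile(r"\b(libomp|libiomp5|libgomp)\.so")

def omp_runtimes(ldd_output: str):
    """Return the sorted set of OpenMP runtimes named in `ldd` output text."""
    return sorted(set(OMP_PATTERN.findall(ldd_output)))

def check_library(path="build/libmxnet.so"):
    """Run ldd on the shared library and warn if more than one runtime is linked."""
    out = subprocess.run(["ldd", path], capture_output=True, text=True).stdout
    runtimes = omp_runtimes(out)
    if len(runtimes) > 1:
        print("WARNING: multiple OpenMP runtimes linked:", runtimes)
    return runtimes
```

Running something like this right after each container build would make the "two versions linked" condition visible immediately, instead of being discovered later from benchmark anomalies.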
-Marco

Pedro Larroy <pedro.larroy.li...@gmail.com> wrote on Thu, 20 Jun 2019, 21:47:

> I can confirm that we are linking with two versions of omp. I'm
> gaining more clarity on this topic, but I still have questions. The
> facts I have gathered so far are the following:
>
> * #1: We are linking with two versions of omp, Intel's OMP and LLVM
>   OpenMP, when building with MKL enabled.
> * #2: We have 3 different possible OMP versions: Intel OMP (comes
>   with MKL), LLVM OpenMP (3rdparty/openmp), and libgomp (comes with
>   gcc; this one is used in the PR proposed by Anton).
>
> Questions:
>
> * #1 Is it OK to have two versions of OpenMP linked at the same time?
> * #2 Which implementation of OMP gives the best performance? (See the
>   total training time of my measurements for a partial answer.)
> * #3 Should we have a build flag so we can choose the OMP version at
>   build time?
> * #4 Which compiler and build flags did Chris use to get the 10x
>   slowdown?
> * #5 @Stas: is there a script to replicate your benchmarks easily? If
>   so, could you provide a link? I think we would need to reproduce
>   your benchmarks and verify which versions are being linked. It's
>   possible that while compiling with MKL, Intel's OMP was pulled in
>   instead of GNU OpenMP.
> * #6 @Chris: how should we maintain the copy of LLVM's OpenMP? Should
>   we update the subrepo regularly?
>
> My conclusions so far:
>
> * #1 We should avoid linking two versions of OpenMP if possible and
>   allow users to choose one in the build, as we do for BLAS.
> * #2 For performance reasons, and for more control across different
>   compiler versions, it indeed makes sense to keep the LLVM OpenMP
>   version in 3rdparty for now. So unless more data is gathered, it
>   makes sense not to remove it as of now.
> * #3 We should provide build options to choose which OpenMP library
>   is to be used from the three options available, including libgomp.
> * #4 By refining the build we could also enable OpenMP on macOS
>   without additional contortions (it doesn't work as of today):
>   https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> * #5 We should add the different OMP versions to our benchmarks and
>   track their performance, so this data is available for prescribing
>   the best build options and for binary releases.
>
> This is also an interesting related GitHub issue posted in the
> mkl-dnn repository: https://github.com/intel/mkl-dnn/issues/230
>
> I don't observe the order-of-magnitude divergence reported by Chris
> on vanilla Ubuntu 18.04 in samples/s, but the full training does
> finish faster with the OMP from 3rdparty (LLVM OpenMP) than with
> libgomp.
>
> There are also differences in training time when using MKL: it's
> actually a bit slower. I don't know if that is related to OMP.
>
> gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
>
> Anton's branch: g...@github.com:lebeg/incubator-mxnet.git, branch 'omp'
>
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd build/libmxnet.so | grep -i omp
>         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fd99a51d000)
>
> time python train_mnist.py
>
> INFO:root:Epoch[18] Validation-accuracy=0.984176
> INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec  accuracy=1.000000
> INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec  accuracy=0.999531
> INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec  accuracy=1.000000
> INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec  accuracy=0.999375
> INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec  accuracy=0.999531
> INFO:root:Epoch[19] Train-accuracy=0.999717
> INFO:root:Epoch[19] Time cost=1.219
> INFO:root:Epoch[19] Validation-accuracy=0.983977
> 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata 1146052maxresident)k
> 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
>
> Master, MKL ON:
>
> (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd ../../build/libmxnet.so | grep -i omp
>         libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x00007f05ba38f000)
>         libiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x00007f05b09f4000)
>
> INFO:root:Epoch[18] Validation-accuracy=0.982484
> INFO:root:Epoch[19] Batch [0-100]   Speed: 36651.63 samples/sec  accuracy=0.999691
> INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 samples/sec  accuracy=0.999531
> INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [700-800] Speed: 44962.78 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [800-900] Speed: 44945.47 samples/sec  accuracy=0.999375
> INFO:root:Epoch[19] Train-accuracy=0.999717
> INFO:root:Epoch[19] Time cost=1.367
> INFO:root:Epoch[19] Validation-accuracy=0.982783
> 854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata 1154348maxresident)k
> 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
>
> MKL OFF:
>
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL cmake_options.yml
> USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd build/libmxnet.so | grep -i omp
>         libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x00007fb720c54000)
>
> INFO:root:Epoch[18] Validation-accuracy=0.983479
> INFO:root:Epoch[19] Batch [0-100]   Speed: 46784.02 samples/sec  accuracy=1.000000
> INFO:root:Epoch[19] Batch [100-200] Speed: 48824.29 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [200-300] Speed: 49190.31 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46 samples/sec  accuracy=0.999375
> INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55 samples/sec  accuracy=0.999687
> INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56 samples/sec  accuracy=0.999844
> INFO:root:Epoch[19] Train-accuracy=0.999767
> INFO:root:Epoch[19] Time cost=1.259
> INFO:root:Epoch[19] Validation-accuracy=0.983181
> 755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata 1147008maxresident)k
> 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
>
> Let me know what you think.
>
> Link to the original PR:
> https://github.com/apache/incubator-mxnet/pull/12160
>
> Thanks.
>
> On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> <kellen.sunderl...@gmail.com> wrote:
> >
> > "if you're linking in two then you're doing something wrong." Correct,
> > that's one thing I believe we've got consensus on. So let's call that
> > out as a bug to be fixed.
> > Let's move forward with some reproducible numbers and then discuss
> > the pros / cons of which particular OMP implementation we should use.
> >
> > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy
> > <pedro.larroy.li...@gmail.com> wrote:
> >
> > > Hi Chris
> > >
> > > I would ask you to have a bit of patience and help us with your
> > > experience in this matter. Nobody is ignoring anything. I think we
> > > are individually gathering feedback and trying to understand the
> > > multiple contributions made to this topic, including yours; then we
> > > can go step by step, understand what is going on, run experiments,
> > > and report back to the list or the corresponding GitHub item. It
> > > was suggested by Kellen to prepare some containers; this takes
> > > effort.
> > >
> > > Regarding your final comment, most of us also have many other
> > > things to do and responsibilities, even if our daytime jobs might
> > > involve MXNet in some form or another. I think that's part of the
> > > privilege and responsibility of working closely with an open source
> > > project, and the magic of collaboration across organizations. Let's
> > > all be patient and take some time to understand and reason about
> > > this topic, which is not simple. Since we decided to step back and
> > > gather more data, let's take the time and do it properly.
> > >
> > > Personally I hope to find time to look again into this issue before
> > > the end of the week.
> > >
> > > Thanks.
> > >
> > > Pedro.
> > >
> > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier
> > > <cjolivie...@apache.org> wrote:
> > > >
> > > > if you're linking in two then you're doing something wrong. You
> > > > can see from my email yesterday that only one is linked in. This
> > > > is also the case with the mkl version built by the Makefile --
> > > > only the Intel OMP library is used (no libgomp).
> > > > That being said, do you have clear evidence that using Intel OMP
> > > > is both problematic and that the situation isn't fixable? The
> > > > burden of proof is on the ones requesting the change; it is not
> > > > my responsibility to justify the current state. There must be
> > > > something "terrible" and unfixable to justify a change, and I
> > > > have seen no proof of this in all this time.
> > > >
> > > > On a side note, I mentioned a couple of things in my email
> > > > yesterday that still are not being responded to (they were also
> > > > ignored in the last incarnation of this "discussion" -- I have
> > > > enough experience in this matter to assume "discussion" is a
> > > > waste of my time, seeing as I am not paid to "work on" mxnet like
> > > > y'all are).
> > > >
> > > > -C
> > > >
> > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland
> > > > <kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > I've also quite often seen two versions of OpenMP linked. I
> > > > > think we can all agree we probably want to avoid linking in two
> > > > > libraries that do effectively the same thing.
> > > > >
> > > > > The performance questions should be fairly straightforward to
> > > > > demonstrate, right? Could we just collaborate on a few minimal
> > > > > Dockerfiles that show (or don't show) Intel OpenMP performance
> > > > > speedups with the workloads Chris is referencing?
> > > > >
> > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav
> > > > > <stanislav.tsuk...@gmail.com> wrote:
> > > > >
> > > > > > Hi, Chris!
> > > > > >
> > > > > > Stas here - I've gathered that performance data.
> > > > > > Sure thing, I can be wrong, but please elaborate a bit on
> > > > > > what we are missing. Be assured, intentional misdirection was
> > > > > > never the case.
> > > > > >
> > > > > > Thanks a lot for being constructive.
> > > > > > > Turning Intel OMP on and off (and MKL as well, since it
> > > > > > > tends to pull in omp, depending which one is linked in).
> > > > > >
> > > > > > We never considered turning MKL off. We are on the same page
> > > > > > here - MKL is crucial for the performance. Why should we?
> > > > > > There's a GOMP-linked version of MKL that we can use.
> > > > > >
> > > > > > What we did was measure whether using the compiler's default
> > > > > > OpenMP implementation, instead of the referenced source-code
> > > > > > distribution of OpenMP, makes anything slower.
> > > > > > We found the impact to be hardly measurable: the difference
> > > > > > between GOMP and iOMP is <5% on our benchmarks, most of the
> > > > > > time less than that.
> > > > > >
> > > > > > We just suggest simplifying the build of mxnet by removing
> > > > > > the unnecessary dependency.
> > > > > >
> > > > > > During that work we discovered, for example, the following
> > > > > > amazing issue:
> > > > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > > > >
> > > > > > Best Regards
> > > > > >
> > > > > > Stas
> > > > > >
> > > > > > On 18.06.19, 18:24, "Chris Olivier" <cjolivie...@gmail.com> wrote:
> > > > > >
> > > > > > I am very reluctant to feed the trolls again, and this will
> > > > > > be the last time I address Pedro or Anton on the subject, but
> > > > > > since I think the numbers being presented are incorrect
> > > > > > (either by the builders not really understanding what they
> > > > > > are building, or possibly intentional misdirection):
> > > > > >
> > > > > > Turning Intel OMP on and off (and MKL as well, since it tends
> > > > > > to pull in omp, depending which one is linked in), there is a
> > > > > > HUGE difference. This is consistent with my experience
> > > > > > before, when it was added.
> > > > > > default mnist:
> > > > > >
> > > > > > python ../example/image-classification/train_mnist.py
> > > > > > INFO:root:start with arguments Namespace(add_stn=False,
> > > > > > batch_size=64, disp_batches=100, dtype='float32',
> > > > > > gc_threshold=0.5, gc_type='none', gpus=None,
> > > > > > image_shape='1, 28, 28', initializer='default',
> > > > > > kv_store='device', load_epoch=None, loss='', lr=0.05,
> > > > > > lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0,
> > > > > > model_prefix=None, mom=0.9, monitor=0, network='mlp',
> > > > > > num_classes=10, num_epochs=20, num_examples=60000,
> > > > > > num_layers=None, optimizer='sgd', profile_server_suffix='',
> > > > > > profile_worker_suffix='', save_period=1, test_io=0, top_k=0,
> > > > > > warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
> > > > > >
> > > > > > INTEL OMP:
> > > > > >
> > > > > > ldd libmxnet.so | grep omp
> > > > > >         libomp.so => /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so (0x00007f978fde7000)
> > > > > >
> > > > > > INFO:root:Epoch[0] Batch [0-100]   Speed: 31548.09 samples/sec  accuracy=0.780012
> > > > > > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec  accuracy=0.920469
> > > > > > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec  accuracy=0.928281
> > > > > > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec  accuracy=0.942813
> > > > > > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec  accuracy=0.938750
> > > > > > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec  accuracy=0.946562
> > > > > > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec  accuracy=0.953281
> > > > > > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec  accuracy=0.951562
> > > > > > INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec  accuracy=0.957500
> > > > > > INFO:root:Epoch[0] Train-accuracy=0.925423
> > > > > > INFO:root:Epoch[0] Time cost=3.806
> > > > > > INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > > > > INFO:root:Epoch[1] Batch [0-100]   Speed: 24560.21 samples/sec  accuracy=0.968131
> > > > > > INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec  accuracy=0.966250
> > > > > >
> > > > > > LIBGOMP:
> > > > > >
> > > > > > ldd libmxnet.so | grep omp
> > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f25c25dd000)
> > > > > >
> > > > > > INFO:root:Epoch[0] Batch [0-100]   Speed: 1731.01 samples/sec  accuracy=0.782488
> > > > > > INFO:root:Epoch[0] Batch [100-200] Speed: 3551.32 samples/sec  accuracy=0.907813
> > > > > > INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec  accuracy=0.927188
> > > > > > INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec  accuracy=0.937969
> > > > > > INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec  accuracy=0.942187
> > > > > > INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec  accuracy=0.950156
> > > > > > INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec  accuracy=0.947969
> > > > > > INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec  accuracy=0.953750
> > > > > > INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec  accuracy=0.953125
> > > > > >
> > > > > > That being said, there are other issues beyond speed. The
> > > > > > DEFAULT build from the Makefile (not CMake) uses Intel OMP
> > > > > > with MKL (I showed this before) and mysteriously it has no
> > > > > > issues? This seems highly suspicious.
> > > > > > All I see is a lot of hand-waving and conjecture, and
> > > > > > pointing to StackOverflow posts made by people who may be of
> > > > > > questionable pedigree to begin with. This smells of a
> > > > > > Pedro-ego-fight rather than one of purely technical merit.
> > > > > > Also, if one knows how OMP works, they would be very
> > > > > > suspicious of the "intermittent hangs" claim -- that's
> > > > > > probably just broken race conditions elsewhere until proven
> > > > > > differently. It'd tend to freeze on the first use if
> > > > > > something is wrong (try using libgomp after a fork and see),
> > > > > > since worker threads wouldn't be assigned/joined properly.
> > > > > > Intel OMP is faster, but it also has other advantages, such
> > > > > > as allowing OMP after a fork.
> > > > > >
> > > > > > I actually addressed a lot of issues and asked for
> > > > > > clarification in the original PRs way back when, but they
> > > > > > were all just ignored.
> > > > > >
> > > > > > -Chris
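PS: when we compare runs like the ones quoted above, we could also summarize the per-batch throughput mechanically instead of eyeballing it, so the containers can print one comparable number per configuration. A sketch (the log format is taken from the outputs quoted in this thread; the helper name is invented):

```python
import re

# Matches the per-batch throughput lines in the training logs quoted above,
# e.g. "INFO:root:Epoch[0] Batch [0-100] Speed: 31548.09 samples/sec".
SPEED_RE = re.compile(r"Speed:\s*([0-9.]+)\s*samples/sec")

def mean_speed(log_text: str) -> float:
    """Average all 'Speed: N samples/sec' figures found in the log text."""
    speeds = [float(s) for s in SPEED_RE.findall(log_text)]
    if not speeds:
        raise ValueError("no throughput lines found in log text")
    return sum(speeds) / len(speeds)
```

Feeding each configuration's full training log through this would turn the "which OMP is faster" question into a single averaged samples/sec figure per build, alongside the wall-clock totals from `time`.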