Hi Chris,

It's not clear why you think the numbers are wrong. Stas seems to have put
a lot of effort into running the benchmarks and comprehensively writing down
the methodology and results. Of course, no one is above making mistakes.
So it would be great if you could shed some light on what you find
objectionable and perhaps suggest further experiments or
improvements. You could also try to rerun the benchmarks yourself and
reach out if any steps are missing or unclear.
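
If it helps with a rerun, here is a minimal sketch (Python) for comparing two
training logs on equal terms. It assumes only the "Speed: ... samples/sec"
format that train_mnist.py prints in the output quoted below; the function
names and the idea of passing log files on the command line are my own
illustration, not part of MXNet or of Stas's methodology.

```python
import re
import statistics
import sys

# Matches the throughput field in lines such as:
#   INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01 samples/sec
SPEED_RE = re.compile(r"Speed:\s*([0-9]+(?:\.[0-9]+)?)\s*samples/sec")


def speeds(log_text):
    """Return every samples/sec value found in the log text, in order."""
    return [float(value) for value in SPEED_RE.findall(log_text)]


def summarize(log_text):
    """Summarize the throughput values of one training log."""
    values = speeds(log_text)
    if not values:
        raise ValueError("no 'Speed: ... samples/sec' entries found")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Hypothetical usage: pass one saved log file per build on the command line.
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, summarize(handle.read()))
```

Reporting the median and the min/max alongside the mean matters here, because
the per-batch speeds in a single run vary a lot, and a single epoch is a small
sample to draw conclusions from.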

I work with Stas. He's a very talented engineer and his integrity is
above reproach, so you don't need to fear any "political" motivations
behind his effort. I feel this level of antagonism doesn't help the
community at all. Perhaps we could keep the conversation focused on the
methodology and the results so we can bring this story to a conclusion.



On Tue., 18 Jun. 2019, 6:24 pm Chris Olivier, <cjolivie...@gmail.com> wrote:

> I am very reluctant to feed the trolls again, and this will be the last
> time I address Pedro or Anton on the subject, but since I think the numbers
> being presented are incorrect (either by the builders not really
> understanding what they are building, or possibly intentional
> misdirection):
> Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> omp, depending which one is linked in).
> There is a HUGE difference.  This is consistent with my experience before
> when it was added.
> default mnist:
> python ../example/image-classification/train_mnist.py
> INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
> disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> gpus=None, image_shape='1, 28, 28', initializer='default',
> kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> monitor=0, network='mlp', num_classes=10, num_epochs=20,
> num_examples=60000, num_layers=None, optimizer='sgd',
> profile_server_suffix='', profile_worker_suffix='', save_period=1,
> test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
> ldd libmxnet.so | grep omp
>         libomp.so =>
> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> (0x00007f978fde7000)
> INFO:root:Epoch[0] Batch [0-100]        Speed: 31548.09 samples/sec
> accuracy=0.780012
> INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21 samples/sec
> accuracy=0.920469
> INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91 samples/sec
> accuracy=0.928281
> INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36 samples/sec
> accuracy=0.942813
> INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79 samples/sec
> accuracy=0.938750
> INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52 samples/sec
> accuracy=0.946562
> INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41 samples/sec
> accuracy=0.953281
> INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73 samples/sec
> accuracy=0.951562
> INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88 samples/sec
> accuracy=0.957500
> INFO:root:Epoch[0] Train-accuracy=0.925423
> INFO:root:Epoch[0] Time cost=3.806
> INFO:root:Epoch[0] Validation-accuracy=0.962580
> INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21 samples/sec
> accuracy=0.968131
> INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03 samples/sec
> accuracy=0.966250
> ldd libmxnet.so | grep omp
>         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x00007f25c25dd000)
> INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01 samples/sec
>  accuracy=0.782488
> INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32 samples/sec
>  accuracy=0.907813
> INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00 samples/sec
>  accuracy=0.927188
> INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45 samples/sec
>  accuracy=0.937969
> INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95 samples/sec
>  accuracy=0.942187
> INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58 samples/sec
>  accuracy=0.950156
> INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16 samples/sec
>  accuracy=0.947969
> INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07 samples/sec
>  accuracy=0.953750
> INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58 samples/sec
> accuracy=0.953125
> That being said, there are other issues beyond speed.  The DEFAULT build from
> makefile (not CMake) uses Intel OMP mkl (I showed before) and mysteriously
> it has no issues?  This seems highly suspicious.  All I see is a lot of
> hand-waving and conjecture and pointing to StackOverflow posts made by
> people who may be of questionable pedigree to begin with.  This smells of a
> Pedro-ego-fight rather than one of purely technical merit.  Also, if one
> knows how OMP works, they would be very suspicious of the "intermittent
> hangs" claim -- that's probably just broken race conditions elsewhere until
> proven differently.  It'd tend to freeze on the first use if something is
> wrong (try using libgomp after a fork and see), since worker threads
> wouldn't be assigned/joined properly.  IntelOMP is faster, but also has
> other advantages, such as allowing OMP after a fork.
> I actually addressed a lot of issues and asked for clarification in the
> original PRs way back when, but they were all just ignored.
> -Chris
