[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers

2020-03-03 Thread GitBox
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm 
matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-593952925
 
 
   The quantization operator is now parallelized with OpenMP and supports an arbitrary number of arguments. It is substantially faster than the current MXNet implementation on both 1 and 24 cores (benchmarks below), @pengzhao-intel.  
   
   Maybe this pull request is too big.  Would you be happy with a smaller pull request that takes the faster quantization code and uses it to replace the implementation of the existing quantize and quantize_v2 operators, so it also appears in the quantization flow?  
   
   Then we can carry on with matrix multiply next.  
   
   @ciyongch I'm calling operators manually because we're using Gluon and the quantization workflow doesn't work for us anyway.  But if you're game to have the operators optimized, they'll automatically be part of the workflow too.  
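   For reference, the `_fit` rows in the benchmarks below measure the case where the scale is computed at runtime rather than supplied from calibration. In terms of the operators in this PR, that is roughly the difference between the following two paths (a sketch for illustration, not the benchmark script itself):
   ```python
   import mxnet as mx

   data = mx.nd.random_uniform(shape=(512, 512), low=-1.0, high=1.0)

   # "With calibration": the scale is known ahead of time.
   calibrated_max = mx.nd.ones(shape=(1))
   quantized = mx.nd.contrib.intgemm_prepare_data(data, calibrated_max)

   # "Without calibration" (the _fit variants): compute the maximum on the fly.
   fitted_max = mx.nd.contrib.intgemm_maxabsolute(data)
   quantized_fit = mx.nd.contrib.intgemm_prepare_data(data, fitted_max)
   ```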
   
   ```
   OMP_NUM_THREADS=24 ./quant_bench.py
   Shape (1, 1)
   0.0001304 seconds for quantize
   0.0001076 seconds for quantize_v2
   0.0000310 seconds for intgemm
   0.0001114 seconds for quantize_v2_fit
   0.0000479 seconds for intgemm_fit
   intgemm is 3.5x faster with calibration
   intgemm is 2.3x faster without calibration
   Shape (128, 128)
   0.0001649 seconds for quantize
   0.0001399 seconds for quantize_v2
   0.0000329 seconds for intgemm
   0.0001533 seconds for quantize_v2_fit
   0.0000502 seconds for intgemm_fit
   intgemm is 4.2x faster with calibration
   intgemm is 3.1x faster without calibration
   Shape (256, 256)
   0.0001660 seconds for quantize
   0.0001404 seconds for quantize_v2
   0.0000335 seconds for intgemm
   0.0001599 seconds for quantize_v2_fit
   0.0000505 seconds for intgemm_fit
   intgemm is 4.2x faster with calibration
   intgemm is 3.2x faster without calibration
   Shape (512, 512)
   0.0001691 seconds for quantize
   0.0001434 seconds for quantize_v2
   0.0000342 seconds for intgemm
   0.0001813 seconds for quantize_v2_fit
   0.0000540 seconds for intgemm_fit
   intgemm is 4.2x faster with calibration
   intgemm is 3.4x faster without calibration
   Shape (1024, 1024)
   0.0001920 seconds for quantize
   0.0001538 seconds for quantize_v2
   0.0000511 seconds for intgemm
   0.0002390 seconds for quantize_v2_fit
   0.0000827 seconds for intgemm_fit
   intgemm is 3.0x faster with calibration
   intgemm is 2.9x faster without calibration
   Shape (2048, 2048)
   0.0002364 seconds for quantize
   0.0001989 seconds for quantize_v2
   0.0000875 seconds for intgemm
   0.0004747 seconds for quantize_v2_fit
   0.0001531 seconds for intgemm_fit
   intgemm is 2.3x faster with calibration
   intgemm is 3.1x faster without calibration
   Shape (20971520,)
   0.0011446 seconds for quantize
   0.0010902 seconds for quantize_v2
   0.0008950 seconds for intgemm
   0.0023337 seconds for quantize_v2_fit
   0.0015005 seconds for intgemm_fit
   intgemm is 1.2x faster with calibration
   intgemm is 1.6x faster without calibration
   Shape (8, 4096)
   0.0001636 seconds for quantize
   0.0001392 seconds for quantize_v2
   0.0000364 seconds for intgemm
   0.0001508 seconds for quantize_v2_fit
   0.0000651 seconds for intgemm_fit
   intgemm is 3.8x faster with calibration
   intgemm is 2.3x faster without calibration
   Shape (4096, 8)
   0.0001642 seconds for quantize
   0.0001392 seconds for quantize_v2
   0.0000370 seconds for intgemm
   0.0001515 seconds for quantize_v2_fit
   0.0000654 seconds for intgemm_fit
   intgemm is 3.8x faster with calibration
   intgemm is 2.3x faster without calibration
   ```
   ```
   OMP_NUM_THREADS=1 ./quant_bench.py
   Shape (1, 1)
   0.0000630 seconds for quantize
   0.0000706 seconds for quantize_v2
   0.0000294 seconds for intgemm
   0.0000632 seconds for quantize_v2_fit
   0.0000475 seconds for intgemm_fit
   intgemm is 2.1x faster with calibration
   intgemm is 1.3x faster without calibration
   Shape (128, 128)
   0.0000860 seconds for quantize
   0.0000898 seconds for quantize_v2
   0.0000324 seconds for intgemm
   0.0000996 seconds for quantize_v2_fit
   0.0000464 seconds for intgemm_fit
   intgemm is 2.6x faster with calibration
   intgemm is 2.1x faster without calibration
   Shape (256, 256)
   0.0000976 seconds for quantize
   0.0001028 seconds for quantize_v2
   0.0000339 seconds for intgemm
   0.0001513 seconds for quantize_v2_fit
   0.0000521 seconds for intgemm_fit
   intgemm is 2.9x faster with calibration
   intgemm is 2.9x faster without calibration
   Shape (512, 512)
   0.0001724 seconds for quantize
   0.0001693 seconds for quantize_v2
   0.0000839 seconds for intgemm
   0.0004351 seconds for quantize_v2_fit
   0.0001420 seconds for intgemm_fit
   intgemm is 2.0x faster with calibration
   intgemm is 3.1x faster without calibration
   Shape (1024, 1024)
   0.0003559 seconds for quantize
   0.0003481 seconds for quantize_v2
   0.0002384 seconds for intgemm
   ```
   

[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers

2020-02-24 Thread GitBox
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm 
matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590363395
 
 
   While I agree with the principle that all operators should be parallel (and 
intend to parallelize mine), it's important to find the optimal place for 
parallelism.  
   
   This benchmark shows the right place is over sentences, not inside 
operators, at least with Sockeye on current MXNet.  In Marian, we've turned off 
all OMP threading and just use sentence-level parallelism.  
   
   Setup: float32 models using MKL and 1 sentence at a time (admittedly small, but a common use case in inference).  Nothing special from this pull request.  
   
   One OMP thread, parallelizing across sentences with separate processes:  
   ```bash
   export OMP_NUM_THREADS=1
   time parallel --block 10k --line-buffer --pipe -k python3 -m sockeye.translate --use-cpu -m model --restrict-lexicon model/lexicon --beam-size 5
   ```
   (Process-per-core setup as in https://raw.githubusercontent.com/awslabs/sockeye/master/docs/tutorials/cpu_process_per_core_translation.py)




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers

2020-02-24 Thread GitBox
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm 
matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590265641
 
 
   @ciyongch Here is a usage example.  
   ```python
   import mxnet as mx

   # This is done offline.
   weight = mx.nd.random_uniform(shape=(8, 64), low=-1.0, high=1.0)
   weight_max = mx.nd.contrib.intgemm_maxabsolute(weight)
   weight_prepared = mx.nd.contrib.intgemm_prepare_weight(weight, weight_max)

   data = mx.nd.random_uniform(shape=(1, 64), low=-1.0, high=1.0)
   # The fused multiply quantizes data on the fly.
   product1 = mx.nd.contrib.intgemm_fully_connected(data, weight_prepared, scaling=weight_max / 127.0, num_hidden=8, no_bias=True)

   # One can also pass already-quantized data.
   data_max = mx.nd.contrib.intgemm_maxabsolute(data)
   data_prepared = mx.nd.contrib.intgemm_prepare_data(data, data_max)
   product2 = mx.nd.contrib.intgemm_fully_connected(data_prepared, weight_prepared, scaling=weight_max / 127.0 * data_max / 127.0, num_hidden=8, no_bias=True)

   baseline = mx.nd.FullyConnected(data, weight, num_hidden=8, no_bias=True)
   ```
   The `prepare_data` step is just a quantizer.  The `prepare_weight` step does some element rearrangement into a CPU-dependent format, so the API isn't quite the same as the current one.  
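   To illustrate, the output of `prepare_data` should agree with a plain scale-round-saturate quantizer; a rough sanity check (the rounding and saturation details here are assumptions for illustration):
   ```python
   import mxnet as mx

   data = mx.nd.random_uniform(shape=(4, 64), low=-1.0, high=1.0)
   data_max = mx.nd.contrib.intgemm_maxabsolute(data)

   # prepare_data is just a quantizer: scale to [-127, 127], round, saturate.
   quantized = mx.nd.contrib.intgemm_prepare_data(data, data_max)
   scale = 127.0 / data_max.asscalar()
   reference = mx.nd.clip(mx.nd.round(data * scale), -127.0, 127.0)
   print((quantized.astype('float32') - reference).abs().max())
   ```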
   
   And bias works in the usual way.  Internally intgemm takes a template 
argument for what to do with the int32 output while it's still in a register.  
So one can write out floats or apply activation functions in registers first.  
   
   The multiple-of-8 issue is relatively easy to fix; that was just us being lazy about smaller tiles.  Having the inner dimension be a multiple of 64 is rather useful for alignment purposes; how does MXNet deal with padding?  
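   If padding ends up being the way MXNet handles this, zero-padding the inner dimension is cheap.  A hypothetical sketch (the pad-to-64 logic below is illustrative, not part of this PR):
   ```python
   import mxnet as mx

   weight = mx.nd.random_uniform(shape=(8, 100), low=-1.0, high=1.0)

   # Hypothetical: zero-pad the inner dimension up to the next multiple of 64.
   # The data passed to the multiply would need the same padding.
   inner = weight.shape[1]
   padded_inner = ((inner + 63) // 64) * 64
   if padded_inner != inner:
       pad = mx.nd.zeros(shape=(weight.shape[0], padded_inner - inner))
       weight = mx.nd.concat(weight, pad, dim=1)

   weight_max = mx.nd.contrib.intgemm_maxabsolute(weight)
   weight_prepared = mx.nd.contrib.intgemm_prepare_weight(weight, weight_max)
   ```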




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers

2020-02-24 Thread GitBox
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm 
matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590256391
 
 
   @pengzhao-intel Which OMP do you recommend?  I've been getting bad OMP results with a stock install of Ubuntu 18, but I'm happy to sprinkle a `#pragma omp parallel for` over this embarrassingly parallel task and add a loop at the end to handle sizes that aren't a multiple of the register width.




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers

2020-02-17 Thread GitBox
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm 
matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587073135
 
 
   Also, OMP performance is very bad.  (NB: intgemm is running single-threaded here, partly because OMP handles this problem poorly.)
   
   ```bash
   export MXNET_ENGINE_TYPE=NaiveEngine; export OMP_NUM_THREADS=2; taskset 
--cpu-list 0,1 ./quant_bench.py
   ```
   ```
   [16:18:40] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
   Shape (128, 128)
   0.0008789 seconds for quantize
   0.0008693 seconds for quantize_v2
   0.0000175 seconds for intgemm
   intgemm is 49.7x faster
   Shape (256, 256)
   0.0034812 seconds for quantize
   0.0034044 seconds for quantize_v2
   0.0000212 seconds for intgemm
   intgemm is 161.0x faster
   Shape (512, 512)
   0.0138909 seconds for quantize
   0.0138283 seconds for quantize_v2
   0.0000731 seconds for intgemm
   intgemm is 189.3x faster
   Shape (1024, 1024)
   0.0557616 seconds for quantize
   0.0553598 seconds for quantize_v2
   0.0002330 seconds for intgemm
   intgemm is 237.6x faster
   Shape (2048, 2048)
   0.2225617 seconds for quantize
   0.2196410 seconds for quantize_v2
   0.0008387 seconds for intgemm
   intgemm is 261.9x faster
   Shape (8, 4096)
   0.0017372 seconds for quantize
   0.0017434 seconds for quantize_v2
   0.0000183 seconds for intgemm
   intgemm is 94.8x faster
   ```




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers

2020-02-17 Thread GitBox
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm 
matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587066870
 
 
   The current MXNet quantizer is 3-10x slower than intgemm's on a wide variety 
of matrix sizes.  
   
   Experiment on c5.12xlarge:
   ```
   Shape (128, 128)
   0.0000731 seconds for quantize
   0.0000706 seconds for quantize_v2
   0.0000219 seconds for intgemm
   intgemm is 3.2x faster
   Shape (256, 256)
   0.0002116 seconds for quantize
   0.0001778 seconds for quantize_v2
   0.0000258 seconds for intgemm
   intgemm is 6.9x faster
   Shape (512, 512)
   0.0008112 seconds for quantize
   0.0006480 seconds for quantize_v2
   0.0000917 seconds for intgemm
   intgemm is 7.1x faster
   Shape (1024, 1024)
   0.0030176 seconds for quantize
   0.0023387 seconds for quantize_v2
   0.0002542 seconds for intgemm
   intgemm is 9.2x faster
   Shape (2048, 2048)
   0.0118271 seconds for quantize
   0.0090704 seconds for quantize_v2
   0.0008705 seconds for intgemm
   intgemm is 10.4x faster
   Shape (8, 4096)
   0.0001187 seconds for quantize
   0.0001061 seconds for quantize_v2
   0.0000226 seconds for intgemm
   intgemm is 4.7x faster
   ```
   
   Generated by `export MXNET_ENGINE_TYPE=NaiveEngine; export 
OMP_NUM_THREADS=1; taskset --cpu-list 0 ./quant_bench.py`
   where `quant_bench.py` is:
   ```python
   #!/usr/bin/env python3
   import mxnet as mx
   import time

   def time_procedure(shape, count, proc):
       # Time `proc` over `count` runs on a random tensor of the given shape.
       data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
       mx.nd.waitall()
       begin = time.time()
       for i in range(0, count):
           proc(data)
           mx.nd.waitall()
       return (time.time() - begin) / count

   shapes = [(128, 128), (256, 256), (512, 512), (1024, 1024), (2048, 2048), (8, 4096)]
   count = 1000
   one = mx.nd.ones(shape=(1))
   minusone = -one

   # Each procedure quantizes `data` with the calibrated range [-1, 1].
   procedures = {
       "quantize": (lambda data: mx.nd.contrib.quantize(data, minusone, one)),
       "quantize_v2": (lambda data: mx.nd.contrib.quantize_v2(data, min_calib_range=-1.0, max_calib_range=1.0)),
       "intgemm": (lambda data: mx.nd.contrib.intgemm_prepare_data(data, one)),
   }
   for s in shapes:
       print("Shape " + str(s))
       stats = {}
       for name, l in procedures.items():
           stats[name] = time_procedure(s, count, l)
           print("{:.7f} seconds for {}".format(stats[name], name))
       best_baseline = min(stats["quantize"], stats["quantize_v2"])
       ratio = best_baseline / stats["intgemm"]
       print("intgemm is {:.1f}x faster".format(ratio))
   ```
   As a C++ programmer used to benchmarking with `clock_gettime`, it pains me to benchmark in Python, but I think the point is clear.  If anything, Python overhead handicaps intgemm on small matrices.  
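   If the Python timer itself is a concern, `time.perf_counter()` is monotonic and high-resolution (closer in spirit to `clock_gettime`) and drops straight into the timing loop; a minimal variant:
   ```python
   import time
   import mxnet as mx

   def time_procedure(shape, count, proc):
       data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
       mx.nd.waitall()
       begin = time.perf_counter()  # monotonic, high-resolution timer
       for _ in range(count):
           proc(data)
           mx.nd.waitall()
       return (time.perf_counter() - begin) / count
   ```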




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers

2020-02-17 Thread GitBox
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm 
matrix multiply wrappers 
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587020032
 
 
   I'm the same person as @kpu but work part time as @kpuatamazon. Typically 
you'll hear from my Amazon hat on Mondays, though I plan to work flexibly to 
respond more quickly.  
   
   Overall, I think this is going to come down to an end-to-end benchmark.  
   
   Here are some numbers from a c5.12xlarge (with VNNI).  
   
   Sockeye in fp32 on one core:
   ```
   real    14m21.688s
   user    14m24.608s
   sys     0m1.329s
   ```
   Sockeye in int8 (intgemm) on one core:
   ```
   real    5m2.986s
   user    5m6.203s
   sys     0m1.036s
   ```
   And BLEU was unchanged (it went up 0.1% oddly).  
   
   I'll work on measuring how much time is spent in GEMM and on a version backed by DNNL.  
   
   > Also, the intgemm library seems to be a personal project more than a 
product. I'm not sure how will it be maintained and what's the adoption status 
in other projects. 
   > What's the adoption status of the library? And who will maintain the 
library? Amazon or @kpuatamazon himself?
   
   The intgemm library started as code inside the Marian machine translation project (https://marian-nmt.github.io/) and has been extracted as a standalone library.  Marian runs in production at Microsoft, the European Union, the World Intellectual Property Organization, the US Air Force, and others listed on the site.  I've introduced @pengzhao-intel to our collaborators at Intel, which has funded some of the development.  
   
   I coordinate a 3-year, EUR 3 million project funded by the EU to add client-side machine translation to web browsers (https://browser.mt/, https://www.zdnet.com/article/firefox-to-get-page-translation-feature-like-chrome/).  This project uses Marian.  Since we want to run on people's desktops, intgemm is mostly optimized for pre-VNNI CPUs, though we have VNNI support and further register optimization in a branch.  
   
   > Could you please be more specific what the functionality is?
   > If possible, please share more about how you did the quantization in your 
gluon model. 
   
   I'm calling the quantization operators directly from Gluon instead of doing a graph transformation.  Please see the Sockeye code that uses this pull request: https://github.com/awslabs/sockeye/pull/771 and https://github.com/kpuatamazon/sockeye/tree/heafield-quantize

