kpuatamazon edited a comment on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587066870
 
 
   The current MXNet quantizer is 3-10x slower than intgemm's quantizer across a wide range of matrix sizes.
   
   Experiment on c5.12xlarge:
   ```
   Shape (128, 128)
   0.0000731 seconds for quantize
   0.0000706 seconds for quantize_v2
   0.0000219 seconds for intgemm
   intgemm is 3.2x faster
   Shape (256, 256)
   0.0002116 seconds for quantize
   0.0001778 seconds for quantize_v2
   0.0000258 seconds for intgemm
   intgemm is 6.9x faster
   Shape (512, 512)
   0.0008112 seconds for quantize
   0.0006480 seconds for quantize_v2
   0.0000917 seconds for intgemm
   intgemm is 7.1x faster
   Shape (1024, 1024)
   0.0030176 seconds for quantize
   0.0023387 seconds for quantize_v2
   0.0002542 seconds for intgemm
   intgemm is 9.2x faster
   Shape (2048, 2048)
   0.0118271 seconds for quantize
   0.0090704 seconds for quantize_v2
   0.0008705 seconds for intgemm
   intgemm is 10.4x faster
   Shape (8, 4096)
   0.0001187 seconds for quantize
   0.0001061 seconds for quantize_v2
   0.0000226 seconds for intgemm
   intgemm is 4.7x faster
   ```
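   
   As a quick apples-to-apples sanity check, a sketch like the following compares the two int8 outputs on the same input. This is a hedged example, not part of this PR's tests; it assumes `intgemm_prepare_data(data, maxabs)` returns `data` quantized with multiplier `127 / maxabs`, so it should roughly match `quantize` with symmetric ranges.
   ```
   #!/usr/bin/env python3
   # Sanity-check sketch (assumption: intgemm_prepare_data(data, maxabs)
   # rounds data * 127 / maxabs to int8, so it should roughly match quantize).
   import mxnet as mx
   
   data = mx.nd.random_uniform(shape=(4, 4), low=-1.0, high=1.0)
   one = mx.nd.ones(shape=(1,))
   # contrib.quantize returns (quantized tensor, min_range, max_range).
   q, qmin, qmax = mx.nd.contrib.quantize(data, -one, one, out_type='int8')
   p = mx.nd.contrib.intgemm_prepare_data(data, one)
   # The two may differ by rounding mode; print both to eyeball.
   print(q.asnumpy())
   print(p.asnumpy())
   ```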
   
   The numbers above were generated by `export MXNET_ENGINE_TYPE=NaiveEngine; export OMP_NUM_THREADS=1; taskset --cpu-list 0 ./quant_bench.py`
   where `quant_bench.py` is:
   ```
   #!/usr/bin/env python3
   import mxnet as mx
   import time
   
   def time_procedure(shape, count, proc):
     data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
     mx.nd.waitall()
     begin = time.time()
     for i in range(0, count):
       proc(data)
       mx.nd.waitall()  # block so the asynchronous engine cannot hide the cost
     return (time.time() - begin) / count
   
   shapes = [(128, 128), (256, 256), (512, 512), (1024, 1024), (2048, 2048), (8, 4096)]
   count = 1000
   one = mx.nd.ones(shape=(1,))
   minusone = -one
   
   procedures = {
     "quantize" : (lambda data : mx.nd.contrib.quantize(data, minusone, one)),
     "quantize_v2" : (lambda data : mx.nd.contrib.quantize_v2(data, 
min_calib_range = -1.0, max_calib_range = 1.0)),
     "intgemm" : (lambda data : mx.nd.contrib.intgemm_prepare_data(data, one))
   }
   for s in shapes:
     print("Shape " + str(s))
     stats = {}
     for name, l in procedures.items():
       stats[name] = time_procedure(s, count, l)
       print("{:.7f} seconds for {}".format(stats[name], name))
     best_baseline = min(stats["quantize"], stats["quantize_v2"])
     ratio = best_baseline / stats["intgemm"]
     print("intgemm is {:.1f}x faster".format(ratio))
   ```
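   For reproducibility, the same environment can also be pinned from inside the script instead of the shell. A minimal sketch, assuming only that the engine type is read from the environment when the library initializes, so the variables must be set before `import mxnet`:
   ```
   #!/usr/bin/env python3
   # Pin the engine and threading before importing mxnet, mirroring the
   # `export ...; taskset --cpu-list 0` invocation above.
   import os
   os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
   os.environ["OMP_NUM_THREADS"] = "1"
   os.sched_setaffinity(0, {0})  # Linux-only equivalent of taskset --cpu-list 0
   import mxnet as mx  # must come after the environment is set
   ```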
   As a C++ programmer used to benchmarking with `clock_gettime`, I find benchmarking from Python painful, but I think the point is clear. If anything, the Python overhead pushes the ratios toward 1, since it adds the same constant to both the numerator and the denominator.
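   
   To put a rough number on that overhead, here is a sketch (mine, not part of the benchmark above) that times a near-no-op through the same loop-and-sync pattern; subtracting the resulting constant from both the baseline and the intgemm timings only increases the reported ratios.
   ```
   #!/usr/bin/env python3
   # Sketch: estimate the fixed per-iteration Python + engine dispatch cost
   # by timing a trivially small op with the same loop-and-sync pattern.
   import mxnet as mx
   import time
   
   def per_call_overhead(count=1000):
     tiny = mx.nd.ones(shape=(1,))
     mx.nd.waitall()
     begin = time.time()
     for i in range(0, count):
       (tiny + 0).wait_to_read()  # near-zero work: dispatch + sync only
     return (time.time() - begin) / count
   
   print("{:.7f} seconds overhead per call".format(per_call_overhead()))
   ```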
