kpuatamazon edited a comment on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587066870

The current MXNet quantizer is 3-10x slower than intgemm's on a wide variety of matrix sizes.

Experiment on c5.12xlarge:
```
Shape (128, 128)
0.0000731 seconds for quantize
0.0000706 seconds for quantize_v2
0.0000219 seconds for intgemm
intgemm is 3.2x faster
Shape (256, 256)
0.0002116 seconds for quantize
0.0001778 seconds for quantize_v2
0.0000258 seconds for intgemm
intgemm is 6.9x faster
Shape (512, 512)
0.0008112 seconds for quantize
0.0006480 seconds for quantize_v2
0.0000917 seconds for intgemm
intgemm is 7.1x faster
Shape (1024, 1024)
0.0030176 seconds for quantize
0.0023387 seconds for quantize_v2
0.0002542 seconds for intgemm
intgemm is 9.2x faster
Shape (2048, 2048)
0.0118271 seconds for quantize
0.0090704 seconds for quantize_v2
0.0008705 seconds for intgemm
intgemm is 10.4x faster
Shape (8, 4096)
0.0001187 seconds for quantize
0.0001061 seconds for quantize_v2
0.0000226 seconds for intgemm
intgemm is 4.7x faster
```
Generated by `export MXNET_ENGINE_TYPE=NaiveEngine; export OMP_NUM_THREADS=1; taskset --cpu-list 0 ./quant_bench.py` where `quant_bench.py` is:
```
#!/usr/bin/env python3
import mxnet as mx
import time

def time_procedure(shape, count, proc):
  # Use the shape argument rather than the global loop variable.
  data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
  mx.nd.waitall()
  begin = time.time()
  for i in range(0, count):
    proc(data)
    mx.nd.waitall()
  # Average wall-clock seconds per call.
  return (time.time() - begin) / count

# (8, 4096) is included to match the results above.
shapes = [(128, 128), (256, 256), (512, 512), (1024, 1024), (2048, 2048), (8, 4096)]
count = 1000
one = mx.nd.ones(shape=(1))
minusone = -one

procedures = {
  "quantize" : (lambda data : mx.nd.contrib.quantize(data, minusone, one)),
  "quantize_v2" : (lambda data : mx.nd.contrib.quantize_v2(data, min_calib_range=-1.0, max_calib_range=1.0)),
  "intgemm" : (lambda data : mx.nd.contrib.intgemm_prepare_data(data, one))
}

for s in shapes:
  print("Shape " + str(s))
  stats = {}
  for name, l in procedures.items():
    stats[name] = time_procedure(s, count, l)
    print("{:.7f} seconds for {}".format(stats[name], name))
  # Compare intgemm against the faster of the two existing quantizers.
  best_baseline = min(stats["quantize"], stats["quantize_v2"])
  ratio = best_baseline / stats["intgemm"]
  print("intgemm is {:.1f}x faster".format(ratio))
```
As a C++ programmer used to benchmarking with `clock_gettime`, using Python to do benchmarks pains me, but I think the point is clear. If anything, the Python overhead pushes the ratios towards 1, because it adds the same amount to both the numerator and the denominator.
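To make the overhead argument concrete, here is a minimal sketch that takes the (256, 256) timings from the results above and subtracts an assumed fixed per-call Python overhead from both sides. The overhead values are hypothetical and were not measured, and `ratio_without_overhead` is a name made up for this illustration.

```python
def ratio_without_overhead(baseline, candidate, overhead):
    """Speedup after removing an assumed fixed per-call overhead from both timings."""
    if not 0.0 <= overhead < candidate:
        raise ValueError("overhead must be non-negative and smaller than both timings")
    return (baseline - overhead) / (candidate - overhead)

# Measured seconds per call for shape (256, 256), copied from the results above.
quantize_v2_seconds = 0.0001778
intgemm_seconds = 0.0000258

# Hypothetical per-call overheads in seconds; these are assumptions, not measurements.
for overhead in (0.0, 5e-6, 10e-6, 20e-6):
    ratio = ratio_without_overhead(quantize_v2_seconds, intgemm_seconds, overhead)
    print("assumed overhead {:.0e} s: intgemm is {:.1f}x faster".format(overhead, ratio))
```

With zero assumed overhead this reproduces the measured ratio. Any positive overhead yields a larger ratio, which is why the measured speedups are, if anything, conservative.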