[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-593952925

The quantization operator is now parallelized with OpenMP and supports an arbitrary number of arguments. It is substantially faster than the current MXNet implementation on both 1 and 24 cores (see below for benchmarks) @pengzhao-intel .

Maybe this is too big a pull request. Would you be happy with a smaller pull request that takes the faster quantization code and replaces the implementation of the existing quantize and quantize_v2 operators, so it also appears in the quantization flow? Then we can carry on with matrix multiply next.

@ciyongch I'm calling operators manually because we're using gluon and the quantization workflow doesn't work for us anyway. But if you're game to have operators optimized, they'll automatically be in the workflow too.

```
OMP_NUM_THREADS=24 ./quant_bench.py
Shape (1, 1)
0.0001304 seconds for quantize
0.0001076 seconds for quantize_v2
0.0000310 seconds for intgemm
0.0001114 seconds for quantize_v2_fit
0.0000479 seconds for intgemm_fit
intgemm is 3.5x faster with calibration
intgemm is 2.3x faster without calibration
Shape (128, 128)
0.0001649 seconds for quantize
0.0001399 seconds for quantize_v2
0.0000329 seconds for intgemm
0.0001533 seconds for quantize_v2_fit
0.0000502 seconds for intgemm_fit
intgemm is 4.2x faster with calibration
intgemm is 3.1x faster without calibration
Shape (256, 256)
0.0001660 seconds for quantize
0.0001404 seconds for quantize_v2
0.0000335 seconds for intgemm
0.0001599 seconds for quantize_v2_fit
0.0000505 seconds for intgemm_fit
intgemm is 4.2x faster with calibration
intgemm is 3.2x faster without calibration
Shape (512, 512)
0.0001691 seconds for quantize
0.0001434 seconds for quantize_v2
0.0000342 seconds for intgemm
0.0001813 seconds for quantize_v2_fit
0.0000540 seconds for intgemm_fit
intgemm is 4.2x faster with calibration
intgemm is 3.4x faster without calibration
Shape (1024, 1024)
0.0001920 seconds for quantize
0.0001538 seconds for quantize_v2
0.0000511 seconds for intgemm
0.0002390 seconds for quantize_v2_fit
0.0000827 seconds for intgemm_fit
intgemm is 3.0x faster with calibration
intgemm is 2.9x faster without calibration
Shape (2048, 2048)
0.0002364 seconds for quantize
0.0001989 seconds for quantize_v2
0.0000875 seconds for intgemm
0.0004747 seconds for quantize_v2_fit
0.0001531 seconds for intgemm_fit
intgemm is 2.3x faster with calibration
intgemm is 3.1x faster without calibration
Shape (20971520,)
0.0011446 seconds for quantize
0.0010902 seconds for quantize_v2
0.0008950 seconds for intgemm
0.0023337 seconds for quantize_v2_fit
0.0015005 seconds for intgemm_fit
intgemm is 1.2x faster with calibration
intgemm is 1.6x faster without calibration
Shape (8, 4096)
0.0001636 seconds for quantize
0.0001392 seconds for quantize_v2
0.0000364 seconds for intgemm
0.0001508 seconds for quantize_v2_fit
0.0000651 seconds for intgemm_fit
intgemm is 3.8x faster with calibration
intgemm is 2.3x faster without calibration
Shape (4096, 8)
0.0001642 seconds for quantize
0.0001392 seconds for quantize_v2
0.0000370 seconds for intgemm
0.0001515 seconds for quantize_v2_fit
0.0000654 seconds for intgemm_fit
intgemm is 3.8x faster with calibration
intgemm is 2.3x faster without calibration
```

```
OMP_NUM_THREADS=1 ./quant_bench.py
Shape (1, 1)
0.0000630 seconds for quantize
0.0000706 seconds for quantize_v2
0.0000294 seconds for intgemm
0.0000632 seconds for quantize_v2_fit
0.0000475 seconds for intgemm_fit
intgemm is 2.1x faster with calibration
intgemm is 1.3x faster without calibration
Shape (128, 128)
0.0000860 seconds for quantize
0.0000898 seconds for quantize_v2
0.0000324 seconds for intgemm
0.0000996 seconds for quantize_v2_fit
0.0000464 seconds for intgemm_fit
intgemm is 2.6x faster with calibration
intgemm is 2.1x faster without calibration
Shape (256, 256)
0.0000976 seconds for quantize
0.0001028 seconds for quantize_v2
0.0000339 seconds for intgemm
0.0001513 seconds for quantize_v2_fit
0.0000521 seconds for intgemm_fit
intgemm is 2.9x faster with calibration
intgemm is 2.9x faster without calibration
Shape (512, 512)
0.0001724 seconds for quantize
0.0001693 seconds for quantize_v2
0.0000839 seconds for intgemm
0.0004351 seconds for quantize_v2_fit
0.0001420 seconds for intgemm_fit
intgemm is 2.0x faster with calibration
intgemm is 3.1x faster without calibration
Shape (1024, 1024)
0.0003559 seconds for quantize
0.0003481 seconds for quantize_v2
0.0002384 seconds for intgemm
```
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590363395

While I agree with the principle that all operators should be parallel (and I intend to parallelize mine), it's important to find the optimal place for parallelism. This benchmark shows the right place is over sentences, not inside operators, at least with Sockeye on current MXNet. In Marian, we've turned off all OMP threading and just use sentence-level parallelism.

Setup: float32 models using MKL and 1 sentence at a time (admittedly small, but a common use case in inference). Nothing special from this pull request. One OMP thread, parallelized across sentences with separate processes:

```bash
export OMP_NUM_THREADS=1
time parallel --block 10k --line-buffer --pipe -k python3 -m sockeye.translate --use-cpu -m model --restrict-lexicon model/lexicon --beam-size 5
```

https://raw.githubusercontent.com/awslabs/sockeye/master/docs/tutorials/cpu_process_per_core_translation.py

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590265641

@ciyongch Here is a usage example.

```python
import mxnet as mx

# This is done offline.
weight = mx.nd.random_uniform(shape=(8, 64), low=-1.0, high=1.0)
weight_max = mx.nd.contrib.intgemm_maxabsolute(weight)
weight_prepared = mx.nd.contrib.intgemm_prepare_weight(weight, weight_max)

data = mx.nd.random_uniform(shape=(1, 64), low=-1.0, high=1.0)

# Fused multiply quantizes on the fly.
product1 = mx.nd.contrib.intgemm_fully_connected(data, weight_prepared, scaling=weight_max / 127.0, num_hidden=8, no_bias=True)

# One can also have quantized data.
data_max = mx.nd.contrib.intgemm_maxabsolute(data)
data_prepared = mx.nd.contrib.intgemm_prepare_data(data, data_max)
product2 = mx.nd.contrib.intgemm_fully_connected(data_prepared, weight_prepared, scaling=weight_max / 127.0 * data_max / 127.0, num_hidden=8, no_bias=True)

baseline = mx.nd.FullyConnected(data, weight, num_hidden=8, no_bias=True)
```

The `prepare_data` step is just a quantizer. The `prepare_weight` step does some element rearrangement into a CPU-dependent format, so the API isn't quite the same as the current one. And bias works in the usual way.

Internally, intgemm takes a template argument for what to do with the int32 output while it's still in a register, so one can write out floats or apply activation functions in registers first.

The multiple-of-8 issue is relatively easy to fix; that was just us being lazy about smaller tiles. Having the inner dimension be a multiple of 64 is rather useful for alignment purposes; how does MXNet deal with padding?
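For intuition, the arithmetic behind `prepare_data` and the `scaling` argument can be sketched in pure Python (this is not intgemm's implementation; its exact rounding and saturation behavior may differ): each side is quantized to int8 with scale max/127, products accumulate in int32, and multiplying by both scales recovers an approximate float result.

```python
def quantize(xs, max_abs):
    # Map floats in [-max_abs, max_abs] to int8 values in [-127, 127].
    return [max(-127, min(127, round(x * 127.0 / max_abs))) for x in xs]


def int8_dot(a, b, a_max, b_max):
    qa, qb = quantize(a, a_max), quantize(b, b_max)
    acc = sum(x * y for x, y in zip(qa, qb))        # int32 accumulator
    return acc * (a_max / 127.0) * (b_max / 127.0)  # rescale back to float


data = [0.5, -0.25, 0.75, -1.0]
weight = [0.1, 0.9, -0.3, 0.2]
approx = int8_dot(data, weight,
                  max(abs(x) for x in data),
                  max(abs(x) for x in weight))
exact = sum(x * y for x, y in zip(data, weight))
# approx agrees with exact up to quantization error
```

This also shows why `scaling = weight_max / 127.0 * data_max / 127.0` in the second `intgemm_fully_connected` call: both per-tensor scales must be undone after the integer multiply.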
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590256391

@pengzhao-intel Which OMP do you recommend? I've been getting bad OMP results with a stock install of Ubuntu 18, but am happy to sprinkle a `#pragma omp parallel for` on this embarrassingly parallel task and add a loop at the end to deal with non-multiples of the register size.
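The shape of that kernel, a parallel bulk loop over whole registers plus a scalar tail, can be sketched in Python (illustrative only: the real code is C++ with SIMD intrinsics and the OMP pragma on the outer loop, and the width 16 and function name here are made up for the sketch):

```python
REGISTER_WIDTH = 16  # stand-in for the SIMD register width in int8 lanes


def quantize_with_tail(xs, max_abs):
    scale = 127.0 / max_abs
    out = [0] * len(xs)
    bulk = len(xs) - len(xs) % REGISTER_WIDTH
    # Bulk loop over whole registers: in C++ each iteration is one SIMD
    # operation, and this loop carries `#pragma omp parallel for`.
    for start in range(0, bulk, REGISTER_WIDTH):
        for i in range(start, start + REGISTER_WIDTH):
            out[i] = max(-127, min(127, round(xs[i] * scale)))
    # Tail loop: leftover elements when the length is not a multiple of
    # the register width.
    for i in range(bulk, len(xs)):
        out[i] = max(-127, min(127, round(xs[i] * scale)))
    return out
```

The task is embarrassingly parallel because every output element depends only on the corresponding input element and the fixed scale.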
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587073135

Also, OMP performance is very bad. (NB: intgemm is running single-threaded here, partly because OMP is bad at this problem.)

```bash
export MXNET_ENGINE_TYPE=NaiveEngine; export OMP_NUM_THREADS=2; taskset --cpu-list 0,1 ./quant_bench.py
```

```
[16:18:40] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
Shape (128, 128)
0.0008789 seconds for quantize
0.0008693 seconds for quantize_v2
0.0000175 seconds for intgemm
intgemm is 49.7x faster
Shape (256, 256)
0.0034812 seconds for quantize
0.0034044 seconds for quantize_v2
0.0000212 seconds for intgemm
intgemm is 161.0x faster
Shape (512, 512)
0.0138909 seconds for quantize
0.0138283 seconds for quantize_v2
0.0000731 seconds for intgemm
intgemm is 189.3x faster
Shape (1024, 1024)
0.0557616 seconds for quantize
0.0553598 seconds for quantize_v2
0.0002330 seconds for intgemm
intgemm is 237.6x faster
Shape (2048, 2048)
0.2225617 seconds for quantize
0.2196410 seconds for quantize_v2
0.0008387 seconds for intgemm
intgemm is 261.9x faster
Shape (8, 4096)
0.0017372 seconds for quantize
0.0017434 seconds for quantize_v2
0.0000183 seconds for intgemm
intgemm is 94.8x faster
```
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587066870

The current MXNet quantizer is 3-10x slower than intgemm's on a wide variety of matrix sizes. Experiment on a c5.12xlarge:

```
Shape (128, 128)
0.0000731 seconds for quantize
0.0000706 seconds for quantize_v2
0.0000219 seconds for intgemm
intgemm is 3.2x faster
Shape (256, 256)
0.0002116 seconds for quantize
0.0001778 seconds for quantize_v2
0.0000258 seconds for intgemm
intgemm is 6.9x faster
Shape (512, 512)
0.0008112 seconds for quantize
0.0006480 seconds for quantize_v2
0.0000917 seconds for intgemm
intgemm is 7.1x faster
Shape (1024, 1024)
0.0030176 seconds for quantize
0.0023387 seconds for quantize_v2
0.0002542 seconds for intgemm
intgemm is 9.2x faster
Shape (2048, 2048)
0.0118271 seconds for quantize
0.0090704 seconds for quantize_v2
0.0008705 seconds for intgemm
intgemm is 10.4x faster
Shape (8, 4096)
0.0001187 seconds for quantize
0.0001061 seconds for quantize_v2
0.0000226 seconds for intgemm
intgemm is 4.7x faster
```

Generated by `export MXNET_ENGINE_TYPE=NaiveEngine; export OMP_NUM_THREADS=1; taskset --cpu-list 0 ./quant_bench.py` where `quant_bench.py` is:

```
#!/usr/bin/env python3
import mxnet as mx
import time

def time_procedure(shape, count, proc):
    data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
    mx.nd.waitall()
    begin = time.time()
    for i in range(0, count):
        proc(data)
    mx.nd.waitall()
    return (time.time() - begin) / count

shapes = [(128, 128), (256, 256), (512, 512), (1024, 1024), (2048, 2048), (8, 4096)]
count = 1000

one = mx.nd.ones(shape=(1))
minusone = -one
procedures = {
    "quantize": (lambda data: mx.nd.contrib.quantize(data, minusone, one)),
    "quantize_v2": (lambda data: mx.nd.contrib.quantize_v2(data, min_calib_range=-1.0, max_calib_range=1.0)),
    "intgemm": (lambda data: mx.nd.contrib.intgemm_prepare_data(data, one)),
}
for s in shapes:
    print("Shape " + str(s))
    stats = {}
    for name, l in procedures.items():
        stats[name] = time_procedure(s, count, l)
        print("{:.7f} seconds for {}".format(stats[name], name))
    best_baseline = min(stats["quantize"], stats["quantize_v2"])
    ratio = best_baseline / stats["intgemm"]
    print("intgemm is {:.1f}x faster".format(ratio))
```

As a C++ programmer used to benchmarking with `clock_gettime`, it pains me to do benchmarks in Python, but I think the point is clear. If anything I'm handicapped on small matrices due to Python overhead.
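That per-call Python overhead can be estimated directly (a sketch, not part of `quant_bench.py`): timing an empty callable with the same averaging loop gives a floor that is baked into every measurement above, and which matters most for tiny shapes.

```python
import time


def time_procedure(count, proc):
    # Same measurement style as the benchmark: average wall time per call.
    begin = time.perf_counter()
    for _ in range(count):
        proc()
    return (time.perf_counter() - begin) / count


# Per-call cost of doing nothing: pure Python dispatch overhead.
overhead = time_procedure(100000, lambda: None)
```

Subtracting this floor from the small-shape timings would make intgemm's advantage look larger, not smaller, which is the "handicapped on small matrices" point.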
kpuatamazon commented on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-587020032

I'm the same person as @kpu but work part time as @kpuatamazon. Typically you'll hear from my Amazon hat on Mondays, though I plan to work flexibly to respond more quickly.

Overall, I think this is going to come down to an end-to-end benchmark. Here are some numbers from a c5.12xlarge (with VNNI). Sockeye in fp32 on one core:

```
real    14m21.688s
user    14m24.608s
sys     0m1.329s
```

Sockeye in int8 (intgemm) on one core:

```
real    5m2.986s
user    5m6.203s
sys     0m1.036s
```

And BLEU was unchanged (it went up 0.1% oddly). I'll work on how much time is spent in GEMM and a version backed with DNNL.

> Also, the intgemm library seems to be a personal project more than a product. I'm not sure how will it be maintained and what's the adoption status in other projects.
> What's the adoption status of the library? And who will maintain the library? Amazon or @kpuatamazon himself?

The intgemm library started as code inside the Marian machine translation project https://marian-nmt.github.io/ . It's been extracted as a standalone library. Marian is run in production at Microsoft, the European Union, the World Intellectual Property Organization, the US Air Force, and others listed on the site. I've introduced @pengzhao-intel to our collaborators at Intel, which has funded some of the development.

I coordinate a 3-year EUR 3 million project funded by the EU to add client-side machine translation to web browsers https://browser.mt/ https://www.zdnet.com/article/firefox-to-get-page-translation-feature-like-chrome/ . This project is using Marian. Since we want to run on people's desktops, intgemm is mostly optimized for pre-VNNI CPUs, though we have VNNI support and further register optimization in a branch.

> Could you please be more specific what the functionality is?
> If possible, please share more about how you did the quantization in your gluon model.

I'm calling the quantization operators directly from gluon instead of doing a graph transformation. Please see the Sockeye code that uses this pull request: https://github.com/awslabs/sockeye/pull/771 and https://github.com/kpuatamazon/sockeye/tree/heafield-quantize