Speed. All those `std::string` and `std::unordered_map` objects don't come cheaply.
I compared an integrated fork with a custom operator.

Integrated version (https://github.com/kpuatamazon/incubator-mxnet/tree/intgemm), end-to-end Sockeye performance (based on 1.6.0):
```
real 2m57.962s
user 7m3.986s
sys 0m6.724s
```
Custom operator version (based on 1.7.x, because it had to be for custom operators):
```
real 3m16.879s
user 7m43.727s
sys 0m8.273s
```
Conditions: `unset MXNET_ENGINE_TYPE; export OMP_NUM_THREADS=2; numactl -C 0-7 translate.sh`

Both were compiled with the MKL backend hack for the remaining fp32 operations.
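
To put the two timings side by side, here is a rough sketch that converts the `time` values quoted in this comment to seconds and prints the relative slowdown of the custom operator build. The helper names (`parse_time`, `slowdown_percent`) are made up for illustration and are not part of MXNet or Sockeye; the only inputs are the numbers above.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '2m57.962s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def slowdown_percent(baseline: str, candidate: str) -> float:
    """Relative increase of candidate over baseline, in percent."""
    base = parse_time(baseline)
    return (parse_time(candidate) - base) / base * 100.0


# Timings copied from the two runs quoted in this comment.
integrated = {"real": "2m57.962s", "user": "7m3.986s", "sys": "0m6.724s"}
custom_op = {"real": "3m16.879s", "user": "7m43.727s", "sys": "0m8.273s"}

for key in ("real", "user", "sys"):
    pct = slowdown_percent(integrated[key], custom_op[key])
    print(f"{key}: custom operator is {pct:.1f}% slower")
```

On these numbers the custom operator build comes out roughly 10.6% slower in wall-clock time and about 9.4% slower in user time. Keep in mind the two builds sit on different bases (1.6.0 vs 1.7.x), so the gap may not be entirely due to the operator interface.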