apeforest commented on issue #17331: [mxnet 2.0] [item 2.4] Turning on large tensor support by default
URL: https://github.com/apache/incubator-mxnet/issues/17331#issuecomment-589829033

Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation in the operator `broadcast_axis` (from 138ms to 177ms) and in `MXNDArraySyncCopyToCPU` (from 592ms to 679ms). Running the operator-level profiler, we could also isolate the performance drop in `broadcast_axis` alone (average forward time roughly 2.78ms vs. 6.32ms):

w/o USE_INT64_TENSOR_SIZE flag:
```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
```

w/ USE_INT64_TENSOR_SIZE flag:
```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
```

Also, looking into the implementation of the `broadcast_axis` operator, it performs many modulo and multiplication operations on the indices. The next step will be to find an optimal implementation of `broadcast_axis` that reduces the ALU operations on indices in the kernel.