apeforest commented on issue #17331: [mxnet 2.0] [item 2.4] Turning on large 
tensor support by default
URL: 
https://github.com/apache/incubator-mxnet/issues/17331#issuecomment-589829033
 
 
   Thanks to @JonTanS for running the profiler, we have pinpointed the 
performance degradation to the operators `broadcast_axis` (from 138 ms to 177 ms) and 
`MXNDArraySyncCopyToCPU` (from 592 ms to 679 ms). 
   
   Running the operator-level profiler, we could also isolate the performance 
drop to `broadcast_axis` alone.
   
   w/o USE_INT64_TENSOR_SIZE flag:
   ```
   [{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2),
   'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168,
   'avg_time_forward_broadcast_axis': 2.7753}]}]
   ```
   
   w/ USE_INT64_TENSOR_SIZE flag:
   ```
   [{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2),
   'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168,
   'avg_time_forward_broadcast_axis': 6.3178}]}]
   ```
   
   Also, as I look into the implementation of the `broadcast_axis` operator, it 
involves many modulo and multiplication operations on the indices, which likely 
explains the roughly 2.3x slowdown (2.78 ms to 6.32 ms per forward pass) once those 
indices become 64-bit. The next step will be to find a more optimal implementation 
of `broadcast_axis` that reduces the integer ALU work on indices in the kernel.
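   To illustrate where that index arithmetic comes from, here is a minimal 
pure-Python sketch of the divide/modulo/multiply chain a typical broadcast kernel 
performs per output element (this is an illustrative reference, not MXNet's actual 
kernel; the function name and layout are assumptions). Each output index is 
unraveled into per-axis coordinates with one division and one modulo per axis, 
then mapped back to an input offset via strides that are zero on broadcast axes. 
When the index type widens from 32-bit to 64-bit, every one of these integer 
operations gets more expensive on the GPU, which is consistent with the slowdown 
measured above.

   ```python
   def broadcast_axis_ref(data, in_shape, out_shape):
       """Reference broadcast: `data` is a flat row-major list with shape
       `in_shape`; returns the flat output with shape `out_shape`."""
       ndim = len(in_shape)
       # Row-major strides of the input, with stride 0 on broadcast axes.
       strides = [0] * ndim
       s = 1
       for i in range(ndim - 1, -1, -1):
           strides[i] = s if in_shape[i] != 1 else 0
           s *= in_shape[i]
       out_size = 1
       for d in out_shape:
           out_size *= d
       out = [0] * out_size
       for out_idx in range(out_size):
           # Unravel out_idx into per-axis coordinates: one divide and one
           # modulo per axis -- the integer ALU work mentioned above.
           rem, in_idx = out_idx, 0
           for i in range(ndim - 1, -1, -1):
               coord = rem % out_shape[i]
               rem //= out_shape[i]
               in_idx += coord * strides[i]
           out[out_idx] = data[in_idx]
       return out
   ```

   For example, broadcasting data of shape (1, 2, 1) to (3, 2, 2) repeats each 
input element along axes 0 and 2. An optimized kernel would try to hoist or 
strength-reduce this per-element arithmetic, or keep it in 32-bit integers 
whenever the tensor is small enough.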

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
