A related problem is excessive code generation.  Take `np.delete` for example.  

https://github.com/apache/incubator-mxnet/blob/16e2b15f6e334ca88f29b9c14e55547df2c136fc/src/operator/numpy/np_delete_op-inl.h#L337-L355

That's:
- MSHADOW_TYPE_SWITCH: 8 types on CPU and 7 types on GPU.
- MXNET_NDIM_SWITCH cases 1 through 5.
- MSHADOW_TYPE_SWITCH: 8 types on CPU and 7 types on GPU.
- MXNET_ASSIGN_REQ_SWITCH: 2 cases.

That's 8*5*8*2 = 640 ways on CPU and 7*5*7*2 = 490 ways on GPU. 

The operation acts on a single axis, so the shape reduces to three sizes: the 
outer loop count (i.e. the product of dimensions before the axis), the size of 
the axis in question, and the inner block size (i.e. the product of dimensions 
after the axis).  After this simplification, there's no ndim dispatch.  The 
kernel supports arbitrary dimensionality, and the number of compiled cases 
drops by a factor of 5, to 8*8*2 = 128 on CPU and 7*7*2 = 98 on GPU.  
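
A minimal sketch of that reduction, in plain C++ rather than MXNet's kernel framework. The names `AxisView`, `CollapseAroundAxis` and `DeleteAlongAxis` are hypothetical, the deletion set is passed as a simple mask, and the two type parameters are treated as input and output element types, which is an assumption about what the two type switches dispatch on:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not MXNet API): an N-d shape collapsed around one axis.
struct AxisView {
  std::size_t outer;  // product of dimensions before the axis
  std::size_t axis;   // size of the axis itself
  std::size_t inner;  // product of dimensions after the axis
};

inline AxisView CollapseAroundAxis(const std::vector<std::size_t>& shape,
                                   std::size_t axis) {
  assert(axis < shape.size());
  AxisView v{1, shape[axis], 1};
  for (std::size_t i = 0; i < axis; ++i) v.outer *= shape[i];
  for (std::size_t i = axis + 1; i < shape.size(); ++i) v.inner *= shape[i];
  return v;
}

// Templated on two element types and the write mode only; there is no ndim
// parameter, so any dimensionality shares the same instantiation.
// is_deleted[j] == true means index j along the axis is dropped.
template <typename DType, typename OType, bool kAddTo>
void DeleteAlongAxis(const DType* in, OType* out, const AxisView& v,
                     const std::vector<bool>& is_deleted) {
  assert(is_deleted.size() == v.axis);
  std::size_t kept = 0;  // size of the axis in the output
  for (bool d : is_deleted) {
    if (!d) ++kept;
  }
  for (std::size_t o = 0; o < v.outer; ++o) {
    std::size_t k = 0;  // output position along the axis
    for (std::size_t j = 0; j < v.axis; ++j) {
      if (is_deleted[j]) continue;
      const DType* src = in + (o * v.axis + j) * v.inner;
      OType* dst = out + (o * kept + k) * v.inner;
      for (std::size_t i = 0; i < v.inner; ++i) {
        if (kAddTo) {
          dst[i] += static_cast<OType>(src[i]);
        } else {
          dst[i] = static_cast<OType>(src[i]);
        }
      }
      ++k;
    }
  }
}
```

For a shape of (2, 3, 2) and axis 1, `CollapseAroundAxis` gives outer = 2, axis = 3, inner = 2, and the same kernel instantiation would serve a 7-dimensional input just as well.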

By the way, in the common case where the types are the same and output is 
kWriteTo, a loop over memory copies is much faster.  
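
A rough sketch of what such a copy loop could look like, assuming "the types are the same" means one element type shared by input and output, and that the output is overwritten (the kWriteTo case). `DeleteAlongAxisMemcpy` is a hypothetical name, and the sketch makes no performance measurement of its own:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical fast path (not MXNet API): one element type, overwrite only.
// Each run of consecutive kept indices along the axis is contiguous in
// memory, so it can be moved with a single memcpy of run_length * inner
// elements.
template <typename DType>
void DeleteAlongAxisMemcpy(const DType* in, DType* out, std::size_t outer,
                           std::size_t axis_size, std::size_t inner,
                           const std::vector<bool>& is_deleted) {
  static_assert(std::is_trivially_copyable<DType>::value,
                "memcpy requires a trivially copyable element type");
  assert(is_deleted.size() == axis_size);
  for (std::size_t o = 0; o < outer; ++o) {
    const DType* src_row = in + o * axis_size * inner;
    std::size_t j = 0;
    while (j < axis_size) {
      if (is_deleted[j]) {
        ++j;
        continue;
      }
      // Extend the run while indices are kept.
      const std::size_t run_begin = j;
      while (j < axis_size && !is_deleted[j]) ++j;
      const std::size_t count = (j - run_begin) * inner;
      if (count > 0) {
        std::memcpy(out, src_row + run_begin * inner, count * sizeof(DType));
        out += count;
      }
    }
  }
}
```

When only a few indices are deleted, each outer iteration turns into a handful of large copies instead of a per-element loop.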
