sxjscience commented on a change in pull request #16979: [Bugfix] [Numpy] Add `kAddTo` and kNullOp to Transpose
URL: https://github.com/apache/incubator-mxnet/pull/16979#discussion_r354451534
##########
File path: src/operator/tensor/pseudo2DTranspose_op-inl.cuh
##########
@@ -39,22 +39,29 @@ namespace mxnet {
 namespace op {
 namespace cuda {
-
-template <typename DType, typename CType>
+/*!
+ * \brief The `transpose_pseudo2D` based on chosen vectorized types. It transpose an array of
+ *   shape (k, m, n) to (k, n, m)
+ * \param out Pointer to output memory.
+ * \param inp Pointer to input memory.
+ * \param m First of tensor dimensions.
+ * \param n Second of tensor dimensions.
+ * \param nIterY The number of iterations in the y-dim of the thread to cover all rows. (1-->m)
+ * \param nIterZ The number of iterations in the z-dim of the thread to cover all rows. (1-->m)
+ * \tparam DType Data type
+ * \tparam CType The type to load the data.
+ * \tparam TSR the vectorized ratio.
+ * \tparam is_addto Whether to perform out += transpose(data) or out = transpose(data)
+ */
+template <typename DType, typename CType, int TSR, bool is_addto>
 __global__ void transpose_pseudo2D(DType* out, DType* inp,
                                    const index_t m, const index_t n,
                                    const index_t nIterY, const index_t nIterZ) {
-  const index_t TSR = sizeof(CType)/sizeof(DType);  // TypeSizeRatio

Review comment:
   I moved `TSR` into the template parameters to avoid the `CType tmp[0];` error
   (a zero-length array). The inner loop uses a `switch` over all possible dtype
   sizes, so some instantiations end up with TSR = 0. It would probably be
   better to use `max(sizeof(CType)/sizeof(DType), 1)`. Let me try that.
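
   For illustration only, a minimal host-side sketch of the clamped ratio
   suggested above. The names `type_size_ratio` and `TmpBuffer` are
   hypothetical and not part of this PR; the sketch only shows why plain
   integer division can yield 0 and how clamping to 1 keeps a `tmp[TSR]`
   style array well-formed. It makes no claim about how the kernel itself
   ends up being written.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the PR): ratio of the load type's size to the
// data type's size, clamped so that it never drops to 0.
template <typename DType, typename CType>
constexpr int type_size_ratio() {
  return sizeof(CType) / sizeof(DType) > 0
             ? static_cast<int>(sizeof(CType) / sizeof(DType))
             : 1;
}

// Hypothetical stand-in for a kernel-local buffer declared as `CType tmp[TSR];`.
// With the clamped ratio the array always has at least one element.
template <typename DType, typename CType,
          int TSR = type_size_ratio<DType, CType>()>
struct TmpBuffer {
  CType tmp[TSR];
};

// 8-byte load type over 2-byte data: four elements per load.
static_assert(type_size_ratio<std::int16_t, std::int64_t>() == 4,
              "expected a ratio of 4");

// 1-byte load type over 8-byte data: plain division truncates to 0 ...
static_assert(sizeof(std::int8_t) / sizeof(std::int64_t) == 0,
              "unclamped ratio is 0");
// ... while the clamped helper yields 1, so the buffer stays well-formed.
static_assert(type_size_ratio<std::int64_t, std::int8_t>() == 1,
              "clamped ratio is 1");
```

   The ternary is used instead of `max` so the sketch stays valid under
   C++11, where `std::max` is not `constexpr` and so cannot be used as an
   array bound or template argument.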