DickJC123 commented on a change in pull request #13749: Add NHWC layout support to Pooling (cpu, gpu cuda, gpu cuDNN) URL: https://github.com/apache/incubator-mxnet/pull/13749#discussion_r254117531
##########
File path: src/operator/nn/pool_utils.h
##########

@@ -98,14 +98,16 @@ struct lp_grad<DType, 1> {
 template<typename DType>
 struct lp_grad<DType, 2> {
   static MSHADOW_XINLINE DType Map(const DType grad, const DType in_data, const DType out_data) {
-    return grad * in_data / out_data;
+    // Avoid nan result if both grad and out_data are 0.
+    return (grad == DType(0.0)) ? DType(0.0) : grad * in_data / out_data;
   }
 };
 template<typename DType>
 struct lp_grad<DType, 3> {
   static MSHADOW_XINLINE DType Map(const DType grad, const DType in_data, const DType out_data) {
-    return grad * in_data * in_data / (out_data * out_data);
+    // Avoid nan result if both grad and out_data are 0.
+    return (grad == DType(0.0)) ? DType(0.0) : grad * in_data * in_data / (out_data * out_data);

Review comment:

I've pushed my solution to your comment in commit https://github.com/apache/incubator-mxnet/pull/13749/commits/098bc49f1d288ea9f2b64453aefcc1537ca5254e. The check of grad == 0.0 that you highlighted only succeeded because of quirks in our check_consistency() routine in test_utils.py, which uses the symbol's forward() output as the gradient. Per your suggestion, I now check out_data == 0 instead, as the more general way of quieting the test failures.

The test failures I was seeing most often occurred in float16 lp-3 pooling. As an example, take a pool window of 2 with identical inputs 2^-9 and 2^-9. The forward output for this case is the cube root of (2^-9)^3 + (2^-9)^3. If this is calculated in float16, the 2^-27 terms underflow to 0 and the output is 0. The backward output is then grad * 2^-9 * 2^-9 / (0 * 0) = +inf (or nan if grad is also 0). When performed in float32, no underflow occurs in the forward op, and the +infs are avoided in the backward op. My conclusion: float16 is ill-equipped to perform the forward pooling operation for lp-2 and lp-3.
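The failure mode above can be reproduced outside MXNet with a small numpy stand-in for the lp-3 pooling math (the variable names are illustrative, not the operator's internals): (2^-9)^3 = 2^-27 falls below float16's smallest subnormal (2^-24), so the forward sum underflows to 0 and the backward division blows up.

```python
import numpy as np

# Pool window of 2 with identical inputs 2^-9, as in the example above.
x = np.float16(2.0 ** -9)

# Forward lp-3 in float16: each (2^-9)^3 = 2^-27 term underflows to 0
# (float16's smallest subnormal is 2^-24), so the output is 0.
fwd_fp16 = np.cbrt(x * x * x + x * x * x)

# Backward: grad * in^2 / out^2 divides by zero -> +inf (nan if grad is 0).
grad = np.float16(1.0)
with np.errstate(divide="ignore"):
    bwd_fp16 = grad * x * x / (fwd_fp16 * fwd_fp16)

# The same forward computation in float32 does not underflow.
x32 = np.float32(2.0 ** -9)
fwd_fp32 = np.cbrt(x32 * x32 * x32 + x32 * x32 * x32)
```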
Part of my solution here thus involves promoting the calculation to float32 for the cpu and mxnet cuda implementations of float16-i/o pooling. This is consistent with other operators like float16-i/o Convolution and Batchnorm, which perform their internal calculations in float32. I've run test_pooling_versions() thousands of times in this mode with no failures.
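A minimal numpy sketch of the two mitigations discussed here (the helper names `lp3_pool_forward` and `lp3_grad` are hypothetical, not the PR's actual functions): accumulate the forward sum in float32 even for float16 i/o, and guard the backward against a zero forward output.

```python
import numpy as np

def lp3_pool_forward(window):
    # Hypothetical sketch: promote float16 inputs to float32 for the
    # internal sum of cubes, then cast back to the i/o dtype.
    window = np.asarray(window)
    acc = np.sum(window.astype(np.float32) ** 3)
    return np.cbrt(acc).astype(window.dtype)

def lp3_grad(grad, in_data, out_data):
    # Mirror of the guarded lp_grad<DType, 3>::Map above: a zero forward
    # output would otherwise yield inf/nan from the division.
    if out_data == 0:
        return type(out_data)(0)
    return grad * in_data * in_data / (out_data * out_data)

# The troublesome window from the example: output stays nonzero because
# the 2^-27 terms survive in the float32 accumulator.
w = np.array([2.0 ** -9, 2.0 ** -9], dtype=np.float16)
out = lp3_pool_forward(w)
g = lp3_grad(np.float16(1.0), w[0], out)
```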