DickJC123 opened a new issue #18731: URL: https://github.com/apache/incubator-mxnet/issues/18731
## Description test_default_float_dtype tests (among other things) the true_divide op's ability to divide an int array by an int scalar to produce either a float32 or float64 output array based on the setting of is_np_default_dtype. However, the test overlooks a backend error that is created during the test, which can surface during a test teardown of a future test. This is not only confusing, but also points to issues with the true_divide op itself. The simplest way to expose this issue is with: ``` MXNET_ENGINE_TYPE=NaiveEngine pytest --verbose -s tests/python/unittest/test_numpy_default_dtype.py::test_default_float_dtype ``` Alternatively, one can add the line `mxnet.nd.waitall()` to the end of test_default_float_dtype, at which point the MXNET_ENGINE_TYPE setting is not needed. In both cases, the error is: ``` ... E mxnet.base.MXNetError: Traceback (most recent call last): E File "include/mxnet/././tensor_blob.h", line 256 E MXNetError: Check failed: mshadow: :DataType<DType>::kFlag == type_flag_: TBlob.get_with_shape: data type do not match specified type.Expected: double v.s. given float python/mxnet/_ffi/_ctypes/function.py:115: MXNetError ``` So basically the test incorrectly feels that the output is of the expected type, when in fact the backend engine's execution of the function fails (with a type mismatch). I ran into this problem when my in-development PR https://github.com/apache/incubator-mxnet/pull/18694 began to include a mxnet.nd.waitall() call between unittests for enhanced test isolation. In order to make progress with my PR, I investigated the issues with true_divide, and have a fixing commit https://github.com/apache/incubator-mxnet/pull/18694/commits/65f16d8638747aa0e2a4bfe911ddeef675c56b9c as part of that PR. I'd appreciate a review of these changes, or an alternate fix proposed. Thanks @haojin2 , @reminisce or @JiangZhaoh . Basically, the fix addresses two issues: First, the output dtype of true_divide is a based on a parameter set by the main python thread and accessed within C++ by mxnet::common::GetDefaultDtype(). While this can be either float32 or float64, the backend implementation has a hard-coded cast of the output tensor pointer to float: https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/src/operator/numpy/np_true_divide-inl.h#L66-L72 Also, before performing the above execution, the worker thread performs the check: https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/src/operator/numpy/np_true_divide-inl.h#L62-L65 I think it's totally appropriate for the InferType function to call mxnet::common::GetDefaultDtype(), as this is performed by the main python thread and synchronized to changes in the default dtype. However, once the output type is determined, I don't think it's appropriate for the backend thread to once-again check the actual output type against the current value of GetDefaultDtype(). That would be a race. The backend thread should merely execute the op for the dtype that is set, assuming it's supported. Both issues are addressed by the commit referenced above. See "To Reproduce" below for a small program that should pass on any proposed alternative fix. Finally, I'll acknowledge that the implementation of true_divide(int_array, int_array), potentially with broadcast, has similar issues, but the implementation is trickier and best left to our numpy module developers. ### Error Message see above. ## To Reproduce ``` import mxnet as mx import numpy as _np from mxnet import numpy as np @mx.use_np_default_dtype def test_true_divide(a,b, synchronize=False): x = np.true_divide(a,b) if synchronize: # Make sure the backend, which presently calls mxnet::common::GetDefaultDtype(), is done. mx.nd.waitall() return x # divide int array with int scalar to produce float64 array for synchronize in [True, False]: out = test_true_divide(mx.nd.array([1,], dtype='int'), 9, synchronize=synchronize) print('true_divide output value (of type {}) = {:.15f}'.format(out.dtype, out.asnumpy()[0])) ``` ### Steps to reproduce 1. <copy-paste above code to a file test_true_divide.py> 2. python test_true_divide.py ## What have you tried to solve it? 1. Commit included in referenced in-development PR. 2. ## Environment ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org