DickJC123 opened a new issue #18731:
URL: https://github.com/apache/incubator-mxnet/issues/18731


   ## Description
   test_default_float_dtype tests (among other things) the true_divide op's 
ability to divide an int array by an int scalar to produce either a float32 or 
float64 output array based on the setting of is_np_default_dtype.  However, the 
test overlooks a backend error that is created during the test, which can 
surface during a test teardown of a future test.  This is not only confusing, 
but also points to issues with the true_divide op itself.
   
   The simplest way to expose this issue is with:
   ```
   MXNET_ENGINE_TYPE=NaiveEngine pytest --verbose -s 
tests/python/unittest/test_numpy_default_dtype.py::test_default_float_dtype
   ```
   Alternatively, one can add the line `mxnet.nd.waitall()` to the end of 
test_default_float_dtype, at which point the MXNET_ENGINE_TYPE setting is not 
needed.  In both cases, the error is:
   ```
   ...
   E           mxnet.base.MXNetError: Traceback (most recent call last):
   E             File "include/mxnet/././tensor_blob.h", line 256
   E           MXNetError: Check failed: mshadow: :DataType<DType>::kFlag == 
type_flag_: TBlob.get_with_shape: data type do not match specified 
type.Expected: double v.s. given float
   
   python/mxnet/_ffi/_ctypes/function.py:115: MXNetError
   ```
   So basically the test incorrectly feels that the output is of the expected 
type, when in fact the backend engine's execution of the function fails (with a 
type mismatch).  I ran into this problem when my in-development PR 
https://github.com/apache/incubator-mxnet/pull/18694 began to include a 
mxnet.nd.waitall() call between unittests for enhanced test isolation.
   
   In order to make progress with my PR, I investigated the issues with 
true_divide, and have a fixing commit 
https://github.com/apache/incubator-mxnet/pull/18694/commits/65f16d8638747aa0e2a4bfe911ddeef675c56b9c
 as part of that PR.  I'd appreciate a review of these changes, or an alternate 
fix proposed.  Thanks @haojin2 , @reminisce or @JiangZhaoh . 
   
   Basically, the fix addresses two issues:
   
   First, the output dtype of true_divide is a based on a parameter set by the 
main python thread and accessed within C++ by mxnet::common::GetDefaultDtype(). 
 While this can be either float32 or float64, the backend implementation has a 
hard-coded cast of the output tensor pointer to float:
   
https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/src/operator/numpy/np_true_divide-inl.h#L66-L72
   
   Also, before performing the above execution, the worker thread performs the 
check: 
https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/src/operator/numpy/np_true_divide-inl.h#L62-L65
   
   I think it's totally appropriate for the InferType function to call 
mxnet::common::GetDefaultDtype(), as this is performed by the main python 
thread and synchronized to changes in the default dtype.  However, once the 
output type is determined, I don't think it's appropriate for the backend 
thread to once-again check the actual output type against the current value of 
GetDefaultDtype().  That would be a race.  The backend thread should merely 
execute the op for the dtype that is set, assuming it's supported.
   
   Both issues are addressed by the commit referenced above.
   
   See "To Reproduce" below for a small program that should pass on any 
proposed alternative fix.
   
   Finally, I'll acknowledge that the implementation of true_divide(int_array, 
int_array), potentially with broadcast, has similar issues, but the 
implementation is trickier and best left to our numpy module developers.
   
    
   ### Error Message
   see above.
   
   ## To Reproduce
   ```
   import mxnet as mx
   import numpy as _np
   from mxnet import numpy as np
   
   @mx.use_np_default_dtype
   def test_true_divide(a,b, synchronize=False):
       x = np.true_divide(a,b)
       if synchronize:
           # Make sure the backend, which presently calls 
mxnet::common::GetDefaultDtype(), is done.
           mx.nd.waitall()
       return x
   
   
   # divide int array with int scalar to produce float64 array
   for synchronize in [True, False]:
       out = test_true_divide(mx.nd.array([1,], dtype='int'), 9, 
synchronize=synchronize)
       print('true_divide output value (of type {}) = 
{:.15f}'.format(out.dtype, out.asnumpy()[0]))
   ```
   ### Steps to reproduce
   
   
   1. <copy-paste above code to a file test_true_divide.py>
   2. python test_true_divide.py
   
   ## What have you tried to solve it?
   
   1. Commit included in referenced in-development PR.
   2.
   
   ## Environment
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to