jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616786742



##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
         array = PyObject_to_object(out)
 
         if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
+            if zero_copy_only or not self.null_count:
+                # zero_copy doesn't allow for nulls to be in the array
+                array = np.take(array['dictionary'], array['indices'])
+            else:
+                missings = array["indices"] < 0
+                array = np.take(array['dictionary'], array['indices'])
+                array[missings] = np.NaN

Review comment:
       I understand that it can be confusing, but the fact that 
`da.to_pandas()[1] == NaN` is the case, is a pandas behaviour / implementation 
detail (pandas doesn't use either NaN or None under the hood for a Categorical, 
but returns NaN when accessing the value. And, in a potential next version with 
pandas using nullable dtypes, this would no longer be true). 
   
   I think the relevant cases where we would want consistency is for the 
different `to_numpy` conversions:
   
   ```
   In [20]: pa.array(["a", "b", None]).to_numpy(zero_copy_only=False)
   Out[20]: array(['a', 'b', None], dtype=object)
   
   In [21]: pa.array(["a", "b", 
None]).dictionary_encode().to_numpy(zero_copy_only=False)
   Out[21]: array(['a', 'b', 'b'], dtype=object)
   ```
   
   where of course right now the last one is buggy (which is what this PR is 
solving), but so I would expect those two above to return the same in case of 
`to_numpy()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to