jorisvandenbossche commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616786742
##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,13 @@ cdef class Array(_PandasConvertible):
array = PyObject_to_object(out)
if isinstance(array, dict):
- array = np.take(array['dictionary'], array['indices'])
+ if zero_copy_only or not self.null_count:
+ # zero_copy doesn't allow for nulls to be in the array
+ array = np.take(array['dictionary'], array['indices'])
+ else:
+ missings = array["indices"] < 0
+ array = np.take(array['dictionary'], array['indices'])
+ array[missings] = np.NaN
Review comment:
I understand that it can be confusing, but the fact that
`da.to_pandas()[1] == NaN` is the case, is a pandas behaviour / implementation
detail (pandas doesn't use either NaN or None under the hood for a Categorical,
but returns NaN when accessing the value. And, in a potential next version with
pandas using nullable dtypes, this would no longer be true).
I think the relevant cases where we would want consistency is for the
different `to_numpy` conversions:
```
In [20]: pa.array(["a", "b", None]).to_numpy(zero_copy_only=False)
Out[20]: array(['a', 'b', None], dtype=object)
In [21]: pa.array(["a", "b",
None]).dictionary_encode().to_numpy(zero_copy_only=False)
Out[21]: array(['a', 'b', 'b'], dtype=object)
```
where of course right now the last one is buggy (which is what this PR is
solving), but so I would expect those two above to return the same in case of
`to_numpy()`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]