amol- commented on a change in pull request #10101:
URL: https://github.com/apache/arrow/pull/10101#discussion_r616690521
##########
File path: python/pyarrow/array.pxi
##########
@@ -1170,7 +1170,9 @@ cdef class Array(_PandasConvertible):
array = PyObject_to_object(out)
if isinstance(array, dict):
+ missings = array["indices"] < 0
array = np.take(array['dictionary'], array['indices'])
+ array[missings] = np.NaN
Review comment:
Added an optimization based on the `zero_copy_only` option (which doesn't
allow nulls) and on `self.null_count`. The null count is cached (
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/array/data.cc#L120-L121
), so checking it should rarely add overhead.
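
To make the intent of the guard concrete, here is a minimal sketch in plain NumPy. It is not the code in the PR: `decode_dictionary` and its parameters are hypothetical names used only for illustration, and `null_count` stands in for the cached `self.null_count`.

```python
import numpy as np


def decode_dictionary(dictionary, indices, null_count, zero_copy_only=False):
    """Hypothetical helper: expand dictionary-encoded values, masking nulls only when needed."""
    values = np.take(dictionary, indices)
    # Skip the mask entirely when no nulls can be present:
    # zero_copy_only does not allow nulls, and null_count == 0 means nothing to patch.
    if not zero_copy_only and null_count > 0:
        values[indices < 0] = np.nan
    return values


# Illustrative data (not taken from the PR); object dtype so NaN can sit next to strings.
dictionary = np.array(["a", "b", "c"], dtype=object)
with_null = decode_dictionary(dictionary, np.array([0, -1, 2]), null_count=1)
without_null = decode_dictionary(dictionary, np.array([0, 1, 2]), null_count=0)
```

The boolean mask `indices < 0` is only computed on the slow path, which is the point of checking the cached count first.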
Also confirmed that `-1` is used to signal NULL values when converting to
`numpy` arrays (
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/python/arrow_to_pandas.cc#L1637
).
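
As a small illustration of why the `-1` marker matters, the sketch below uses made-up data rather than anything from the PR. With its default mode, `np.take` treats `-1` as "last element", so a null slot silently picks up a real dictionary value unless it is masked afterwards.

```python
import numpy as np

# Illustrative dictionary and indices; -1 marks a null slot.
dictionary = np.array(["x", "y", "z"], dtype=object)
indices = np.array([0, -1, 1])

# Naive decoding: the -1 wraps around and selects the last dictionary entry.
naive = np.take(dictionary, indices)

# Masked decoding: overwrite the slots whose index was negative with NaN.
fixed = naive.copy()
fixed[indices < 0] = np.nan
```

Here `naive[1]` ends up as the last dictionary entry, while `fixed[1]` is NaN.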
A further optimization would be to have `ConvertArrayToPandas` return the
count of null values (it already invokes `IsValid` on them anyway), but that
requires a more widespread change, so I think it should be deferred until
proven necessary.