jorisvandenbossche commented on issue #49002:
URL: https://github.com/apache/arrow/issues/49002#issuecomment-3846614769
> When converting pyarrow arrays, that are not extension types or do not
> have a `type_mapper` argument specified in `to_pandas()`, the PyArrow codepath
> will go through the C++ layer transforming the pyarrow array into a numpy one.
> From there pandas series is constructed with the use of the pandas api.
That's indeed correct. For `Table.to_pandas()`, we do have some custom code
that ensures we use the new `str` dtype with pandas 3+, so there the
conversion is also correct when all values are null:
```
>>> pa.table({"a": pa.array([None], type="str")}).to_pandas()
     a
0  NaN
>>> pa.table({"a": pa.array([None], type="str")}).to_pandas().dtypes
a    str
dtype: object
```
But for `Array.to_pandas()` we don't have that logic, and currently we indeed
go directly through the C++ code path, which creates a numpy object dtype
array. It is then the `pandas.Series(..)` constructor that infers the `str`
dtype when it is passed an object array. But when all values are None, pandas
cannot know it should be a string dtype, and thus leaves it as object.
> What one can do is to use `types_mapper=pd.ArrowDtype` and then the dunder
> `__from_arrow__` method takes precedence over the C++ conversion:
One small correction: if you want to get the default string dtype, one has
to use `types_mapper={pa.string(): pd.StringDtype(na_value=np.nan)}.get`
(`types_mapper` expects a function, hence the `.get`), and not the
experimental `pd.ArrowDtype`.
> Also we might add a check in `_array_like_to_pandas` to default to the
> `pd.ArrowDtype` for string and similar types (not overwriting `types_mapper`)?
Yes, I think we can indeed do that.
We currently have the following in the table conversion logic:
https://github.com/apache/arrow/blob/d2315fe00345b87a28f8fb268a1017934d4bf58a/python/pyarrow/pandas_compat.py#L936-L944
We can do something similar in `_array_like_to_pandas`, if `types_mapper` is
not specified, and if `_pandas_api.uses_string_dtype()`, and if the array's
type is string/large_string/string_view, then set `dtype` to
`pd.StringDtype(na_value=np.nan)`.