jorisvandenbossche commented on issue #49002:
URL: https://github.com/apache/arrow/issues/49002#issuecomment-3846614769

   > When converting pyarrow arrays, that are not extension types or do not 
have a `type_mapper` argument specified in `to_pandas()`, the PyArrow codepath 
will go through the C++ layer transforming the pyarrow array into a numpy one. 
From there pandas series is constructed with the use of the pandas api.
   
   That's indeed correct. For `Table.to_pandas()`, we do have some custom 
code that ensures we use the new `str` dtype for pandas 3+, so there the 
conversion is also correct with all nulls:
   
   ```
   >>> import pyarrow as pa
   >>> pa.table({"a": pa.array([None], type="str")}).to_pandas()
        a
   0  NaN
   >>> pa.table({"a": pa.array([None], type="str")}).to_pandas().dtypes
   a    str
   dtype: object
   ```
   
   But for `Array.to_pandas()` we don't have that logic, and currently it 
directly uses the C++ code path that creates a numpy object dtype array. It 
is then the `pandas.Series(..)` constructor that infers the `str` dtype when 
passed an object array. But when all values are None, pandas cannot know it 
should be a string dtype, and thus leaves it as object.
   
   
   
   
   > What one can do is to use `types_mapper=pd.ArrowDtype` and then the dunder 
`__from_arrow__` method takes precedence over the C++ conversion:
   
   One small correction: to get the default string dtype, one has to use 
`types_mapper={pa.string(): pd.StringDtype(na_value=np.nan)}.get` 
(`types_mapper` must be a callable, hence the `.get`), and not the 
experimental `pd.ArrowDtype`.
   
   > Also we might add a check in `_array_like_to_pandas` to default to the 
`pd.ArrowDtype` for string and similar types (not overwriting `types_mapper`)?
   
   Yes, I think we can indeed do that. 
   
   We currently have the following in the table conversion logic:
   
   
https://github.com/apache/arrow/blob/d2315fe00345b87a28f8fb268a1017934d4bf58a/python/pyarrow/pandas_compat.py#L936-L944
   
   We can do something similar in `_array_like_to_pandas`, if `types_mapper` is 
not specified, and if `_pandas_api.uses_string_dtype()`, and if the array's 
type is string/large_string/string_view, then set `dtype` to 
`pd.StringDtype(na_value=np.nan)`.

