jorisvandenbossche commented on issue #47460:
URL: https://github.com/apache/arrow/issues/47460#issuecomment-3257621992
Sticking to the string topic here for a moment, you say:
> it appears this is functionnaly a string and it's read as an object
> because all values are null. If a single value is not null, the column becomes
> a string.
First, reading the Parquet file with pyarrow itself (into an Arrow table,
without conversion to pandas) correctly reads both the all-null and the
partially-null string columns as strings:
```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("Downloads/test_parquet_null.parquet")
>>> table
pyarrow.Table
test_int64: int64
test_name: string
test_ts: timestamp[ns]
string_partially_null: string
----
test_int64: [[1,2,null]]
test_name: [[null,null,null]]
test_ts: [[null,null,null]]
string_partially_null: [[null,"toto","toto"]]
```
So any confusion can only come from the Arrow-to-pandas conversion. But with
the currently released pandas, both string columns are converted to object
dtype (as that is the default way that pandas stores strings right now):
```python
>>> table.to_pandas().dtypes
test_int64                      float64
test_name                        object
test_ts                  datetime64[ns]
string_partially_null            object
dtype: object
```
The upcoming pandas 3.0 release will have a proper string dtype, and in
that case both columns will be string columns:
```python
# using pandas 3.0 dev
>>> table.to_pandas().dtypes
test_int64                      float64
test_name                           str
test_ts                  datetime64[ns]
string_partially_null               str
dtype: object
```
In summary, I don't really understand what you mean by "it's read as an
object because all values are null": I don't see any difference between the
all-null and the partially-null column.