Joris Van den Bossche created ARROW-18124:
---------------------------------------------

             Summary: [Python] Support converting to non-nano datetime64 for pandas >= 2.0
                 Key: ARROW-18124
                 URL: https://issues.apache.org/jira/browse/ARROW-18124
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Joris Van den Bossche
             Fix For: 11.0.0


Pandas is adding capabilities to store non-nanosecond datetime64 data. At the
moment, however, we always convert to nanoseconds, regardless of the timestamp
resolution of the Arrow table (and regardless of the pandas metadata).

Using the development version of pandas:

{code}
In [1]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})

In [2]: df.dtypes
Out[2]: 
col    datetime64[s]
dtype: object

In [3]: table = pa.table(df)

In [4]: table.schema
Out[4]: 
col: timestamp[s]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423

In [6]: table.to_pandas().dtypes
Out[6]: 
col    datetime64[ns]
dtype: object
{code}

This is because we have a {{coerce_temporal_nanoseconds}} conversion option that
is hardcoded to True for top-level columns (and to False for nested data).
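
For illustration, a minimal sketch of what this could look like if the existing
internal option were exposed as a keyword on {{to_pandas()}} (the keyword shown
here is hypothetical and not part of the public API at the time of writing):

{code}
# Illustrative sketch only: exposing "coerce_temporal_nanoseconds" as a
# to_pandas() keyword is hypothetical, mirroring the internal conversion option.
# Requires the development (2.0) version of pandas for datetime64[s] input.
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})
table = pa.table(df)  # timestamp[s] column

# Hypothetical: keep the second resolution instead of coercing to nanoseconds
result = table.to_pandas(coerce_temporal_nanoseconds=False)
assert result["col"].dtype == "datetime64[s]"
{code}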

When users have pandas >= 2, we should support converting while preserving the
resolution. We should certainly do so when the pandas metadata indicates which
resolution was originally used (to ensure a correct roundtrip).
We _could_ (and at some point also _should_) do that by default when there is no
pandas metadata, but maybe only later, depending on how stable this new feature
is in pandas, since it is potentially a breaking change for users who, for
example, use pyarrow to read a Parquet file.
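
As a rough workaround sketch today (assuming pandas >= 2.0), the original
resolution recorded in the pandas metadata attached to the schema can be used to
restore the dtype after the nanosecond conversion:

{code}
# Workaround sketch, assuming pandas >= 2.0 (development version): restore the
# original datetime resolution recorded in the schema's pandas metadata.
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})
table = pa.table(df)

result = table.to_pandas()  # currently always coerced to datetime64[ns]

# Each entry in "columns" records the original numpy dtype, e.g. "datetime64[s]".
for col_meta in table.schema.pandas_metadata["columns"]:
    name = col_meta["name"]
    numpy_type = col_meta["numpy_type"]
    if name in result.columns and numpy_type.startswith("datetime64"):
        result[name] = result[name].astype(numpy_type)

assert result["col"].dtype == "datetime64[s]"
{code}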


