[ 
https://issues.apache.org/jira/browse/ARROW-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633652#comment-17633652
 ] 

Miles Granger commented on ARROW-18298:
---------------------------------------

At first I thought this was only a display issue, since converting the table 
back to pandas in this example gives the "correct" representation of the value:
{code:python}
In [9]: import pandas as pd
   ...: import pyarrow
   ...: ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
   ...: df = pd.DataFrame({"TS": [ts]})
   ...: table = pyarrow.Table.from_pandas(df)

In [10]: print(df)
                         TS
0 2022-10-21 22:46:17-07:00

In [11]: print(table.to_pandas())
                         TS
0 2022-10-21 22:46:17-07:00
{code}
However, mixing timezones in one column makes the behavior more apparent: every 
value is coerced to the first timezone.
{code:python}
In [12]: import pandas as pd
    ...: import pyarrow
    ...: ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
    ...: df = pd.DataFrame({"TS": [ts, pd.Timestamp("2022-10-21 22:46:17", tz="UTC")]})
    ...: table = pyarrow.Table.from_pandas(df)

In [13]: print(df)
                          TS
0  2022-10-21 22:46:17-07:00
1  2022-10-21 22:46:17+00:00

In [14]: print(table)
pyarrow.Table
TS: timestamp[us, tz=America/Los_Angeles]
----
TS: [[2022-10-22 05:46:17.000000,2022-10-21 22:46:17.000000]]

In [15]: print(table.to_pandas())
                         TS
0 2022-10-21 22:46:17-07:00
1 2022-10-21 15:46:17-07:00
{code}
I believe a {{TimestampArray}} has to store every value in the array the same 
way, with a single timezone for the whole array, and that's why it's doing 
this. The instants themselves are preserved: {{print(table)}} shows the values 
in UTC, and {{22:46:17-07:00}} is {{05:46:17}} UTC on the following day. I'm 
not sure what the right solution is at the moment. In a way it is doing us a 
favor by aligning the values to one timezone: mixing timezones in pandas gives 
the column an {{object}} dtype, while after the roundtrip the pandas Series 
gets the arguably better {{datetime64[ns, America/Los_Angeles]}} dtype.

> [Python] datetime shifted when using pyarrow.Table.from_pandas to load a 
> pandas DataFrame containing datetime with timezone
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18298
>                 URL: https://issues.apache.org/jira/browse/ARROW-18298
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>         Environment: MacOS M1, Python 3.8.13
>            Reporter: Adam Ling
>            Priority: Major
>
> Problem:
> When using pyarrow.Table.from_pandas to load a pandas DataFrame which 
> contains a timestamp object with timezone information, the created Table 
> object will shift the datetime, while still keeping the timezone information. 
> Please see my scripts.
>  
> Reproduce scripts:
> {code:python}
> import pandas as pd
> import pyarrow
> ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
> df = pd.DataFrame({"TS": [ts]})
> table = pyarrow.Table.from_pandas(df)
> print(df)
> """
>                          TS
> 0 2022-10-21 22:46:17-07:00
> """
> print(table)
> """
> pyarrow.Table
> TS: timestamp[ns, tz=America/Los_Angeles]
> ----
> TS: [[2022-10-22 05:46:17.000000000]]
> """
> {code}
> Expected results:
> The table should not shift the datetime when timezone information is provided.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)