Joris Van den Bossche created ARROW-5912:
--------------------------------------------

             Summary: [Python] conversion from datetime objects with mixed 
timezones should normalize to UTC
                 Key: ARROW-5912
                 URL: https://issues.apache.org/jira/browse/ARROW-5912
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Joris Van den Bossche


Currently, when an array is created from datetime objects with mixed 
timezones, each object is converted separately using its own local 
(wall-clock) time:

{code:python}
>>> import pandas as pd
>>> import pyarrow as pa

>>> ts_pd_paris = pd.Timestamp("1970-01-01 01:00", tz="Europe/Paris")
>>> ts_pd_paris
Timestamp('1970-01-01 01:00:00+0100', tz='Europe/Paris')
>>> ts_pd_helsinki = pd.Timestamp("1970-01-01 02:00", tz="Europe/Helsinki")
>>> ts_pd_helsinki
Timestamp('1970-01-01 02:00:00+0200', tz='Europe/Helsinki')

>>> a = pa.array([ts_pd_paris, ts_pd_helsinki])
>>> a
<pyarrow.lib.TimestampArray object at 0x7f7856c4a360>
[
  1970-01-01 01:00:00.000000,
  1970-01-01 02:00:00.000000
]
>>> a.type
TimestampType(timestamp[us])
{code}

Both timestamps refer to the same moment in time (the same value in UTC; in 
pandas their stored {{value}} is also the same), but after conversion to 
pyarrow they are both tz-naive and no longer represent the same time. That is 
unexpected and a likely source of bugs.
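
The mismatch can be reproduced with the standard library alone. The following is a minimal sketch, not pyarrow's actual conversion code: fixed UTC offsets stand in for Europe/Paris and Europe/Helsinki (an assumption that holds for the 1970 dates above), and dropping {{tzinfo}} is used only to imitate the wall-clock interpretation described here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for Europe/Paris (+01:00) and
# Europe/Helsinki (+02:00), matching the offsets shown for these dates.
paris = timezone(timedelta(hours=1))
helsinki = timezone(timedelta(hours=2))

dt_paris = datetime(1970, 1, 1, 1, 0, tzinfo=paris)
dt_helsinki = datetime(1970, 1, 1, 2, 0, tzinfo=helsinki)

# Aware datetimes compare by the instant they denote, not by wall-clock time.
same_instant = dt_paris == dt_helsinki

# Imitate the per-object wall-clock interpretation by discarding tzinfo.
naive_paris = dt_paris.replace(tzinfo=None)
naive_helsinki = dt_helsinki.replace(tzinfo=None)

# The two naive values now differ even though the instants were equal.
gap = naive_helsinki - naive_paris
```

Here {{same_instant}} is true while {{gap}} is one hour, which is the discrepancy seen in the array above.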

I think a better option would be to normalize to UTC and return a tz-aware 
TimestampArray with UTC as its timezone. 
This is also what pandas does if you force the conversion to a datetime dtype 
(by default pandas keeps them in an object array, preserving the different 
timezones).
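
As a rough illustration of the proposed normalization, here is a standard-library sketch. The helper {{normalize_to_utc}} is hypothetical (it is not a pyarrow or pandas function), and fixed UTC offsets are assumed in place of the named timezones.

```python
from datetime import datetime, timedelta, timezone


def normalize_to_utc(values):
    """Hypothetical helper: convert tz-aware datetimes to UTC."""
    normalized = []
    for dt in values:
        # Naive datetimes carry no offset, so there is nothing to normalize.
        if dt.utcoffset() is None:
            raise ValueError("expected a tz-aware datetime")
        normalized.append(dt.astimezone(timezone.utc))
    return normalized


# Assumption: fixed offsets stand in for Europe/Paris and Europe/Helsinki.
mixed = [
    datetime(1970, 1, 1, 1, 0, tzinfo=timezone(timedelta(hours=1))),
    datetime(1970, 1, 1, 2, 0, tzinfo=timezone(timedelta(hours=2))),
]

normalized = normalize_to_utc(mixed)
```

After normalization both entries share one timezone (UTC) and the same value, so they could be stored in a single tz-aware array without losing the fact that they denote the same instant.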



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
