Tim Swast created ARROW-4965:
--------------------------------

             Summary: [Python] Timestamp array type detection should use tzname 
of datetime.datetime objects
                 Key: ARROW-4965
                 URL: https://issues.apache.org/jira/browse/ARROW-4965
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
         Environment: $ python --version
Python 3.7.2

$ pip freeze
numpy==1.16.2
pyarrow==0.12.1
pytz==2018.9
six==1.12.0

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.14.3
BuildVersion:   18D109
            Reporter: Tim Swast


When {{pa.array}} builds an array from {{datetime.datetime}} objects, type 
detection appears to ignore any {{tzinfo}} set on those objects and stores 
the values in a naive timestamp column instead.

Python code:

{code:python}
import datetime
import pytz
import pyarrow as pa

naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))

def inspect(varname):
    print(varname)
    arr = globals()[varname]
    print(arr.type)
    print(arr)
    print()

auto_naive_arr = pa.array([naive_datetime])
inspect("auto_naive_arr")

auto_utc_arr = pa.array([utc_datetime])
inspect("auto_utc_arr")

auto_tzaware_arr = pa.array([tzaware_datetime])
inspect("auto_tzaware_arr")

auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
inspect("auto_mixed_arr")

naive_type = pa.timestamp("us", naive_datetime.tzname())
utc_type = pa.timestamp("us", utc_datetime.tzname())
tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())

naive_arr = pa.array([naive_datetime], type=naive_type)
inspect("naive_arr")

utc_arr = pa.array([utc_datetime], type=utc_type)
inspect("utc_arr")

tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
inspect("tzaware_arr")

mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
inspect("mixed_arr")
{code}

This prints:

{noformat}
$ python detect_timezone.py
auto_naive_arr
timestamp[us]
[
  1547381470000000
]

auto_utc_arr
timestamp[us]
[
  1547381470000000
]

auto_tzaware_arr
timestamp[us]
[
  1547352670000000
]

auto_mixed_arr
timestamp[us]
[
  1547381470000000,
  1547352670000000
]

naive_arr
timestamp[us]
[
  1547381470000000
]

utc_arr
timestamp[us, tz=UTC]
[
  1547381470000000
]

tzaware_arr
timestamp[us, tz=PST]
[
  1547352670000000
]

mixed_arr
timestamp[us, tz=UTC]
[
  1547381470000000,
  1547352670000000
]
{noformat}

But I would expect the following types instead:

* {{auto_naive_arr}}: {{timestamp[us]}}
* {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
* {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe 
{{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as the 
{{tzname}})
* {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}

Also, in the "mixed" case, I'd expect the stored microseconds to be the same 
for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both refer to 
the same point in time. Instead the two values differ by 28800000000 us 
(8 hours), which suggests the local wall-clock time is stored without being 
converted to UTC. It seems reasonable for any naive datetime objects mixed in 
with tz-aware datetimes to be interpreted as UTC.
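
As a rough sketch of the behavior requested above (not pyarrow's actual implementation), the detection rule could look something like the following. {{infer_tz}} and {{to_utc_micros}} are hypothetical helper names, and fixed-offset {{datetime.timezone}} objects stand in for the pytz zones so that the sketch only needs the standard library.

```python
import datetime

UTC = datetime.timezone.utc
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=UTC)


def infer_tz(values):
    """Hypothetical rule: pick a tz name for a list of datetime objects.

    Returns None if every value is naive, the shared tzname if all
    tz-aware values agree, and "UTC" if the tz-aware values disagree.
    """
    names = {dt.tzname() for dt in values if dt.utcoffset() is not None}
    if not names:
        return None
    if len(names) == 1:
        return names.pop()
    return "UTC"


def to_utc_micros(dt):
    """Microseconds since the Unix epoch, normalized to UTC.

    Naive values are interpreted as UTC (the assumption proposed in
    this issue, not established pyarrow behavior).
    """
    if dt.utcoffset() is None:
        dt = dt.replace(tzinfo=UTC)
    return (dt - EPOCH) // datetime.timedelta(microseconds=1)


# Stand-in for America/Los_Angeles in January (fixed UTC-8 offset).
PST = datetime.timezone(datetime.timedelta(hours=-8), "PST")

naive_dt = datetime.datetime(2019, 1, 13, 12, 11, 10)
utc_dt = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=UTC)
pst_dt = utc_dt.astimezone(PST)

print(infer_tz([naive_dt]))        # None
print(infer_tz([utc_dt]))          # UTC
print(infer_tz([pst_dt]))          # PST
print(infer_tz([utc_dt, pst_dt]))  # UTC
# Both rows refer to the same instant, so the stored values match.
print([to_utc_micros(dt) for dt in (utc_dt, pst_dt)])
```

With this rule, the two rows of the "mixed" array would both be stored as 1547381470000000, matching the value already shown for {{utc_arr}}.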



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
