Tim Swast created ARROW-4965:
--------------------------------

             Summary: [Python] Timestamp array type detection should use tzname of datetime.datetime objects
                 Key: ARROW-4965
                 URL: https://issues.apache.org/jira/browse/ARROW-4965
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
         Environment: $ python --version
Python 3.7.2
$ pip freeze
numpy==1.16.2
pyarrow==0.12.1
pytz==2018.9
six==1.12.0
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.3
BuildVersion: 18D109
(pyarrow)
            Reporter: Tim Swast


When an array is built from datetime objects, type detection appears to ignore the presence of a tzinfo on the datetime object and stores the values as naive timestamp columns.

Python code:

{code:python}
import datetime
import pytz
import pyarrow as pa

naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))


def inspect(varname):
    print(varname)
    arr = globals()[varname]
    print(arr.type)
    print(arr)
    print()


# Arrays whose type is detected automatically.
auto_naive_arr = pa.array([naive_datetime])
inspect("auto_naive_arr")

auto_utc_arr = pa.array([utc_datetime])
inspect("auto_utc_arr")

auto_tzaware_arr = pa.array([tzaware_datetime])
inspect("auto_tzaware_arr")

auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
inspect("auto_mixed_arr")

# Arrays with an explicit timestamp type built from each datetime's tzname().
naive_type = pa.timestamp("us", naive_datetime.tzname())
utc_type = pa.timestamp("us", utc_datetime.tzname())
tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())

naive_arr = pa.array([naive_datetime], type=naive_type)
inspect("naive_arr")

utc_arr = pa.array([utc_datetime], type=utc_type)
inspect("utc_arr")

tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
inspect("tzaware_arr")

mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
inspect("mixed_arr")
{code}

This prints:

{noformat}
$ python detect_timezone.py
auto_naive_arr
timestamp[us]
[
  1547381470000000
]

auto_utc_arr
timestamp[us]
[
  1547381470000000
]

auto_tzaware_arr
timestamp[us]
[
  1547352670000000
]

auto_mixed_arr
timestamp[us]
[
  1547381470000000,
  1547352670000000
]

naive_arr
timestamp[us]
[
  1547381470000000
]

utc_arr
timestamp[us, tz=UTC]
[
  1547381470000000
]

tzaware_arr
timestamp[us, tz=PST]
[
  1547352670000000
]

mixed_arr
timestamp[us, tz=UTC]
[
  1547381470000000,
  1547352670000000
]
{noformat}

But I would expect the following types instead:

* {{auto_naive_arr}}: {{timestamp[us]}}
* {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
* {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe {{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as the {{tzname}})
* {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}

Also, in the "mixed" case, I'd expect the actual stored microseconds to be the same for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both refer to the same point in time.

It seems reasonable for any naive datetime objects mixed in with tz-aware datetimes to be interpreted as UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
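
For illustration, the behavior requested in the report can be sketched with the standard library alone. This is only a sketch of the expected semantics, not pyarrow's implementation: {{to_utc_micros}} is a hypothetical helper, and a fixed UTC-8 {{datetime.timezone}} stands in for {{pytz.timezone('America/Los_Angeles')}}. In the printed output above, the two {{auto_mixed_arr}} values differ by 28,800,000,000 microseconds (eight hours), whereas normalizing both datetimes to UTC before conversion yields the same value for each.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
ONE_MICROSECOND = datetime.timedelta(microseconds=1)


def to_utc_micros(values):
    """Hypothetical helper: return (tz, microseconds since the UTC epoch).

    tz is "UTC" if any value carries a tzinfo, otherwise None. Naive values
    are interpreted as UTC, as the report suggests.
    """
    tz = "UTC" if any(v.tzinfo is not None for v in values) else None
    micros = []
    for v in values:
        if v.tzinfo is None:
            # Assumption from the report: naive datetimes are treated as UTC.
            v = v.replace(tzinfo=datetime.timezone.utc)
        # Aware-datetime subtraction compares instants, not wall-clock fields.
        micros.append((v - EPOCH) // ONE_MICROSECOND)
    return tz, micros


utc_datetime = datetime.datetime(
    2019, 1, 13, 12, 11, 10, tzinfo=datetime.timezone.utc
)
# Fixed UTC-8 offset standing in for pytz.timezone('America/Los_Angeles').
pst = datetime.timezone(datetime.timedelta(hours=-8), "PST")
tzaware_datetime = utc_datetime.astimezone(pst)

tz, micros = to_utc_micros([utc_datetime, tzaware_datetime])
print(tz, micros)
```

Both rows come out as 1547381470000000 with a detected timezone of UTC, which matches the expectation stated for the "mixed" case.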