[ https://issues.apache.org/jira/browse/ARROW-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markovtsev Vadim updated ARROW-8066:
------------------------------------
    Description:
The original description is at [https://github.com/pandas-dev/pandas/issues/32587]

|h4. Code Sample, a copy-pastable example if possible

import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame.from_records([
    (1, datetime.now().replace(tzinfo=timezone.utc)),
    (2, datetime.now().replace(tzinfo=timezone.min))],
    columns=["1", "2"])
print(df["2"])
print()
df.to_feather("/tmp/1")
df2 = pd.read_feather("/tmp/1")
print(df2["2"])

This code will output:

{{0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0   2020-03-10 18:13:49.405598
1   2020-03-10 18:13:49.405626
Name: 2, dtype: datetime64[ns]}}

h4. Problem description

The round-trip dtype changed from the correct {{object}} to the incorrect {{datetime64}}. Thus the timezones were discarded in Arrow and the timestamps became invalid.

h4. Expected Output

(identical)

{{0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object}}

h4. Output of {{pd.show_versions()}}|

  was:
The description is at [https://github.com/pandas-dev/pandas/issues/32587]

#### Code Sample, a copy-pastable example if possible

{code:python}
import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame.from_records([
    (1, datetime.now().replace(tzinfo=timezone.utc)),
    (2, datetime.now().replace(tzinfo=timezone.min))],
    columns=["1", "2"])
print(df["2"])
print()
df.to_feather("/tmp/1")
df2 = pd.read_feather("/tmp/1")
print(df2["2"])
{code}

This code will output:

{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0   2020-03-10 18:13:49.405598
1   2020-03-10 18:13:49.405626
Name: 2, dtype: datetime64[ns]
{noformat}

#### Problem description

The round-trip dtype changed from the correct `object` to the incorrect `datetime64`.
Thus the timezones were discarded in Arrow and the timestamps became invalid.

#### Expected Output

(identical)

{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object
{noformat}

#### Output of ``pd.show_versions()``

{noformat}
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.3.0-40-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.17.4
pytz             : 2019.2
dateutil         : 2.7.3
pip              : 19.3.1
setuptools       : 42.0.1
Cython           : 0.29.14
pytest           : 5.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.10.0
pandas_datareader: None
bs4              : 4.8.1
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
pytest           : 5.3.1
pyxlsb           : None
s3fs             : None
scipy            : 1.2.1
sqlalchemy       : 1.3.12
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
{noformat}


> PyArrow discards timezones
> --------------------------
>
>                 Key: ARROW-8066
>                 URL: https://issues.apache.org/jira/browse/ARROW-8066
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>            Reporter: Markovtsev Vadim
>            Priority: Major
>
> The original description is at
> [https://github.com/pandas-dev/pandas/issues/32587]
>
> |h4. Code Sample, a copy-pastable example if possible
>
> import pandas as pd
> from datetime import datetime, timezone
>
> df = pd.DataFrame.from_records([
>     (1, datetime.now().replace(tzinfo=timezone.utc)),
>     (2, datetime.now().replace(tzinfo=timezone.min))],
>     columns=["1", "2"])
> print(df["2"])
> print()
> df.to_feather("/tmp/1")
> df2 = pd.read_feather("/tmp/1")
> print(df2["2"])
>
> This code will output:
>
> {{0    2020-03-10 18:13:49.405598+00:00
> 1    2020-03-10 18:13:49.405626-23:59
> Name: 2, dtype: object
>
> 0   2020-03-10 18:13:49.405598
> 1   2020-03-10 18:13:49.405626
> Name: 2, dtype: datetime64[ns]}}
>
> h4. Problem description
>
> The round-trip dtype changed from the correct {{object}} to the incorrect
> {{datetime64}}. Thus the timezones were discarded in Arrow and the timestamps
> became invalid.
>
> h4. Expected Output
>
> (identical)
>
> {{0    2020-03-10 18:13:49.405598+00:00
> 1    2020-03-10 18:13:49.405626-23:59
> Name: 2, dtype: object
>
> 0    2020-03-10 18:13:49.405598+00:00
> 1    2020-03-10 18:13:49.405626-23:59
> Name: 2, dtype: object}}
>
> h4. Output of {{pd.show_versions()}}|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
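
The problem description above says that discarding the offsets makes the timestamps invalid. The following is a minimal sketch, not part of the original report, that uses only the standard library to show how far the second row's instant moves once its offset is dropped. It assumes a reader interprets the naive value as UTC. It also shows one possible way to keep the offset by storing it next to a UTC value; the helper names `split_offset` and `join_offset` are hypothetical and do not exist in pandas or PyArrow.

```python
from datetime import datetime, timedelta, timezone

# Second row of the report: wall-clock time with the offset timezone.min (-23:59).
aware = datetime(2020, 3, 10, 18, 13, 49, 405626, tzinfo=timezone.min)

# What the round-tripped column shows: same wall-clock digits, no offset.
naive = aware.replace(tzinfo=None)

# Assumption: a reader treats the naive value as UTC.
as_utc = naive.replace(tzinfo=timezone.utc)
shift = aware - as_utc  # distance between the original and the reinterpreted instant


def split_offset(dt):
    """Hypothetical helper: return (UTC datetime, UTC offset in whole seconds)."""
    return dt.astimezone(timezone.utc), int(dt.utcoffset().total_seconds())


def join_offset(utc_dt, offset_seconds):
    """Hypothetical helper: rebuild the aware datetime from the two stored parts."""
    return utc_dt.astimezone(timezone(timedelta(seconds=offset_seconds)))


utc_dt, offset_seconds = split_offset(aware)
restored = join_offset(utc_dt, offset_seconds)

print(shift)
print(restored.isoformat())
```

Storing the UTC instant and the offset as two separate values keeps each of them in a uniform type, so the original aware datetime can be rebuilt after reading. Whether that is acceptable depends on the application; the issue itself asks for the `object` column to survive the round trip unchanged.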