[jira] [Updated] (ARROW-8066) PyArrow discards timezones

Markovtsev Vadim (Jira) Tue, 10 Mar 2020 13:03:12 -0700


     [ https://issues.apache.org/jira/browse/ARROW-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Markovtsev Vadim updated ARROW-8066:
------------------------------------
    Description: 
The original description is at 
[https://github.com/pandas-dev/pandas/issues/32587]

 
h4. Code Sample, a copy-pastable example if possible

{code:python}
import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame.from_records([
    (1, datetime.now().replace(tzinfo=timezone.utc)),
    (2, datetime.now().replace(tzinfo=timezone.min))],
    columns=["1", "2"])

print(df["2"])
print()

df.to_feather("/tmp/1")
df2 = pd.read_feather("/tmp/1")

print(df2["2"])
{code}

This code will output:

{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0   2020-03-10 18:13:49.405598
1   2020-03-10 18:13:49.405626
Name: 2, dtype: datetime64[ns]
{noformat}

h4. Problem description

The round-trip dtype changed from the correct {{object}} to the incorrect {{datetime64}}. Thus the timezones were discarded in Arrow and the timestamps became invalid.

h4. Expected Output

(identical)

{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object
{noformat}

h4. Output of {{pd.show_versions()}}
 
 
 
 

  was:
The description is at [https://github.com/pandas-dev/pandas/issues/32587]

#### Code Sample, a copy-pastable example if possible

{code:python}
import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame.from_records([
    (1, datetime.now().replace(tzinfo=timezone.utc)),
    (2, datetime.now().replace(tzinfo=timezone.min))],
    columns=["1", "2"])

print(df["2"])
print()

df.to_feather("/tmp/1") 
df2 = pd.read_feather("/tmp/1")

print(df2["2"])
{code}


This code will output:


{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0   2020-03-10 18:13:49.405598
1   2020-03-10 18:13:49.405626
Name: 2, dtype: datetime64[ns]
{noformat}


#### Problem description

The round-trip dtype changed from the correct `object` to the incorrect 
`datetime64`. Thus the timezones were discarded in Arrow and the timestamps 
became invalid.
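
To make "became invalid" concrete, here is a minimal standard-library sketch (no pandas or pyarrow involved) of what dropping the offset does to the second row. It assumes, based on the output above, that the wall-clock fields survive the round trip unchanged while the offset is lost; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# The second row from the report: wall-clock time with offset -23:59.
aware_min = datetime(2020, 3, 10, 18, 13, 49, 405626, tzinfo=timezone.min)

# The instant this value actually denotes, expressed in UTC.
true_utc = aware_min.astimezone(timezone.utc)

# What the round trip appears to return: same wall-clock fields, no offset.
stripped = aware_min.replace(tzinfo=None)

# If the stripped value is later read as UTC, it is off by the lost offset.
shift = true_utc - stripped.replace(tzinfo=timezone.utc)

print(true_utc)  # 2020-03-11 18:12:49.405626+00:00
print(stripped)  # 2020-03-10 18:13:49.405626
print(shift)     # 23:59:00
```

The row with `timezone.utc` is unaffected by this, which is why only values with a non-zero offset end up pointing at the wrong instant.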

#### Expected Output

(identical)


{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object
{noformat}
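
Until mixed offsets survive the round trip, one possible workaround is to store the UTC instant and the offset as two separate values and rebuild the aware datetime after reading. The sketch below uses only the standard library; `split_offset` and `join_offset` are hypothetical helper names, not pandas or pyarrow APIs, and applying them to DataFrame columns before `to_feather` and after `read_feather` is left as an assumption.

```python
from datetime import datetime, timedelta, timezone


def split_offset(dt):
    """Split an aware datetime into (naive UTC datetime, offset in seconds)."""
    offset_seconds = int(dt.utcoffset().total_seconds())
    utc_naive = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return utc_naive, offset_seconds


def join_offset(utc_naive, offset_seconds):
    """Rebuild the aware datetime from the two stored values."""
    tz = timezone(timedelta(seconds=offset_seconds))
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(tz)


original = datetime(2020, 3, 10, 18, 13, 49, 405626, tzinfo=timezone.min)
stored = split_offset(original)    # two plain values, safe to serialize
restored = join_offset(*stored)

print(restored)  # 2020-03-10 18:13:49.405626-23:59
```

The restored value compares equal to the original and carries the same offset, at the cost of one extra integer per timestamp.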


#### Output of ``pd.show_versions()``

{noformat}
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.3.0-40-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.17.4
pytz             : 2019.2
dateutil         : 2.7.3
pip              : 19.3.1
setuptools       : 42.0.1
Cython           : 0.29.14
pytest           : 5.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.10.0
pandas_datareader: None
bs4              : 4.8.1
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
pytest           : 5.3.1
pyxlsb           : None
s3fs             : None
scipy            : 1.2.1
sqlalchemy       : 1.3.12
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
{noformat}




> PyArrow discards timezones
> --------------------------
>
>                 Key: ARROW-8066
>                 URL: https://issues.apache.org/jira/browse/ARROW-8066
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>            Reporter: Markovtsev Vadim
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
