[jira] [Updated] (ARROW-8066) PyArrow discards timezones

Markovtsev Vadim (Jira) Tue, 10 Mar 2020 13:06:11 -0700


     [ https://issues.apache.org/jira/browse/ARROW-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Markovtsev Vadim updated ARROW-8066:
------------------------------------
    Description: 
The original description is at 
[https://github.com/pandas-dev/pandas/issues/32587]

h3. Code Sample, a copy-pastable example if possible

{code:python}
import pandas as pd
from datetime import datetime, timezone

# Column "2" holds tz-aware datetimes with two different UTC offsets,
# so pandas stores it with dtype=object.
df = pd.DataFrame.from_records([
    (1, datetime.now().replace(tzinfo=timezone.utc)),
    (2, datetime.now().replace(tzinfo=timezone.min))],
    columns=["1", "2"])

print(df["2"])
print()

# Round-trip the frame through a Feather file.
df.to_feather("/tmp/1")
df2 = pd.read_feather("/tmp/1")

print(df2["2"])
{code}


This code will output:

{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0   2020-03-10 18:13:49.405598
1   2020-03-10 18:13:49.405626
Name: 2, dtype: datetime64[ns]

{noformat}

h3. Problem description

After the Feather round trip, the dtype of column `2` changes from the correct 
`object` to the incorrect `datetime64[ns]`. The UTC offsets are discarded on 
the way through Arrow, so the restored timestamps no longer identify the 
original instants.
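
To illustrate why losing the offsets matters, here is a minimal standard-library sketch. It is not part of the original report: the fixed wall-clock value is borrowed from the output above, and the variable names are illustrative. It shows that the two values are 23 hours 59 minutes apart as instants, yet become indistinguishable once the offsets are stripped.

```python
from datetime import datetime, timezone

# Same wall-clock reading, two different UTC offsets (as in the report).
wall = datetime(2020, 3, 10, 18, 13, 49, 405626)
a = wall.replace(tzinfo=timezone.utc)   # +00:00
b = wall.replace(tzinfo=timezone.min)   # -23:59

# Aware datetimes are compared and subtracted as absolute instants.
gap = b - a
print(gap)                 # 23:59:00

# Dropping tzinfo keeps only the wall-clock reading.
naive_a = a.replace(tzinfo=None)
naive_b = b.replace(tzinfo=None)
print(naive_a == naive_b)  # True: the 23:59 difference is gone
```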

h3. Expected Output

The second printout should be identical to the first:

{noformat}
0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object

0    2020-03-10 18:13:49.405598+00:00
1    2020-03-10 18:13:49.405626-23:59
Name: 2, dtype: object
{noformat}
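
One way to keep the offsets across a storage step that drops them is to save each value as a UTC instant plus its offset in seconds, and rebuild the aware value after reading. The sketch below is an assumption for illustration only: it is not proposed in the report and is not an API of pandas or PyArrow. `encode` and `decode` are hypothetical helper names, and only the standard library is used.

```python
from datetime import datetime, timedelta, timezone


def encode(dt):
    """Split an aware datetime into (naive UTC datetime, offset in seconds)."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("expected a timezone-aware datetime")
    utc_naive = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return utc_naive, int(offset.total_seconds())


def decode(utc_naive, offset_seconds):
    """Rebuild the aware datetime with its original fixed offset."""
    tz = timezone(timedelta(seconds=offset_seconds))
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(tz)


# Hypothetical sample values mirroring the two offsets in the report.
wall = datetime(2020, 3, 10, 18, 13, 49, 405626)
original = [wall.replace(tzinfo=timezone.utc),
            wall.replace(tzinfo=timezone.min)]

encoded = [encode(dt) for dt in original]          # offset-free, storable pairs
restored = [decode(utc, secs) for utc, secs in encoded]

for dt in restored:
    print(dt)
```

The two parts of each pair are a naive datetime and an integer, so they could be kept in two ordinary columns. The cost is that the caller has to apply `decode` after reading.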


h3. Output of ``pd.show_versions()``


{noformat}
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.3.0-40-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.17.4
pytz             : 2019.2
dateutil         : 2.7.3
pip              : 19.3.1
setuptools       : 42.0.1
Cython           : 0.29.14
pytest           : 5.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.10.0
pandas_datareader: None
bs4              : 4.8.1
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
pytest           : 5.3.1
pyxlsb           : None
s3fs             : None
scipy            : 1.2.1
sqlalchemy       : 1.3.12
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

{noformat}




> PyArrow discards timezones
> --------------------------
>
>                 Key: ARROW-8066
>                 URL: https://issues.apache.org/jira/browse/ARROW-8066
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>            Reporter: Markovtsev Vadim
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
