[ https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108736#comment-17108736 ]

Rauli Ruohonen commented on ARROW-8816:
---------------------------------------

Ah, I see. I thought the output was wrong because fastparquet also reads it 
incorrectly. But exercising both through pandas is not an independent test, 
since pandas is shared between the two. Checking with parquet-tools, the output 
does look correct (9246182400000 is 2263-01-01 00:00:00, and the pandas 
metadata field gives "datetime" for pandas_type and "object" for numpy_type; 
AFAICS the reader has no basis to assume that an unchecked cast to datetime64 
would be safe).
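As a quick sanity check of that value, plain Python arithmetic (no parquet 
reader involved) agrees with parquet-tools:

{code:python}
import datetime

# 9246182400000 is milliseconds since the Unix epoch (timestamp[ms]).
epoch = datetime.datetime(1970, 1, 1)
print(epoch + datetime.timedelta(milliseconds=9_246_182_400_000))
# -> 2263-01-01 00:00:00
# The largest datetime64[ns] value is 2262-04-11 23:47:16.854775807,
# so this timestamp cannot be cast to nanoseconds without overflow.
{code}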

Still, it's something of a pitfall that you can successfully save data (using 
default options) and then fail to load it with the same software (again using 
default options). If timestamp_as_object is required to read the data back, 
one could symmetrically require it to write the data too, avoiding surprises 
at load time.
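For what it's worth, the read side can be worked around by going through 
pyarrow.parquet directly; a minimal sketch, assuming a pyarrow build that 
already has the timestamp_as_object option on to_pandas():

{code:python}
import datetime
import pandas as pd
import pyarrow.parquet as pq

# Writing succeeds with default options...
df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
df.to_parquet('foo.parquet', engine='pyarrow', compression=None)

# ...and while pd.read_parquet fails on the ns cast, reading via
# pyarrow.parquet and requesting Python datetime objects succeeds.
table = pq.read_table('foo.parquet')
print(table.to_pandas(timestamp_as_object=True))
{code}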

OTOH, raising an exception when correct output can actually be produced would 
also be slightly odd. One solution would be a timestamp_as_object='infer' 
option (instead of just True/False) as the default: the current writing 
behavior would then be matched by symmetric reading behavior that produces 
datetime64 when possible, and datetime objects when not.
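To make the idea concrete, 'infer' could behave like the following sketch; 
the option name and the scan-all-values strategy are illustrations only, not 
an existing API:

{code:python}
import pandas as pd
import pyarrow as pa

# Illustration of timestamp_as_object='infer': fall back to Python
# datetime objects only when a timestamp value does not fit in pandas'
# nanosecond-based datetime64 range. (A full scan and naive handling
# of time zones make this a sketch, not a real implementation.)
def to_pandas_infer(table: pa.Table) -> pd.DataFrame:
    lo = pd.Timestamp.min.to_pydatetime(warn=False)
    hi = pd.Timestamp.max.to_pydatetime(warn=False)
    for column in table.itercolumns():
        if pa.types.is_timestamp(column.type):
            for value in column.to_pylist():
                if value is not None and not lo <= value <= hi:
                    return table.to_pandas(timestamp_as_object=True)
    return table.to_pandas()
{code}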

From one pragmatic perspective it'd be safer to raise an exception when trying 
to write these things unless explicitly requested, because there are readers 
in common use that fail on them (such as current pyarrow and fastparquet). 
Maybe the reasoning behind write_table defaulting to parquet version 1.0 
output instead of 2.0 is similar...?
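A user-side approximation of such a write-time check, with the 
nanosecond-only Timestamp of the pandas versions discussed here (the helper 
name is hypothetical):

{code:python}
import datetime
import pandas as pd

# Hypothetical pre-write guard: raise before writing if any datetime in
# an object column falls outside the datetime64[ns] range, instead of
# letting readers discover the problem after the fact.
def assert_ns_safe(df: pd.DataFrame) -> None:
    for name, col in df.items():
        if col.dtype != object:
            continue
        for value in col:
            if isinstance(value, datetime.datetime):
                try:
                    pd.Timestamp(value)  # overflows past 2262-04-11 on ns-only pandas
                except pd.errors.OutOfBoundsDatetime as exc:
                    raise ValueError(
                        f'column {name!r} is not round-trippable: {value}'
                    ) from exc
{code}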

IMHO the important thing is to always be able to read back what one wrote 
(possibly with wider types) if the write was successful, provided that one 
uses the same pyarrow version and the default options for both reading and 
writing.

> [Python] Year 2263 or later datetimes get mangled when written using pandas
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8816
>                 URL: https://issues.apache.org/jira/browse/ARROW-8816
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.17.0
>         Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, 
> python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, 
> python 3.8.2, ubuntu 20.04 (linux).
>            Reporter: Rauli Ruohonen
>            Priority: Major
>
> Using pyarrow 0.17.0, this
>  
> {code:python}
> import datetime
> import pandas as pd
> def try_with_year(year):
>     print(f'Year {year:_}:')
>     df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
>     df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
>     try:
>         print(pd.read_parquet('foo.parquet', engine='pyarrow'))
>     except Exception as exc:
>         print(repr(exc))
>     print()
> try_with_year(2_263)
> try_with_year(2_262)
> {code}
>  
> prints
>  
> {noformat}
> Year 2_263:
> ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 9246182400000')
> Year 2_262:
>            x
> 0 2262-01-01{noformat}
> and using pyarrow 0.16.0, it prints
> 
> {noformat}
> Year 2_263:
>                               x
> 0 1678-06-12 00:25:26.290448384
> Year 2_262:
>            x
> 0 2262-01-01{noformat}
> The issue is that 2263-01-01 is out of bounds for a timestamp stored using 
> epoch nanoseconds, but not out of bounds for a Python datetime.
> While pyarrow 0.17.0 refuses to read the erroneous output, it is still 
> possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or 
> fastparquet), yielding the same result as with 0.16.0 above (i.e. only 
> reading has changed in 0.17.0, not writing). It would be better if an error 
> were raised when attempting to write the file instead of silently producing 
> erroneous output.
> The reason I suspect this is a pyarrow issue instead of a pandas issue is 
> this modified example:
>  
> {code:python}
> import datetime
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
> table = pa.Table.from_pandas(df)
> print(table[0])
> try:
>     print(table.to_pandas())
> except Exception as exc:
>     print(repr(exc))
> {code}
> which prints
> 
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
> ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 9246182400000000'){noformat}
> on pyarrow 0.17.0 and
> 
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
>                               x
> 0 1678-06-12 00:25:26.290448384{noformat}
> on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, 
> and pyarrow prints the correct timestamp when asked to render it as a string 
> (so it was not lost inside pandas), yet the 
> pa.Table.from_pandas(df).to_pandas() round-trip fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
