[ 
https://issues.apache.org/jira/browse/ARROW-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118972#comment-17118972
 ] 

Joris Van den Bossche commented on ARROW-8944:
----------------------------------------------

Yes, for existing parquet files that were written with the "0001-01-01" dates 
in them, that seems like a decent solution. 
When writing new parquet files, it is best to ensure that columns holding 
datetimes do not have "object" dtype (for string columns, "object" dtype is 
normal), which avoids the whole issue.
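
A minimal sketch of such a check before writing, not a definitive recipe. The 
helper name `object_columns` and the sample frame are hypothetical, and the 
frame is built with an explicit "object" dtype only to mimic the problematic 
column; `pandas.to_datetime(..., utc=True)` is assumed to be an acceptable 
conversion for the data at hand.

```python
import pandas as pd


def object_columns(df):
    """Return the names of columns stored with the generic "object" dtype."""
    return [name for name, dtype in df.dtypes.items() if dtype == object]


# Hypothetical frame: a timestamp column that ended up with "object" dtype.
df = pd.DataFrame({
    "id": [1, 2],
    "timestamp": pd.Series(
        [pd.NaT, pd.Timestamp("2020-03-02T03:03:17.791062Z")], dtype=object
    ),
})

before = object_columns(df)

# Convert the column to a proper tz-aware datetime dtype before writing.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

after = object_columns(df)
print(before, after, df.dtypes["timestamp"])
```

String columns would still show up in the helper's result, so the list needs 
a manual look rather than a blanket conversion.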

> [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8944
>                 URL: https://issues.apache.org/jira/browse/ARROW-8944
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.17.0, 0.17.1
>         Environment: pandas==1.0.3
> pyarrow==0.17.1
> Python==3.7.6 @ Windows 10 64Bit
>            Reporter: Daniel Figus
>            Priority: Major
>
> The following pandas -> parquet -> pandas roundtrip raises an out of bounds 
> timestamp error with pyarrow 0.17.0 and 0.17.1:
> {code:python}
> import pandas
> target = 'ts_roundtrip.parquet'
> dataframe = pandas.DataFrame({'id':[1,2,3],'timestamp':['', '', '']})
> dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'],errors='raise')
> dataframe2 = pandas.DataFrame({'id':[4,5,6,7],'timestamp':['', '2020-03-02T03:03:17.791062Z','','']})
> dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'],errors='raise')
> dataframe = dataframe.append(dataframe2)
> print(dataframe.head(10))
> dataframe.to_parquet(target, coerce_timestamps=None, index=False, version='2.0')
> dataframe_new = pandas.read_parquet(target)
> print(dataframe_new.head())
> {code}
> Output:
> {noformat}
>    id                         timestamp
> 0   1                               NaT
> 1   2                               NaT
> 2   3                               NaT
> 0   4                               NaT
> 1   5  2020-03-02 03:03:17.791062+00:00
> 2   6                               NaT
> 3   7                               NaT
> Traceback (most recent call last):
>   File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
>     dataframe_new = pandas.read_parquet(target)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125, in read
>     path, columns=columns, **kwargs
>   File "pyarrow\array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 766, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 1102, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
>   File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
> {noformat}
> Background: 
>  We have a dataset with a sparsely populated timestamp column that 
> originates from many JSON files. It is very likely that some of those 
> JSON files contain an empty string instead of a timestamp (a string in ISO 
> format). Each JSON file was read into a pandas dataframe, the timestamp 
> column was cast to datetime, and all dataframes were appended. That was 
> done with pyarrow<0.17.0; those parquet files can no longer be read 
> and produce the error message above as well.
> A closer look at our old parquet files shows that the NaTs are converted to 
> "1754-08-30 22:43:41.128654848" when read back into a pandas dataframe :(. 
> You get the same result when you run the above code with pyarrow==0.16.0. 
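
The two values in the report can be related with plain integer arithmetic. 
The sketch below uses only the Python standard library; the helper 
`wrap_int64` is hypothetical and merely simulates a signed 64-bit overflow. 
It shows that the microsecond value from the error message is 0001-01-01, and 
that scaling it to nanoseconds without a bounds check would wrap around to 
the 1754 timestamp quoted above. That the older pyarrow versions overflowed 
in exactly this way is an assumption; the arithmetic is only consistent with 
it.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Value from the ArrowInvalid message, in microseconds since the epoch.
MICROS = -62135596800000000
as_date = EPOCH + timedelta(microseconds=MICROS)


def wrap_int64(value):
    """Reduce an integer to the signed 64-bit range, as an overflow would."""
    return (value + 2**63) % 2**64 - 2**63


# Scaling to nanoseconds leaves the int64 range, so the value wraps around.
nanos = MICROS * 1000
wrapped = wrap_int64(nanos)

# Split into whole seconds and leftover nanoseconds (datetime lacks ns).
seconds, remainder_ns = divmod(wrapped, 10**9)
wrapped_date = EPOCH + timedelta(seconds=seconds)

print(as_date.isoformat())
print(f"{wrapped_date:%Y-%m-%d %H:%M:%S}.{remainder_ns:09d}")
```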



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
