[ 
https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens updated ARROW-16184:
---------------------------------------
    Component/s: Python

> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -------------------------------------------------------------------------
>
>                 Key: ARROW-16184
>                 URL: https://issues.apache.org/jira/browse/ARROW-16184
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Raphael Taylor-Davies
>            Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
> following code results in the schema changing when reading/writing a parquet 
> file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond 
> units)
> print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond 
> units)
> {code}
> This was closed as a limitation of the parquet 1.x format for representing 
> nanosecond timestamps. This is fine, however, the arrow schema embedded 
> within the parquet metadata still lists the data as being a nanosecond array.
>  
> This was discovered as part of the investigation into a bug report on the 
> arrow-rs parquet implementation - 
> [https://github.com/apache/arrow-rs/issues/1459]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to