[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jacob Wujciak-Jens updated ARROW-16184:
---------------------------------------
    Component/s: Python

> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -------------------------------------------------------------------------
>
>                 Key: ARROW-16184
>                 URL: https://issues.apache.org/jira/browse/ARROW-16184
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Raphael Taylor-Davies
>            Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the
> following code results in the schema changing when writing a Parquet file
> and reading it back.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
>
> # create a DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
>
> # create an Arrow table from the DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
>
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
>
> print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}
> This was closed as a limitation of the Parquet 1.x format, which cannot
> represent nanosecond timestamps. That is fine; however, the Arrow schema
> embedded within the Parquet metadata still lists the data as being a
> nanosecond array.
>
> This was discovered as part of the investigation into a bug report on the
> arrow-rs Parquet implementation:
> https://github.com/apache/arrow-rs/issues/1459
>
-- 
This message was sent by Atlassian Jira
(v8.20.1#820001)