Hi – I have a pandas dataframe that I want to output to parquet. The dataframe
has a timestamp field with timezone information. I need control over the schema
at output, so I am using ParquetWriter and a schema with the timestamp column
defined as:
('timestamp', pa.timestamp('s', tz=self._timezone)),
where self._timezone is a string, e.g. 'America/Los_Angeles'. I'm then writing
out the file with this code:
schema = pa.schema(fields)
table = pa.Table.from_pandas(self._df, schema=schema,
                             preserve_index=False).replace_schema_metadata()
writer = pq.ParquetWriter(
    os.path.join(file_path, '{}.parquet'.format(self._file_name)),
    schema=schema)
writer.write_table(table)
writer.close()
However, upon reading the resulting file, the timestamp is in UTC:
timestamp datetime64[ns, UTC]
But if I write the same pandas dataframe to parquet directly, the timestamp
stays localized. Is this expected behavior? I'm using pyarrow 1.0.0. I tried
playing with the 'flavor' argument of ParquetWriter, but that just seemed to
produce naïve UTC timestamps.
Thanks,
Dave