Hi – I have a pandas dataframe that I want to output to parquet. The dataframe
has a timestamp field with timezone information. I need control over the schema
at output, so I am using ParquetWriter and a schema with the timestamp column
defined as:
('timestamp', pa.timestamp('s', tz=self._timezone)),
where self._timezone is a string, e.g. 'America/Los_Angeles'. I'm then writing
out the file with this code:
schema = pa.schema(fields)
table = pa.Table.from_pandas(self._df, schema=schema,
                             preserve_index=False).replace_schema_metadata()
writer = pq.ParquetWriter(
    os.path.join(file_path, '{}.parquet'.format(self._file_name)),
    schema=schema)
writer.write_table(table)
writer.close()
However, upon reading the resulting file, the timestamp is in UTC:
timestamp datetime64[ns, UTC]
But if I write the same pandas dataframe to parquet directly, the timestamp
stays localized. Is this expected behavior? I'm using pyarrow 1.0.0. I tried
playing with the 'flavor' argument of ParquetWriter, but that just seemed to
produce naïve UTC timestamps.
Thanks,
Dave