Hi all,

Very sorry if people already responded to this at lucas.pic...@microsoft.com. There was an INVALID identifier attached to the end of the reply address for some reason, which may have caused replies to be lost.
I've been messing around with Spark and PyArrow Parquet reading. In my testing I've found that a Parquet file written by Spark containing a datetime column results in different datetimes when read by Spark and by PyArrow. The attached script demonstrates this.

Output:

Spark, reading the Parquet file into a DataFrame:

    [Row(Date=datetime.datetime(2015, 7, 5, 23, 50)), Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]

PyArrow table has dates as UTC (7 hours ahead):

    <pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>[ Timestamp('2015-07-06 06:50:00')]

Pandas DF from the PyArrow table has dates as UTC (7 hours ahead):

                     Date
    0 2015-07-06 06:50:00
    1 2015-07-06 06:30:00

I would've expected to end up with the same datetime from both readers, since there was no timezone attached at any point. It is just a date and time value. Am I missing anything here? Or is this a bug?

I attempted to intercept the timestamp values before PyArrow turns them into Python objects, so that I could add timezone information, which may fix this issue. The goal is to qualify each TimestampValue with a timezone by creating a new column in the Arrow table based on the previous one. If this can be done before the values are converted to Python objects, it may fix the issue I was having. But it doesn't appear that I can create a new timestamp-typed column with the values from the old timestamp column.

Here is the code I'm using:

    import pyarrow as pa

    def chunkedToArray(data):
        # Yield every value from every chunk of a chunked column.
        for chunk in data.iterchunks():
            for value in chunk:
                yield value

    def datetimeColumnsAddTimezone(table):
        # Replace each naive timestamp('ns') column with a copy typed as GMT.
        for i, field in enumerate(table.schema):
            if field.type == pa.timestamp('ns'):
                newField = pa.field(field.name, pa.timestamp('ns', tz='GMT'), field.nullable, field.metadata)
                newArray = pa.array([val for val in chunkedToArray(table[i].data)], pa.timestamp('ns', tz='GMT'))
                newColumn = pa.Column.from_array(newField, newArray)
                table = table.remove_column(i)
                table = table.add_column(i, newColumn)
        return table

Cheers,
Lucas Pickup
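
P.S. To make the 7-hour gap concrete, here is a minimal sketch that uses only the Python standard library. It assumes the values Spark prints are local wall-clock times at a fixed UTC-7 offset; that offset is inferred from the output above, and the helper names are hypothetical, not part of Spark or PyArrow. It only shows that the two sets of values are consistent with a local-versus-UTC reading, not which reader is right.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the machine that produced the Spark output was at UTC-7.
LOCAL = timezone(timedelta(hours=-7))

def local_naive_to_utc_naive(value, tz=LOCAL):
    """Treat a naive datetime as wall-clock time in tz; return naive UTC."""
    return value.replace(tzinfo=tz).astimezone(timezone.utc).replace(tzinfo=None)

def utc_naive_to_local_naive(value, tz=LOCAL):
    """Treat a naive datetime as UTC; return naive wall-clock time in tz."""
    return value.replace(tzinfo=timezone.utc).astimezone(tz).replace(tzinfo=None)

# The two values Spark printed for the Date column.
spark_values = [datetime(2015, 7, 5, 23, 50), datetime(2015, 7, 5, 23, 30)]

# Shifting them to UTC gives the values seen in the pandas DataFrame.
as_utc = [local_naive_to_utc_naive(v) for v in spark_values]
print(as_utc)
```

Applying utc_naive_to_local_naive to the shifted values gives back the datetimes Spark shows, so under this assumption the two readers are displaying the same instants in different zones.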