Quick follow-up. I'm trying to work around this myself in the meantime. The 
goal is to qualify the TimestampValue with a timezone (by creating a new column 
in the Arrow table based on the previous one). If this can be done before the 
values are converted to Python objects, it may fix the issue I was having. But it 
doesn't appear that I can create a new timestamp-typed column from the values 
of the old timestamp column.

Here is the code I'm using:

import pyarrow as pa

def chunkedToArray(data):
    # Flatten a chunked column into a single stream of values.
    for chunk in data.iterchunks():
        for value in chunk:
            yield value

def datetimeColumnsAddTimezone(table):
    # Rebuild every naive timestamp('ns') column as a GMT-qualified one.
    for i, field in enumerate(table.schema):
        if field.type == pa.timestamp('ns'):
            tzType = pa.timestamp('ns', tz='GMT')
            newField = pa.field(field.name, tzType, field.nullable, field.metadata)
            newArray = pa.array([val for val in chunkedToArray(table[i].data)], tzType)
            newColumn = pa.Column.from_array(newField, newArray)
            table = table.remove_column(i)
            table = table.add_column(i, newColumn)
    return table
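
To make the intent concrete, here is a minimal standard-library sketch (no Arrow involved) of the difference between labelling a naive value with a timezone and converting it. The fixed UTC-7 offset is only an assumption inferred from the 7-hour gap in the output quoted below; nothing in this thread confirms which zone the Spark side used.

```python
from datetime import datetime, timedelta, timezone

# First value PyArrow returned; it carries no timezone.
naive = datetime(2015, 7, 6, 6, 50)

# Labelling: attach UTC without touching the clock fields.
labelled = naive.replace(tzinfo=timezone.utc)

# Converting: express the same instant in another offset.
# ASSUMPTION: a fixed UTC-7 offset, inferred from the 7-hour gap.
assumed_local = timezone(timedelta(hours=-7))
converted = labelled.astimezone(assumed_local)

print(labelled.isoformat())
print(converted.isoformat())
```

Under that assumption, the converted value has the same clock fields as the first value Spark printed.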

Cheers, Lucas Pickup

From: Lucas Pickup [mailto:lucas.pic...@microsoft.com.INVALID]
Sent: Friday, August 25, 2017 3:23 PM
To: dev@arrow.apache.org
Subject: Reading Parquet datetime column gives different answer in Spark vs 
PyArrow

Hi all,

I've been messing around with Spark and PyArrow Parquet reading. In my testing 
I've found that a Parquet file written by Spark containing a datetime column, 
results in different datetimes from Spark and PyArrow.

The attached script demonstrates this.

Output:
Spark Reading the parquet file into a DataFrame:
[Row(Date=datetime.datetime(2015, 7, 5, 23, 50)), 
Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]

PyArrow table has dates as UTC (7 hours ahead)
<pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
[
  Timestamp('2015-07-06 06:50:00')
]
Pandas DF from pyarrow table has dates as UTC (7 hours ahead)
                 Date
0 2015-07-06 06:50:00
1 2015-07-06 06:30:00

I would've expected to end up with the same datetime from both readers, since 
there was no timezone attached at any point. It is just a date and time value.
Am I missing anything here, or is this a bug?
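
As a quick sanity check on the numbers above, this small sketch just subtracts the values reported by the two readers. The lists are copied by hand from the output in this message; they are not read from the Parquet file.

```python
from datetime import datetime, timedelta

# Values copied from the Spark output above.
spark_values = [datetime(2015, 7, 5, 23, 50), datetime(2015, 7, 5, 23, 30)]
# Values copied from the pandas DataFrame built from the PyArrow table.
pyarrow_values = [datetime(2015, 7, 6, 6, 50), datetime(2015, 7, 6, 6, 30)]

# Per-row difference between the two readers.
gaps = [p - s for p, s in zip(pyarrow_values, spark_values)]
same_gap = all(gap == timedelta(hours=7) for gap in gaps)
print(gaps, same_gap)
```

Both rows differ by the same amount, so this looks like a uniform offset and not a per-value problem.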

Cheers, Lucas Pickup

