Lucas, Wes' explanation is correct. If you are using Spark 2.2, you can set the Spark config "spark.sql.session.timeZone" to "UTC".
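For example, a minimal sketch (the session variable "spark" and the app name are assumptions here, not from the original script):

from pyspark.sql import SparkSession

# Build a session with the session time zone pinned to UTC.
spark = SparkSession.builder \
    .appName("arrow-timestamps") \
    .config("spark.sql.session.timeZone", "UTC") \
    .getOrCreate()

# Or, on an already-running Spark 2.2 session:
spark.conf.set("spark.sql.session.timeZone", "UTC")

With the session time zone pinned, timezone-naive values written to Parquet should no longer shift with the machine's local time zone.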
I have written a document explaining this. I can clean it up for ARROW-1425.

On Mon, Aug 28, 2017 at 5:23 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> see https://issues.apache.org/jira/browse/ARROW-1425
>
> On Mon, Aug 28, 2017 at 12:32 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > hi Lucas,
> >
> > Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge
> > of the Spark timestamp issue (which is a known issue, and not a bug
> > per se) should be able to give some extra context about this.
> >
> > My understanding is that when you read timezone-naive data in Spark,
> > it is treated as session-local by the Spark runtime, so the values
> > that are written to Parquet will change based on the runtime's local
> > time zone. I think you can resolve this by casting the Spark
> > timestamps to UTC to force normalization, or by setting the runtime
> > time zone to GMT/UTC. My apologies if I am mistaken about this.
> >
> > In Arrow, timestamps have two forms:
> >
> > * Timezone-naive (where tz=None in Python); there is no notion of
> >   UTC or session-localness.
> > * Timezone-aware, where the integer values are internally normalized
> >   to UTC.
> >
> > The difficulty is that when you have timezone-naive data, Spark may
> > interpret the values differently based on your system time zone.
> > This is a pretty serious rough edge in my opinion; we should at
> > minimum add a guide to using Spark and pyarrow together in the
> > pyarrow documentation so that these "gotchas" can be well explained
> > in a single place.
> >
> > - Wes
> >
> > On Mon, Aug 28, 2017 at 12:20 PM, Lucas Pickup
> > <lucas.tot0.pic...@gmail.com> wrote:
> >> Hi all,
> >>
> >> Very sorry if people have already responded to this at
> >> lucas.pic...@microsoft.com. There was an INVALID identifier attached
> >> to the end of the reply address for some reason, which may have
> >> caused replies to be lost.
> >>
> >> I've been messing around with Spark and PyArrow Parquet reading. In
> >> my testing I've found that a Parquet file written by Spark
> >> containing a datetime column results in different datetimes when
> >> read by Spark and by PyArrow.
> >>
> >> The attached script demonstrates this.
> >>
> >> Output:
> >>
> >> Spark reading the parquet file into a DataFrame:
> >> [Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
> >>  Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
> >>
> >> PyArrow table has dates as UTC (7 hours ahead):
> >> <pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
> >> [
> >>   Timestamp('2015-07-06 06:50:00')
> >> ]
> >>
> >> Pandas DF from pyarrow table has dates as UTC (7 hours ahead):
> >>                  Date
> >> 0 2015-07-06 06:50:00
> >> 1 2015-07-06 06:30:00
> >>
> >> I would've expected to end up with the same datetime from both
> >> readers, since there was no timezone attached at any point. It's
> >> just a date and time value. Am I missing anything here? Or is this
> >> a bug?
> >>
> >> I attempted to intercept the timestamp values before pyarrow turns
> >> them into Python objects so I could add timezone information, which
> >> may fix this issue:
> >>
> >> The goal is to qualify each TimestampValue with a timezone (by
> >> creating a new column in the arrow table based off the previous
> >> one). If this can be done before the values are converted to
> >> Python, it may fix the issue I was having. But it doesn't appear
> >> that I can create a new timestamp-typed column with the values from
> >> the old timestamp column.
> >> Here is the code I'm using:
> >>
> >> import pyarrow as pa
> >>
> >> def chunkedToArray(data):
> >>     # Flatten a chunked column into a single stream of values.
> >>     for chunk in data.iterchunks():
> >>         for value in chunk:
> >>             yield value
> >>
> >> def datetimeColumnsAddTimezone(table):
> >>     # Replace each timezone-naive timestamp column with a
> >>     # timezone-aware (GMT) column holding the same values.
> >>     for i, field in enumerate(table.schema):
> >>         if field.type == pa.timestamp('ns'):
> >>             newField = pa.field(field.name, pa.timestamp('ns', tz='GMT'),
> >>                                 field.nullable, field.metadata)
> >>             newArray = pa.array(list(chunkedToArray(table[i].data)),
> >>                                 pa.timestamp('ns', tz='GMT'))
> >>             newColumn = pa.Column.from_array(newField, newArray)
> >>             table = table.remove_column(i)
> >>             table = table.add_column(i, newColumn)
> >>     return table
> >>
> >> Cheers, Lucas Pickup
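For reference, a simpler workaround may be to attach the timezone after converting to pandas, rather than rebuilding the Arrow column. A minimal sketch (the column name "Date" is taken from Lucas's example; "example.parquet" is a stand-in path, not from the original script):

import pyarrow.parquet as pq

# Read the Spark-written file; the timestamp column comes back
# timezone-naive.
table = pq.read_table('example.parquet')
df = table.to_pandas()

# Declare the naive values to be UTC explicitly; after this the column
# is timezone-aware and can be converted to other zones without the
# values shifting again.
df['Date'] = df['Date'].dt.tz_localize('UTC')

This sidesteps the Column-rebuilding problem entirely, at the cost of doing the localization at the pandas level instead of inside the Arrow table.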