Lucas,

Wes's explanation is correct. If you are using Spark 2.2, you can set the
Spark config "spark.sql.session.timeZone" to "UTC".

I have written some documentation explaining this. I can clean it up for
ARROW-1425.

On Mon, Aug 28, 2017 at 5:23 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> see https://issues.apache.org/jira/browse/ARROW-1425
>
> On Mon, Aug 28, 2017 at 12:32 PM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> > hi Lucas,
> >
> > Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge
> > of the Spark timestamp issue (which is known behavior, not a bug per se)
> > should be able to give some extra context about this.
> >
> > My understanding is that when you read timezone-naive data in Spark,
> > it is treated as session-local by the Spark runtime, and so the values
> > that are written to Parquet will change based on the runtime time zone. I
> > think you can resolve this by casting the Spark timestamps to UTC to
> > force normalization, or by setting the runtime time zone to GMT/UTC. My
> > apologies if I am mistaken about this.
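> >
> > For illustration, one way to do that from PySpark (a rough sketch;
> > "America/Los_Angeles" is just a placeholder for whatever the local zone is):
> >
> > from pyspark.sql import functions as F
> >
> > # df is assumed to be a DataFrame with a naive "Date" timestamp column.
> > # Interpret the naive timestamps as local wall-clock time and normalize
> > # them to UTC before writing, so the stored instants no longer depend on
> > # the machine's time zone.
> > df = df.withColumn("Date", F.to_utc_timestamp(F.col("Date"), "America/Los_Angeles"))
> > df.write.parquet("normalized.parquet")
> >
> > # Alternatively, pin the JVM time zone for the driver and executors:
> > #   spark-submit --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC \
> > #                --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC ...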
> >
> > In Arrow, timestamps have two forms:
> >
> > * Time zone naive (where tz=None in Python); there is no notion of UTC
> > or session-local interpretation.
> > * Time zone aware: the integer values are internally normalized to UTC
> > (see the sketch just below).
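> >
> > A quick pyarrow illustration of the two forms (the unit and zone here are
> > arbitrary):
> >
> > import pyarrow as pa
> >
> > naive = pa.timestamp('ns')             # tz is None: no UTC meaning attached
> > aware = pa.timestamp('ns', tz='UTC')   # values are stored as UTC instants
> >
> > print(naive.tz)   # None
> > print(aware.tz)   # 'UTC'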
> >
> > The difficulty is that when you have time zone naive data, Spark may
> > interpret the values differently based on your system locale. This is
> > a pretty serious rough edge in my opinion; we should at minimum add a
> > guide to using Spark and pyarrow together in the pyarrow documentation
> > so that these "gotchas" can be well explained in a single place.
> >
> > - Wes
> >
> > On Mon, Aug 28, 2017 at 12:20 PM, Lucas Pickup
> > <lucas.tot0.pic...@gmail.com> wrote:
> >> Hi all,
> >>
> >> Very sorry if people already responded to this at
> >> lucas.pic...@microsoft.com. An INVALID identifier was attached to the
> >> end of the reply address for some reason, which may have caused replies
> >> to be lost.
> >>
> >> I've been messing around with Spark and PyArrow Parquet reading. In my
> >> testing I've found that a Parquet file written by Spark containing a
> >> datetime column results in different datetimes from Spark and PyArrow.
> >>
> >> The attached script demonstrates this.
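> >>
> >> (A minimal sketch of such a script; the session setup and the
> >> 'dates.parquet' path are illustrative:)
> >>
> >> import datetime
> >> import pyarrow.parquet as pq
> >> from pyspark.sql import Row, SparkSession
> >>
> >> spark = SparkSession.builder.getOrCreate()
> >>
> >> # Write a small Parquet file with a naive datetime column from Spark.
> >> df = spark.createDataFrame([
> >>     Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
> >>     Row(Date=datetime.datetime(2015, 7, 5, 23, 30))])
> >> df.write.mode('overwrite').parquet('dates.parquet')
> >>
> >> # Read the same data back with Spark and with pyarrow and compare.
> >> print(spark.read.parquet('dates.parquet').collect())
> >> table = pq.read_table('dates.parquet')
> >> print(table.to_pandas())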
> >>
> >> Output:
> >>
> >> Spark reading the parquet file into a DataFrame:
> >>
> >> [Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
> >>  Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
> >>
> >> PyArrow table has dates as UTC (7 hours ahead):
> >>
> >> <pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
> >> [
> >>   Timestamp('2015-07-06 06:50:00')
> >> ]
> >>
> >> Pandas DF from pyarrow table has dates as UTC (7 hours ahead):
> >>
> >>                  Date
> >> 0 2015-07-06 06:50:00
> >> 1 2015-07-06 06:30:00
> >>
> >> I would've expected to end up with the same datetime from both readers,
> >> since there was no timezone attached at any point. It's just a date and
> >> time value.
> >> Am I missing anything here? Or is this a bug?
> >>
> >> I attempted to intercept the timestamp values before pyarrow turns them
> >> into Python objects, so I could add timezone information, which may fix
> >> this issue:
> >>
> >> The goal is to qualify the TimestampValue with a timezone (by creating a
> >> new column in the Arrow table based off the previous one). If this can be
> >> done before the values are converted to Python, it may fix the issue I
> >> was having. But it doesn't appear that I can create a new timestamp-type
> >> column with the values from the old timestamp column.
> >>
> >> Here is the code I'm using:
> >>
> >> import pyarrow as pa
> >>
> >> def chunkedToArray(data):
> >>     # Flatten a chunked column into a single stream of values.
> >>     for chunk in data.iterchunks():
> >>         for value in chunk:
> >>             yield value
> >>
> >> def datetimeColumnsAddTimezone(table):
> >>     # Rebuild every naive nanosecond-timestamp column as a GMT-qualified one.
> >>     for i, field in enumerate(table.schema):
> >>         if field.type == pa.timestamp('ns'):
> >>             newField = pa.field(field.name, pa.timestamp('ns', tz='GMT'),
> >>                                 field.nullable, field.metadata)
> >>             newArray = pa.array([val for val in chunkedToArray(table[i].data)],
> >>                                 pa.timestamp('ns', tz='GMT'))
> >>             newColumn = pa.Column.from_array(newField, newArray)
> >>             table = table.remove_column(i)
> >>             table = table.add_column(i, newColumn)
> >>     return table
> >>
> >> Cheers, Lucas Pickup
>
