Hi Lucas,

The assessments from Wes and Li are right on. Just to add to that, and
unfortunately make things even more complicated: Spark does not always use
the config "spark.sql.session.timeZone", so it doesn't really help with
your example. That config would be applied if you instead generated the
timestamps through Spark SQL. I believe that to get your example working
properly you will need to add the two lines below to the top of your
script. Note that this changes the default system timezone for your whole
Python session:

import os
import time

os.environ["TZ"] = "UTC"  # force the process-wide default timezone to UTC
time.tzset()              # apply the change (available on Unix only)
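
For the Spark SQL path mentioned above (Spark 2.2+), a minimal sketch of
setting the session time zone when building the session (the app name is
just illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tz-example")
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate())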



On Tue, Aug 29, 2017 at 8:00 AM, Li Jin <ice.xell...@gmail.com> wrote:

> Lucas,
>
> Wes' explanation is correct. If you are using Spark 2.2, you can set the
> Spark config "spark.sql.session.timeZone" to "UTC".
>
> I have written a document explaining this. I can clean it up for
> ARROW-1425.
>
> On Mon, Aug 28, 2017 at 5:23 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > see https://issues.apache.org/jira/browse/ARROW-1425
> >
> > On Mon, Aug 28, 2017 at 12:32 PM, Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > hi Lucas,
> > >
> > > Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge
> > > of the Spark timestamp issue (which is a known issue, and not a bug
> > > per se) should be able to give some extra context about this.
> > >
> > > My understanding is that when you read timezone-naive data in Spark,
> > > it is treated as session-local by the Spark runtime, so the values
> > > that are written to Parquet will change based on the runtime time
> > > zone. I think you can resolve this by casting the Spark timestamps
> > > to UTC to force normalization, or by setting the runtime time zone
> > > to GMT/UTC. My apologies if I am mistaken about this.
> > >
> > > In Arrow, timestamps have two forms:
> > >
> > > * Time zone naive (where tz=None in Python): there is no notion of
> > > UTC or session-localness.
> > > * Time zone aware: the integer values are internally normalized to
> > > UTC.
> > >
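> > > A minimal pyarrow sketch of the two forms (the values here are just
> > > illustrative):
> > >
> > > import pyarrow as pa
> > > from datetime import datetime
> > >
> > > # tz-naive: no notion of UTC or session-localness
> > > naive = pa.array([datetime(2015, 7, 5, 23, 50)],
> > >                  type=pa.timestamp('us'))
> > > # tz-aware: the stored values are normalized to UTC
> > > aware = pa.array([datetime(2015, 7, 5, 23, 50)],
> > >                  type=pa.timestamp('us', tz='UTC'))
> > >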
> > > The difficulty is that when you have time zone naive data, Spark may
> > > interpret the values differently based on your system locale. This is
> > > a pretty serious rough edge in my opinion; we should at minimum add a
> > > guide to using Spark and pyarrow together in the pyarrow documentation
> > > so that these "gotchas" can be well explained in a single place.
> > >
> > > - Wes
> > >
> > > On Mon, Aug 28, 2017 at 12:20 PM, Lucas Pickup
> > > <lucas.tot0.pic...@gmail.com> wrote:
> > >> Hi all,
> > >>
> > >> Very sorry if people already responded to this at:
> > >> lucas.pic...@microsoft.com. There was an INVALID identifier attached
> > >> to the end of the reply address for some reason, which may have
> > >> caused replies to be lost.
> > >>
> > >> I've been messing around with Spark and PyArrow Parquet reading. In my
> > >> testing I've found that a Parquet file written by Spark containing a
> > >> datetime column results in different datetimes from Spark and
> > >> PyArrow.
> > >>
> > >> The attached script demonstrates this.
> > >>
> > >> Output:
> > >>
> > >> Spark reading the parquet file into a DataFrame:
> > >>
> > >> [Row(Date=datetime.datetime(2015, 7, 5, 23, 50)),
> > >>  Row(Date=datetime.datetime(2015, 7, 5, 23, 30))]
> > >>
> > >> PyArrow table has dates as UTC (7 hours ahead):
> > >>
> > >> <pyarrow.lib.TimestampArray object at 0x0000029F3AFE79A8>
> > >> [
> > >>   Timestamp('2015-07-06 06:50:00')
> > >> ]
> > >>
> > >> Pandas DF from pyarrow table has dates as UTC (7 hours ahead):
> > >>
> > >>                  Date
> > >> 0 2015-07-06 06:50:00
> > >> 1 2015-07-06 06:30:00
> > >>
> > >> I would've expected to end up with the same datetime from both
> > >> readers, since there was no timezone attached at any point. It's
> > >> just a date and time value.
> > >> Am I missing anything here? Or is this a bug?
> > >>
> > >> I attempted to intercept the timestamp values before pyarrow turns
> > >> them into Python objects, so I could add timezone information, which
> > >> may fix this issue:
> > >>
> > >> The goal is to qualify the TimestampValue with a timezone (by
> > >> creating a new column in the Arrow table based off the previous one).
> > >> If this can be done before the values are converted to Python, it may
> > >> fix the issue I was having. But it doesn't appear that I can create a
> > >> new timestamp-type column with the values from the old timestamp
> > >> column.
> > >>
> > >> Here is the code I'm using:
> > >>
> > >> import pyarrow as pa
> > >>
> > >> def chunkedToArray(data):
> > >>     # Flatten a chunked column into a stream of scalar values.
> > >>     for chunk in data.iterchunks():
> > >>         for value in chunk:
> > >>             yield value
> > >>
> > >> def datetimeColumnsAddTimezone(table):
> > >>     # Replace each tz-naive timestamp column with a tz-aware copy.
> > >>     for i, field in enumerate(table.schema):
> > >>         if field.type == pa.timestamp('ns'):
> > >>             newField = pa.field(field.name,
> > >>                                 pa.timestamp('ns', tz='GMT'),
> > >>                                 field.nullable, field.metadata)
> > >>             newArray = pa.array(
> > >>                 [val for val in chunkedToArray(table[i].data)],
> > >>                 pa.timestamp('ns', tz='GMT'))
> > >>             newColumn = pa.Column.from_array(newField, newArray)
> > >>             table = table.remove_column(i)
> > >>             table = table.add_column(i, newColumn)
> > >>     return table
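> > >>
> > >> A quick usage sketch for the function above (the file name is just
> > >> illustrative):
> > >>
> > >> import pyarrow.parquet as pq
> > >>
> > >> table = pq.read_table('spark_output.parquet')
> > >> table = datetimeColumnsAddTimezone(table)
> > >> print(table.to_pandas())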
> > >>
> > >> Cheers, Lucas Pickup
> >
>
