Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-09-05 Thread Bryan Cutler
Hi Lucas, The assessments from Wes and Li are right on. Just to add to that, and unfortunately make things even more complicated: Spark does not always use the config "spark.sql.session.timeZone", so it doesn't really help with your example. It would be used if instead you generated timestamps

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-29 Thread Li Jin
Lucas, Wes' explanation is correct. If you are using Spark 2.2, you can set the Spark config "spark.sql.session.timeZone" to "UTC". I have written a document explaining this. I can clean it up for ARROW-1425. On Mon, Aug 28, 2017 at 5:23 PM, Wes McKinney wrote: > see
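
To make the time zone point concrete, here is a minimal sketch using only the Python standard library. It does not call Spark or PyArrow and is not taken from the thread. It assumes a timestamp stored as microseconds since the Unix epoch in UTC, and shows how one reader that renders it in UTC and another that renders it in a session time zone would print different wall-clock values. The sample date, the -7 hour offset, and the names render and STORED_MICROS are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical stored value: one instant, kept as microseconds since the epoch (UTC).
STORED_MICROS = (datetime(2017, 8, 25, tzinfo=timezone.utc) - EPOCH) // timedelta(microseconds=1)


def render(micros, session_offset_hours):
    """Render a UTC instant as a naive wall-clock time in a fixed-offset zone."""
    instant = EPOCH + timedelta(microseconds=micros)
    session_tz = timezone(timedelta(hours=session_offset_hours))
    # Convert to the session zone, then drop the zone to mimic a naive display value.
    return instant.astimezone(session_tz).replace(tzinfo=None)


# A reader that displays the stored instant in UTC.
as_utc = render(STORED_MICROS, 0)

# A reader that displays the same instant in a session zone 7 hours behind UTC
# (an assumed offset, chosen only for illustration).
as_session = render(STORED_MICROS, -7)

print("UTC rendering:    ", as_utc)
print("Session rendering:", as_session)
print("Difference:       ", as_utc - as_session)
```

In a Spark 2.2 session, the setting mentioned above would typically be applied with spark.conf.set("spark.sql.session.timeZone", "UTC"); whether it takes effect for a particular read path is a separate question.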

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Wes McKinney
Hi Lucas, Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge of the Spark timestamp issue (which is a known issue, and not a bug per se) should be able to give some extra context about this. My understanding is that when you read timezone-naive data in Spark, it is treated as

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Lucas Pickup
Here is the pyspark script I used to see this difference. On Mon, 28 Aug 2017 at 09:20 Lucas Pickup wrote: > Hi all, > > Very sorry if people already responded to this at: > lucas.pic...@microsoft.com There was an INVALID identifier attached to > the end of the

Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Lucas Pickup
Hi all, Very sorry if people already responded to this at: lucas.pic...@microsoft.com There was an INVALID identifier attached to the end of the reply address for some reason which may have caused replies to be lost. I've been messing around with Spark and PyArrow Parquet reading. In my testing

RE: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-25 Thread Lucas Pickup
, 2017 3:23 PM To: dev@arrow.apache.org Subject: Reading Parquet datetime column gives different answer in Spark vs PyArrow Hi all, I've been messing around with Spark and PyArrow Parquet reading. In my testing I've found that a Parquet file written by Spark containing a datetime column

Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-25 Thread Lucas Pickup
Hi all, I've been messing around with Spark and PyArrow Parquet reading. In my testing I've found that a Parquet file written by Spark containing a datetime column, results in different datetimes from Spark and PyArrow. The attached script demonstrates this. Output: Spark Reading the parquet