Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-09-05 Thread Bryan Cutler
Hi Lucas, The assessments from Wes and Li are right on. Just to add to that, and unfortunately make things even more complicated: Spark does not always use the config "spark.sql.session.timeZone", so it doesn't really help with your example. It would be used if instead you generated timestamps
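
A minimal sketch of the distinction Bryan describes, assuming Spark 2.2+ where string-to-timestamp casts and display go through the session time zone; the app name and sample data here are illustrative, not from the thread:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("session-tz-sketch")                 # illustrative name
             .config("spark.sql.session.timeZone", "UTC")
             .getOrCreate())

    # Timestamps produced inside Spark SQL (e.g. by casting a string) are
    # interpreted using the session time zone, so the config applies here...
    df = spark.createDataFrame([("2017-08-28 12:00:00",)], ["ts_str"])
    df.selectExpr("CAST(ts_str AS timestamp) AS ts").show(truncate=False)

    # ...whereas data that already arrived as naive timestamp values (the
    # Parquet-read case in this thread) is not re-interpreted by this config.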

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-29 Thread Li Jin
Lucas, Wes' explanation is correct. If you are using Spark 2.2, you can set the spark config "spark.sql.session.timeZone" to "UTC". I have written some documentation explaining this. I can clean it up for ARROW-1425. On Mon, Aug 28, 2017 at 5:23 PM, Wes McKinney wrote: > see
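
For reference, setting that config looks like this (a sketch, assuming Spark 2.2 or later):

    # On an existing SparkSession:
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    # Or at submit time:
    #   spark-submit --conf spark.sql.session.timeZone=UTC your_script.py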

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Wes McKinney
hi Lucas, Bryan Cutler, Holden Karau, Li Jin, or someone with deeper knowledge of the Spark timestamp issue (which is known behavior, and not a bug per se) should be able to give some extra context about this. My understanding is that when you read timezone-naive data in Spark, it is treated as
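
To see the pyarrow side of the comparison, reading the file directly returns the stored (timezone-naive) values without shifting them; the file path and column below are hypothetical:

    import pyarrow.parquet as pq

    table = pq.read_table("timestamps.parquet")   # hypothetical file
    print(table.to_pandas())   # naive values, exactly as stored in the file

    # Reading the same file through spark.read.parquet(...) displays the
    # values through the session's local time zone, which is where the
    # apparent difference comes from.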

Re: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-28 Thread Lucas Pickup
Here is the pyspark script I used to see this difference. On Mon, 28 Aug 2017 at 09:20 Lucas Pickup wrote: > Hi all, > > Very sorry if people already responded to this at: > lucas.pic...@microsoft.com There was an INVALID identifier attached to > the end of the
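
Lucas's script is not included in this archive snippet; a rough, hypothetical sketch of the kind of round trip that exposes the difference (paths, names, and values are made up):

    from datetime import datetime
    from pyspark.sql import SparkSession
    import pyarrow.parquet as pq

    spark = SparkSession.builder.getOrCreate()

    # Write a single naive timestamp from Spark to Parquet.
    df = spark.createDataFrame([(datetime(2017, 8, 28, 12, 0, 0),)], ["ts"])
    df.write.mode("overwrite").parquet("/tmp/ts_repro.parquet")

    # Read it back both ways and compare the wall-clock values shown.
    spark.read.parquet("/tmp/ts_repro.parquet").show(truncate=False)
    print(pq.read_table("/tmp/ts_repro.parquet").to_pandas())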

RE: Reading Parquet datetime column gives different answer in Spark vs PyArrow

2017-08-25 Thread Lucas Pickup
Quick follow up. I'm trying to work around this myself in the meantime. The goal is to qualify the TimestampValue with a timezone (by creating a new column in the Arrow table based on the previous one). If this can be done before the values are converted to Python, it may fix the issue I was
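
A sketch of that kind of workaround using the pyarrow table API; the file, column name, and timestamp unit are assumptions, and the exact Table methods vary between pyarrow versions:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pq.read_table("timestamps.parquet")   # hypothetical file
    i = table.schema.get_field_index("ts")        # hypothetical column name

    # Cast the naive timestamp column to an explicitly UTC-qualified type and
    # swap it back into the table, so consumers see a zone instead of "naive".
    aware = table.column(i).cast(pa.timestamp("us", tz="UTC"))
    table = table.set_column(i, pa.field("ts", pa.timestamp("us", tz="UTC")), aware)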