Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/18664
  
    To Wes's concern: I think we are only dealing with values in UTC here; both
Spark and Arrow internally represent timestamps as microseconds since the epoch.
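
    A rough illustration of that internal representation (my own sketch, assuming
pyarrow is available; not code from this PR):

    ```python
    import datetime
    import pyarrow as pa

    # Arrow stores a timestamp column as int64 microseconds since the Unix
    # epoch; the timezone on the type is metadata only.
    ts = datetime.datetime(2017, 7, 17, 12, 0, tzinfo=datetime.timezone.utc)
    arr = pa.array([ts], type=pa.timestamp('us', tz='UTC'))
    print(arr.type)               # timestamp[us, tz=UTC]
    print(arr.cast(pa.int64()))   # the raw microseconds-since-epoch values
    ```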
    
    To the two issues Bryan and Ueshin brought up:
    Issue 1:
    I agree with Ueshin that we should stick to `SESSION_LOCAL_TIMEZONE`.
    Bryan brought up a good point that in PySpark, `df.toPandas()`,
`df.collect()`, and Python UDFs (through `Timestamp.fromInternal`) don't
respect `SESSION_LOCAL_TIMEZONE`, which is confusing and inconsistent with
Spark SQL behavior such as `df.show()`. Since we will be inconsistent either
with Spark SQL (`df.show()`) or with PySpark (i.e., the default
`df.toPandas()`), I'd rather we do the right thing here (use
`SESSION_LOCAL_TIMEZONE`) and fix the other PySpark behavior separately.
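
    For reference, a rough sketch of the inconsistency described above
(illustrative only; the exact output depends on the driver's system timezone):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

    df = spark.sql("SELECT CAST('2017-07-17 00:00:00' AS TIMESTAMP) AS ts")
    df.show()              # rendered using the session-local timezone
    print(df.toPandas())   # currently uses the driver's system timezone instead
    print(df.collect())    # likewise returns datetimes in the system timezone
    ```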
    
    Issue 2:
    I agree with Bryan that we should leave the timezone as is.
    I don't think there is a performance issue because, as Wes mentioned, it's
just a metadata operation. Converting it back to the system timezone defeats
the purpose of using the session timezone, and throwing away the tzinfo seems
unnecessary.
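
    A minimal sketch of why this is cheap (my own illustration with pandas, not
code from the PR): `tz_convert` only rewrites the timezone metadata, leaving the
underlying epoch values untouched:

    ```python
    import pandas as pd

    s = pd.Series(pd.to_datetime(['2017-07-17 00:00:00']).tz_localize('UTC'))
    converted = s.dt.tz_convert('America/Los_Angeles')

    # Both Series point at the same instants; only the tz metadata differs.
    print((s.values == converted.values).all())  # True
    print(converted.dt.tz)                       # America/Los_Angeles
    ```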


