Takuya Ueshin created SPARK-22395:
-------------------------------------

             Summary: Fix the behavior of timestamp values for Pandas to respect session timezone
                 Key: SPARK-22395
                 URL: https://issues.apache.org/jira/browse/SPARK-22395
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.3.0
            Reporter: Takuya Ueshin
When converting a Pandas DataFrame/Series from/to a Spark DataFrame using {{toPandas()}} or pandas UDFs, timestamp values respect the Python system timezone instead of the session timezone.

For example, suppose we use {{"America/Los_Angeles"}} as the session timezone and have a timestamp value {{"1970-01-01 00:00:01"}} in that timezone. I'm in Japan, so my Python timezone is {{"Asia/Tokyo"}}. The current {{toPandas()}} returns the following:

{noformat}
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
>>> df.show()
+-------------------+
|                 ts|
+-------------------+
|1970-01-01 00:00:01|
+-------------------+

>>> df.toPandas()
                   ts
0 1970-01-01 17:00:01
{noformat}

As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it respects the Python timezone. As discussed in https://github.com/apache/spark/pull/18664, we consider this behavior a bug; the value should be {{"1970-01-01 00:00:01"}}.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
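The timezone arithmetic behind the example can be sketched with plain Python datetimes, independent of Spark (a minimal illustration: the epoch value 28801 is taken from the example above, and the instant is simply rendered in the two timezones involved):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Epoch seconds underlying the example timestamp value.
epoch_seconds = 28801

# The instant itself is fixed; only its string rendering
# depends on which timezone it is displayed in.
instant = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# Rendered in the session timezone (what df.show() displays).
la = instant.astimezone(ZoneInfo("America/Los_Angeles"))
print(la.strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 00:00:01

# Rendered in the reporter's system timezone (what toPandas() returned).
tokyo = instant.astimezone(ZoneInfo("Asia/Tokyo"))
print(tokyo.strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 17:00:01
```

Both strings name the same instant (1970-01-01 08:00:01 UTC); the bug is only about which timezone is used to render it.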