Shubham Chaurasia created SPARK-34675:
-----------------------------------------
             Summary: TimeZone inconsistencies when JVM and session timezones are different
                 Key: SPARK-34675
                 URL: https://issues.apache.org/jira/browse/SPARK-34675
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.7
            Reporter: Shubham Chaurasia


Inserted the following data with UTC as both the JVM and the session timezone.

Spark-shell launch command
{code}
bin/spark-shell \
  --conf spark.hadoop.metastore.catalog.default=hive \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
  --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' \
  --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
{code}

Table creation and inserts
{code:scala}
sql("use ts").show
sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
sql("create table spark_orc(type string, t timestamp) stored as orc").show
sql("create table spark_avro(type string, t timestamp) stored as avro").show
sql("create table spark_text(type string, t timestamp) stored as textfile").show

sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
{code}

Used the following helper to check and verify the returned timestamps:
{code:scala}
scala> :paste
// Entering paste mode (ctrl-D to finish)

def showTs(db: String, tables: String*): org.apache.spark.sql.DataFrame = {
  sql("use " + db).show

  // Run a select against each table and collect the resulting DataFrames
  import scala.collection.mutable.ListBuffer
  val results = new ListBuffer[org.apache.spark.sql.DataFrame]()
  for (tbl <- tables) {
    val query = "select * from " + tbl
    println("Executing - " + query)
    results += sql(query)
  }

  // Print the JVM and session timezones in effect for this run
  println("user.timezone - " + System.getProperty("user.timezone"))
  println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
  println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))

  var unionDf = results(0)
  for (i <- 1 until results.length) {
    unionDf = unionDf.union(results(i))
  }

  // Augment each row with the epoch millis of the returned java.sql.Timestamp
  val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
  val renamed = augmented
    .withColumnRenamed("_1", "type")
    .withColumnRenamed("_2", "ts")
    .withColumnRenamed("_3", "millis")
  renamed.show(false)
  renamed
}

// Exiting paste mode, now interpreting.

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - UTC
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
+----------------------+-------------------+------------+
{code}
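As a sanity check (my own verification, not part of the original repro), the baseline millis can be reproduced with plain java.time; a minimal sketch:
{code:scala}
import java.time.{LocalDateTime, ZoneOffset}

// The inserted wall-clock value, interpreted as UTC
val inserted = LocalDateTime.of(1989, 1, 5, 1, 2, 3).toInstant(ZoneOffset.UTC)
println(inserted.toEpochMilli) // 599965323000, matching the millis column above
{code}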
1. Set the session timezone to America/Los_Angeles
{code:scala}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
+----------------------+-------------------+------------+
{code}

2. Started the shell (JVM) in the America/Los_Angeles timezone (which also sets the session timezone to America/Los_Angeles)
{code}
bin/spark-shell \
  --conf spark.hadoop.metastore.catalog.default=hive \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
  --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' \
  --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}

{code:scala}
scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = 10ff355c-318d-4cb8-870f-a388652133b1
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
+----------------------+-------------------+------------+
{code}

As 1 and 2 show, Parquet and the other formats behave differently.

In 1, {{1989-01-04 17:02:03|599965323000}} is correct for the {{America/Los_Angeles}} timezone: the value originally inserted in UTC was {{1989-01-05 01:02:03}}, which equals {{599965323000}} epoch millis.

In 2, only Parquet looks correct, while the other formats produce {{1989-01-05 01:02:03|599994123000}}, which should not be the case when a different timezone ({{America/Los_Angeles}}) is in effect. I think those values come from the individual file-format readers (Avro, ORC, text), which then do not convert the value to the {{America/Los_Angeles}} timezone. I saw the ORC reader adjusting the offset according to the writer and reader (JVM) timezones (Avro is probably doing the same), but the end value in Spark still does not follow {{spark.sql.session.timeZone}}.

Are there any guidelines/docs on how to use timezones with Spark?
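For what it's worth, the gap between the two millis values is exactly eight hours, the America/Los_Angeles UTC offset (PST) on that date, which is consistent with the offset adjustment described above being applied without a matching session-timezone conversion. A minimal sketch verifying the arithmetic with plain java.time (my own check, outside Spark):
{code:scala}
import java.time.{Duration, Instant, LocalDateTime, ZoneId}

val correct = Instant.ofEpochMilli(599965323000L) // what Parquet returns
val shifted = Instant.ofEpochMilli(599994123000L) // what ORC/Avro/text return

// The delta is exactly the America/Los_Angeles offset (PST, UTC-8) on 1989-01-05
println(Duration.between(correct, shifted)) // PT8H

// Rendering the correct instant in each zone reproduces both displayed strings
println(LocalDateTime.ofInstant(correct, ZoneId.of("UTC")))                 // 1989-01-05T01:02:03
println(LocalDateTime.ofInstant(correct, ZoneId.of("America/Los_Angeles"))) // 1989-01-04T17:02:03
{code}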