[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342311#comment-17342311 ]
Shubham Chaurasia commented on SPARK-34675:
-------------------------------------------

Thanks for the previous investigations [~maxgekk] [~dongjoon].

I saw something strange on the latest master. I ran the same experiment and found that for a given timezone T1, whether it is set through the shell (export TZ) or passed via {{extraJavaOptions}}, the different file formats return different timestamp values for the same data. My system timezone was UTC.

1) I get the following values
{code}
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles

+------------------------+-------------------+------------+
|type                    |timestamp          |millis      |
+------------------------+-------------------+------------+
|FROM BEELINE-EXT PARQUET|1989-01-04 16:00:00|599961600000|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|599990400000|
|FROM BEELINE-EXT AVRO   |1989-01-04 16:00:00|599961600000|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|599990400000|
+------------------------+-------------------+------------+
{code}
when I either change the shell timezone, e.g.
{code}
export TZ=America/Los_Angeles
{code}
or pass it via extraJavaOptions, e.g.
{code}
bin/spark-shell --master local --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
{code}
I tested not only with America/Los_Angeles but also with Asia/Kolkata, and saw the same behavior with the above steps.
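The two distinct millis values in the table above differ by exactly the America/Los_Angeles offset at that date. A minimal standalone java.time sketch (plain JVM arithmetic, no Spark involved; the class and method names are mine) reproduces both numbers from the single wall-clock 1989-01-05 00:00:00:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TzCheck {
    // Epoch millis of the wall-clock 1989-01-05 00:00:00 interpreted in a given zone
    static long wallClockMillis(ZoneId zone) {
        return LocalDateTime.of(1989, 1, 5, 0, 0, 0).atZone(zone).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long utc = wallClockMillis(ZoneOffset.UTC);                  // Parquet/Avro millis
        long la = wallClockMillis(ZoneId.of("America/Los_Angeles")); // ORC/text millis
        System.out.println("UTC millis: " + utc);                 // 599961600000
        System.out.println("LA millis:  " + la);                  // 599990400000
        System.out.println("gap hours:  " + (la - utc) / 3600000L); // 8 (PST offset in January)
    }
}
```

So the ORC/text rows carry the same wall-clock re-interpreted in the reader JVM's zone, which is exactly an 8-hour shift of the underlying instant.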
Result with Asia/Kolkata:
{code}
user.timezone - Asia/Kolkata
TimeZone.getDefault - Asia/Kolkata
spark.sql.session.timeZone - Asia/Kolkata

+------------------------+-------------------+------------+
|type                    |timestamp          |millis      |
+------------------------+-------------------+------------+
|FROM BEELINE-EXT PARQUET|1989-01-05 05:30:00|599961600000|
|FROM BEELINE-EXT ORC    |1989-01-05 00:00:00|599941800000|
|FROM BEELINE-EXT AVRO   |1989-01-05 05:30:00|599961600000|
|FROM BEELINE-EXT TEXT   |1989-01-05 00:00:00|599941800000|
+------------------------+-------------------+------------+
{code}

> TimeZone inconsistencies when JVM and session timezones are different
> ---------------------------------------------------------------------
>
> Key: SPARK-34675
> URL: https://issues.apache.org/jira/browse/SPARK-34675
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.7
> Reporter: Shubham Chaurasia
> Priority: Major
> Fix For: 3.2.0
>
> Inserted the following data with UTC as both the JVM and session timezone.
> Spark-shell launch command:
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
> Table creation:
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
> {code}
> Used the following function to check and verify the returned timestamps:
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> def showTs(db: String, tables: String*): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
>     val query = "select * from " + tbl
>     println("Executing - " + query)
>     results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
>     unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
>   val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
>   renamed.show(false)
>   renamed
> }
>
> // Exiting paste mode, now interpreting.
>
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
> ++
> ||
> ++
> ++
>
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - UTC
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
> 1. Set session timezone to America/Los_Angeles
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> ++
> ||
> ++
> ++
>
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - America/Los_Angeles
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
> 2. Started the shell (JVM) in the America/Los_Angeles timezone (which also sets the session timezone to America/Los_Angeles)
> {code:scala}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
>
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = 10ff355c-318d-4cb8-870f-a388652133b1
> ++
> ||
> ++
> ++
>
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - America/Los_Angeles
> TimeZone.getDefault - America/Los_Angeles
> spark.sql.session.timeZone - America/Los_Angeles
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
> +----------------------+-------------------+------------+
> {code}
> As we can see in 1 and 2, Parquet and the other formats behave differently.
> In 1, {{1989-01-04 17:02:03|599965323000}} is correct for the {{America/Los_Angeles}} timezone, since the original value inserted in UTC was {{1989-01-05 01:02:03}}, which equals {{599965323000}} as a UTC instant.
> In 2, only Parquet appears correct, while the other formats produce {{1989-01-05 01:02:03|599994123000}}, which should not be the case when a different timezone ({{America/Los_Angeles}}) is in use. I think these values come from the individual file format readers (Avro, ORC, text), which do not then convert the value into the {{America/Los_Angeles}} timezone. I saw the ORC reader adjusting the offset according to the writer and reader (JVM) timezones (Avro probably does the same), but the end value in Spark does not follow {{spark.sql.session.timeZone}}.
> Are there any guidelines/docs on how to use timezones with Spark?
> cc [~cloud_fan] [~hyukjin.kwon] [~dongjoon]
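The millis in 1 and 2 can be reproduced with plain java.time arithmetic (a standalone sketch, no Spark involved; the class and method names are mine): interpreting the inserted wall-clock 1989-01-05 01:02:03 in UTC yields the correct instant, while re-interpreting the same wall-clock in America/Los_Angeles yields the shifted value that the non-Parquet formats return in 2.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class RebaseCheck {
    // The wall-clock value inserted into the tables
    static final LocalDateTime WALL_CLOCK = LocalDateTime.of(1989, 1, 5, 1, 2, 3);

    // Epoch millis when that wall-clock is interpreted in a given zone
    static long millisIn(ZoneId zone) {
        return WALL_CLOCK.atZone(zone).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long inserted = millisIn(ZoneOffset.UTC);                  // written with JVM + session = UTC
        long rebased = millisIn(ZoneId.of("America/Los_Angeles")); // what ORC/Avro/text return in 2
        System.out.println("inserted: " + inserted); // 599965323000
        System.out.println("rebased:  " + rebased);  // 599994123000 (8 hours later)
    }
}
```

The 8-hour difference is the America/Los_Angeles offset at that date, consistent with the readers rebasing the wall-clock into the JVM timezone rather than converting the instant via the session timezone.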