[ https://issues.apache.org/jira/browse/SPARK-34675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298962#comment-17298962 ]

Dongjoon Hyun commented on SPARK-34675:
---------------------------------------

Thank you for the verification, [~ShubhamChaurasia]. The result difference between 
data sources seems to be the real problem here because we cannot change Spark 
2.4. BTW, the Apache Spark community is preparing Apache Spark 2.4.8 as an official 
EOL release this month (Release Manager: [~viirya]).

> TimeZone inconsistencies when JVM and session timezones are different
> ---------------------------------------------------------------------
>
>                 Key: SPARK-34675
>                 URL: https://issues.apache.org/jira/browse/SPARK-34675
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.7
>            Reporter: Shubham Chaurasia
>            Priority: Major
>
> Inserted the following data with UTC as both the JVM and the session timezone.
> Spark-shell launch command:
> {code}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive \
>   --conf spark.sql.catalogImplementation=hive \
>   --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
>   --conf spark.driver.extraJavaOptions='-Duser.timezone=UTC' \
>   --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
> {code}
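> A quick sanity check of the effective timezones (when {{spark.sql.session.timeZone}} 
> is not set explicitly, it falls back to the JVM default, as step 2 below also shows):
> {code:scala}
> java.util.TimeZone.getDefault.getID            // UTC (from -Duser.timezone)
> spark.conf.get("spark.sql.session.timeZone")   // UTC (inherited from the JVM default)
> {code}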
> Table creation and inserts:
> {code:scala}
> sql("use ts").show
> sql("create table spark_parquet(type string, t timestamp) stored as 
> parquet").show
> sql("create table spark_orc(type string, t timestamp) stored as orc").show
> sql("create table spark_avro(type string, t timestamp) stored as avro").show
> sql("create table spark_text(type string, t timestamp) stored as 
> textfile").show
> sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 
> 01:02:03')").show
> sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 
> 01:02:03')").show
> {code}
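> As a reference point, the inserted wall-clock literal interpreted in UTC corresponds 
> to the epoch millis shown in the outputs below; a plain java.time check (not Spark code):
> {code:scala}
> import java.time.{LocalDateTime, ZoneOffset}
> // "1989-01-05 01:02:03" interpreted as UTC:
> LocalDateTime.parse("1989-01-05T01:02:03").toEpochSecond(ZoneOffset.UTC) * 1000L
> // res0: Long = 599965323000
> {code}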
> Used the following function to check and verify the returned timestamps:
> {code:scala}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> def showTs(
>     db: String,
>     tables: String*
> ): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
>   sql("use " + db).show
>   import scala.collection.mutable.ListBuffer
>   var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
>   for (tbl <- tables) {
>     val query = "select * from " + tbl
>     println("Executing - " + query);
>     results += sql(query)
>   }
>   println("user.timezone - " + System.getProperty("user.timezone"))
>   println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
>   println("spark.sql.session.timeZone - " + 
> spark.conf.get("spark.sql.session.timeZone"))
>   var unionDf = results(0)
>   for (i <- 1 until results.length) {
>     unionDf = unionDf.unionAll(results(i))
>   }
>   val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), 
> r.getTimestamp(1).getTime))
>   val renamed = augmented.withColumnRenamed("_1", 
> "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
> renamed.show(false)
>   return renamed
> }
> // Exiting paste mode, now interpreting.
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - UTC
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
> 1. Set the session timezone to {{America/Los_Angeles}}:
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - UTC
> TimeZone.getDefault - UTC
> spark.sql.session.timeZone - America/Los_Angeles
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
> +----------------------+-------------------+------------+
> {code}
> 2. Started the shell (JVM) in the {{America/Los_Angeles}} timezone, which also 
> sets the session timezone to {{America/Los_Angeles}}:
> {code:scala}
> bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive \
>   --conf spark.sql.catalogImplementation=hive \
>   --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 \
>   --conf spark.driver.extraJavaOptions='-Duser.timezone=America/Los_Angeles' \
>   --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
> scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
> Hive Session ID = 10ff355c-318d-4cb8-870f-a388652133b1
> ++
> ||
> ++
> ++
> Executing - select * from spark_parquet
> Executing - select * from spark_orc
> Executing - select * from spark_avro
> Executing - select * from spark_text
> user.timezone - America/Los_Angeles
> TimeZone.getDefault - America/Los_Angeles
> spark.sql.session.timeZone - America/Los_Angeles
> +----------------------+-------------------+------------+
> |type                  |ts                 |millis      |
> +----------------------+-------------------+------------+
> |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
> |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
> |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
> |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
> +----------------------+-------------------+------------+
> {code}
> As cases 1 and 2 show, Parquet and the other formats behave differently.
> In 1, {{1989-01-04 17:02:03|599965323000}} seems correct for the 
> {{America/Los_Angeles}} session timezone: the value originally inserted under 
> UTC was {{1989-01-05 01:02:03}}, which is {{599965323000}} ms since the epoch.
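> Rendering that same instant in {{America/Los_Angeles}} gives exactly the 
> displayed wall clock (another plain java.time check):
> {code:scala}
> import java.time.{Instant, ZoneId}
> Instant.ofEpochMilli(599965323000L).atZone(ZoneId.of("America/Los_Angeles"))
> // res0: 1989-01-04T17:02:03-08:00[America/Los_Angeles]
> {code}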
> In 2, only Parquet seems correct, while the other formats return 
> {{1989-01-05 01:02:03|599994123000}}, which should not happen when a different 
> timezone ({{America/Los_Angeles}}) is in effect. These values appear to come 
> from the individual file-format readers (Avro, ORC, text), which then do not 
> convert them to the {{America/Los_Angeles}} timezone. I saw the ORC reader 
> adjusting the offset according to the writer and reader (JVM) timezones 
> (Avro probably does the same), but the end value Spark returns does not honor 
> {{spark.sql.session.timeZone}}.
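> The 8-hour gap in the millis supports this: {{599994123000 - 599965323000 = 28800000}} 
> ms, exactly the PST offset from UTC on that date. In other words, the non-Parquet 
> values in 2 equal the wall clock {{1989-01-05 01:02:03}} re-interpreted in the 
> JVM timezone:
> {code:scala}
> import java.time.{LocalDateTime, ZoneId}
> // "1989-01-05 01:02:03" taken as America/Los_Angeles local time:
> LocalDateTime.parse("1989-01-05T01:02:03").atZone(ZoneId.of("America/Los_Angeles")).toInstant.toEpochMilli
> // res0: Long = 599994123000
> {code}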
> Are there any guidelines/docs on how to use timezones with Spark?
> cc [~cloud_fan] [~hyukjin.kwon] [~dongjoon]


