Yin Huai created SPARK-10519:
--------------------------------

             Summary: Investigate if we should encode timezone information to a 
timestamp value stored in JSON
                 Key: SPARK-10519
                 URL: https://issues.apache.org/jira/browse/SPARK-10519
             Project: Spark
          Issue Type: Task
          Components: SQL
            Reporter: Yin Huai
            Priority: Minor


Since Spark 1.3, we have stored timestamps in JSON without encoding timezone 
information, so the string representation of a timestamp stored in JSON 
implicitly uses the local timezone (see 
[1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454],
[2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38],
[3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41],
[4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]).
As a result, data consumers may get different values when they are in a 
different timezone from the data producers.
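
To make the problem concrete, here is a minimal Java sketch (not Spark code) of 
a producer and a consumer in different timezones. The class and method names 
are hypothetical, and the explicit zone argument stands in for the JVM default 
timezone that the writer and reader would pick up implicitly.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical illustration only; this is not how Spark's JSON writer is structured.
class LocalZoneRendering {
    private static SimpleDateFormat formatFor(String zoneId) {
        // Same pattern on both sides, with no timezone field in the output.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt;
    }

    // What a producer whose local timezone is zoneId would write for this instant.
    static String render(long epochMillis, String zoneId) {
        return formatFor(zoneId).format(new Date(epochMillis));
    }

    // What a consumer whose local timezone is zoneId would read back from the string.
    static long parse(String text, String zoneId) {
        try {
            return formatFor(zoneId).parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + text, e);
        }
    }

    public static void main(String[] args) {
        long instant = 0L; // 1970-01-01T00:00:00Z
        String written = render(instant, "UTC");
        long readBack = parse(written, "America/Los_Angeles");
        System.out.println(written);            // 1970-01-01 00:00:00
        System.out.println(readBack - instant); // 28800000 (an 8 hour shift)
    }
}
```

The string itself round-trips unchanged, but the instant it denotes shifts by 
the offset between the two zones.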

Since JSON is string based, if we encode timezone information in the timestamp 
value, downstream applications may need to change their code (for example, 
java.sql.Timestamp.valueOf only supports the format {{yyyy-\[m]m-\[d]d 
hh:mm:ss\[.f...]}}).
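
The compatibility concern can be checked with a small Java sketch. The helper 
name {{parsesWithValueOf}} and the sample strings are made up for illustration; 
the zone-suffixed strings are only examples of what a timezone-aware encoding 
might look like, not a proposed format.

```java
import java.sql.Timestamp;

// Hypothetical helper: reports whether Timestamp.valueOf accepts a string.
class ValueOfFormatCheck {
    static boolean parsesWithValueOf(String s) {
        try {
            Timestamp.valueOf(s);
            return true;
        } catch (IllegalArgumentException e) {
            // valueOf rejects anything outside yyyy-[m]m-[d]d hh:mm:ss[.f...]
            return false;
        }
    }

    public static void main(String[] args) {
        // Current zone-less style: accepted.
        System.out.println(parsesWithValueOf("2015-09-09 12:34:56"));       // true
        System.out.println(parsesWithValueOf("2015-09-09 12:34:56.789"));   // true
        // Example zone-suffixed styles: rejected.
        System.out.println(parsesWithValueOf("2015-09-09 12:34:56Z"));      // false
        System.out.println(parsesWithValueOf("2015-09-09T12:34:56+00:00")); // false
    }
}
```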

We should investigate what we should do about this issue. Right now, I can 
think of three options:

1. Encode timezone info in the timestamp value, which can break user code and 
may change the semantics of timestamps (our timestamp value is timezone-less).
2. When saving a timestamp value to JSON, treat it as a value in the local 
timezone and convert it to UTC. Then, when saving the data, do not encode 
timezone info in the value.
3. Do not change the current behavior, but explicitly state in our docs that 
users need to use a single timezone for their datasets (e.g. always use UTC).
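
As a rough sketch of what option 2 could look like, the Java snippet below 
interprets a zone-less value in the writer's timezone, shifts it to UTC, and 
emits it in the same zone-less format. The class and method names are 
hypothetical and the writer zone is passed explicitly for clarity; a real 
change would live in the JSON generator and would also need a matching 
conversion on the read path.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of option 2; not Spark code.
class UtcNormalizingWriter {
    // Same zone-less layout that consumers already parse today.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Treats localValue as wall-clock time in writerZone and returns the
    // equivalent UTC wall-clock time, still without any zone suffix.
    static String toUtcString(String localValue, ZoneId writerZone) {
        LocalDateTime local = LocalDateTime.parse(localValue, FORMAT);
        return local.atZone(writerZone)
                    .withZoneSameInstant(ZoneOffset.UTC)
                    .format(FORMAT);
    }

    public static void main(String[] args) {
        // Two producers writing their own local noon on the same date.
        System.out.println(toUtcString("2015-09-09 12:00:00", ZoneId.of("America/Los_Angeles")));
        System.out.println(toUtcString("2015-09-09 12:00:00", ZoneId.of("Asia/Tokyo")));
    }
}
```

The output format stays compatible with java.sql.Timestamp.valueOf, but 
consumers must then know that the stored values are in UTC rather than in 
their own local timezone.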

