GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/16750

    [SPARK-18937][SQL] Timezone support in CSV/JSON parsing

    ## What changes were proposed in this pull request?
    
    This is a follow-up PR to #16308.
    
    This PR enables timezone support in CSV/JSON parsing.
    
    It introduces a `timeZone` option for the CSV/JSON datasources. The
    default value of the option is the session-local timezone.
    
    The datasources use the `timeZone` option to format timestamp values on
    write and to parse them on read.
    Note that on read, if `timestampFormat` includes timezone info, the
    `timeZone` option is not used, because the timezone in the values
    themselves should be respected.
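    
    The precedence rule above can be sketched outside Spark with plain
    `java.time`. This is only an illustration of the intended semantics, not
    the code in this PR; the helper name `parseTimestamp` and its fallback
    parameter are hypothetical.
    
    ```scala
    import java.time.{Instant, LocalDateTime, OffsetDateTime, ZoneId}
    import scala.util.Try
    
    // Hypothetical helper, not Spark's parser: the fallback zone is used only
    // when the text carries no offset of its own.
    def parseTimestamp(text: String, fallbackZone: ZoneId): Instant =
      Try(OffsetDateTime.parse(text).toInstant)
        .getOrElse(LocalDateTime.parse(text).atZone(fallbackZone).toInstant)
    
    val tokyo = ZoneId.of("Asia/Tokyo")
    val gmt = ZoneId.of("GMT")
    
    // The offset in the value wins, so both calls yield the same instant.
    val viaTokyo = parseTimestamp("2015-12-31T16:00:00.000-08:00", tokyo)
    val viaGmt = parseTimestamp("2015-12-31T16:00:00.000-08:00", gmt)
    println(viaTokyo == viaGmt) // true
    ```
    
    Only a value without an offset, such as `"2016-01-01T09:00:00"`, would
    depend on the fallback zone in this sketch.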
    
    For example, given the timestamp `"2016-01-01 00:00:00"` in `GMT`, the
    values written with the default `timeZone` option (which is `"GMT"` here,
    because the session-local timezone is `"GMT"`) are:
    
    ```scala
    scala> spark.conf.set("spark.sql.session.timeZone", "GMT")
    
    scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
    df: org.apache.spark.sql.DataFrame = [ts: timestamp]
    
    scala> df.show()
    +-------------------+
    |ts                 |
    +-------------------+
    |2016-01-01 00:00:00|
    +-------------------+
    
    
    scala> df.write.json("/path/to/gmtjson")
    ```
    
    ```sh
    $ cat /path/to/gmtjson/part-*
    {"ts":"2016-01-01T00:00:00.000Z"}
    ```
    
    whereas with the option set to `"PST"`, they are:
    
    ```scala
    scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
    ```
    
    ```sh
    $ cat /path/to/pstjson/part-*
    {"ts":"2015-12-31T16:00:00.000-08:00"}
    ```
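    
    The two outputs above are the same instant rendered in different zones.
    A minimal sketch with `java.time` reproduces both strings. It assumes the
    pattern `yyyy-MM-dd'T'HH:mm:ss.SSSXXX` as a stand-in for the default
    `timestampFormat` and spells `PST` as `America/Los_Angeles`, since
    `ZoneId.of` does not accept the short ID directly. The helper `render` is
    hypothetical and is not part of Spark.
    
    ```scala
    import java.time.{Instant, ZoneId}
    import java.time.format.DateTimeFormatter
    
    // Same epoch value as the DataFrame example: 2016-01-01 00:00:00 GMT.
    val instant = Instant.ofEpochMilli(1451606400000L)
    
    // Assumed stand-in for the default timestampFormat; XXX prints the offset.
    val pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
    
    // Hypothetical helper: render the instant as seen from the given zone.
    def render(zone: String): String =
      pattern.withZone(ZoneId.of(zone)).format(instant)
    
    val inGmt = render("GMT")
    val inPst = render("America/Los_Angeles")
    println(inGmt) // 2016-01-01T00:00:00.000Z
    println(inPst) // 2015-12-31T16:00:00.000-08:00
    ```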
    
    These files are read correctly even if the `timeZone` option is wrong,
    because the timestamp values carry timezone info:
    
    ```scala
    scala> import org.apache.spark.sql.types._
    import org.apache.spark.sql.types._
    
    scala> val schema = new StructType().add("ts", TimestampType)
    schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))
    
    scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
    +-------------------+
    |ts                 |
    +-------------------+
    |2016-01-01 00:00:00|
    +-------------------+
    
    scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
    +-------------------+
    |ts                 |
    +-------------------+
    |2016-01-01 00:00:00|
    +-------------------+
    ```
    
    And even if `timestampFormat` doesn't contain timezone info, the values
    are read correctly when the correct `timeZone` option is set:
    
    ```scala
    scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
    ```
    
    ```sh
    $ cat /path/to/jstjson/part-*
    {"ts":"2016-01-01T09:00:00"}
    ```
    
    ```scala
    // wrong result
    scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
    +-------------------+
    |ts                 |
    +-------------------+
    |2016-01-01 09:00:00|
    +-------------------+
    
    // correct result
    scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
    +-------------------+
    |ts                 |
    +-------------------+
    |2016-01-01 00:00:00|
    +-------------------+
    ```
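    
    The wrong and correct results above follow from the offset-less format: the
    written text no longer identifies an instant on its own. A rough sketch of
    the same round trip with `java.time`, assuming `Asia/Tokyo` for `JST`; the
    helper `read` is hypothetical and only mimics the behavior described here.
    
    ```scala
    import java.time.{Instant, LocalDateTime, ZoneId}
    import java.time.format.DateTimeFormatter
    
    // Same pattern as the timestampFormat option above: no offset field.
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
    val tokyo = ZoneId.of("Asia/Tokyo")
    val gmt = ZoneId.of("GMT")
    
    val original = Instant.ofEpochMilli(1451606400000L)
    
    // Writing in JST drops the offset and keeps only the local wall-clock time.
    val written = fmt.withZone(tokyo).format(original)
    
    // Hypothetical reader: interpret the offset-less text in the given zone.
    def read(text: String, zone: ZoneId): Instant =
      LocalDateTime.parse(text, fmt).atZone(zone).toInstant
    
    val wrong = read(written, gmt)   // nine hours later than the original
    val right = read(written, tokyo) // the original instant
    println(written) // 2016-01-01T09:00:00
    ```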
    
    This PR also makes `JsonToStruct` and `StructToJson`
    `TimeZoneAwareExpression`s, so that they can evaluate values using the
    `timeZone` option.
    
    ## How was this patch tested?
    
    Existing tests, plus some newly added tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-18937

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16750
    
----
commit aa052f4d11929192b749752f4b73772664d0460c
Author: Takuya UESHIN <ues...@happy-camper.st>
Date:   2017-01-05T09:29:42Z

    Add timeZone option to JSONOptions.

commit 890879e24b3f63509a000585e18b288961a4e5cf
Author: Takuya UESHIN <ues...@happy-camper.st>
Date:   2017-01-06T05:11:41Z

    Apply timeZone option to JSON datasources.

commit f08b78c16ac444550e7ea0857d0275b9a91b7561
Author: Takuya UESHIN <ues...@happy-camper.st>
Date:   2017-01-06T06:03:34Z

    Apply timeZone option to CSV datasources.

commit 551cff99785927be3ef68c4393dca4dabb3c2ba0
Author: Takuya UESHIN <ues...@happy-camper.st>
Date:   2017-01-06T08:39:26Z

    Modify python files.

----

