[ 
https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100371#comment-16100371
 ] 

Wes McKinney edited comment on SPARK-21375 at 7/25/17 5:10 PM:
---------------------------------------------------------------

What is the summary of how you're handling the time zone issue? The way that 
Spark handles timestamps without time zones seems problematic to me. Is there a 
way to configure your Spark system to force UTC locale? Otherwise the same code 
could yield different answers in different locales on the same input data. 

The way that Arrow handles this is by disallowing the system locale and instead 
providing time zone-naive and time zone-aware timestamps (pandas does the same 
thing with its data):

* Time zone-naive timestamps, where timezone = null in the Arrow metadata. The 
time components (day, hour, minute, etc.) are computed without considering the 
system locale, so it's as though the locale is UTC.

* Time zone-aware timestamps: the physical representation is internally 
normalized to UTC, and time zone changes do not alter the underlying int64 
timestamp values. So changing the time zone is a metadata-only conversion.
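A minimal sketch of the naive/aware distinction above, using only Python's 
stdlib datetime as an analogy to the Arrow model (the example values and zone 
offset are illustrative, not from Spark or Arrow):

```python
from datetime import datetime, timezone, timedelta

# Time zone-naive: no tzinfo attached; the time components are taken
# at face value, independent of the machine's locale.
naive = datetime(2017, 7, 25, 17, 10)
print(naive.tzinfo)  # None

# Time zone-aware: the instant is fixed and normalized; the zone is
# just presentation metadata.
utc = datetime(2017, 7, 25, 17, 10, tzinfo=timezone.utc)
eastern = utc.astimezone(timezone(timedelta(hours=-4), "EDT"))

# Changing the zone is a metadata-only conversion: the underlying
# epoch value is unchanged, only the displayed wall time moves.
print(utc == eastern)                          # True
print(utc.timestamp() == eastern.timestamp())  # True
print(eastern.hour)                            # 13, not 17
```

The same property holds for Arrow's tz-aware timestamps: casting between zones 
rewrites only the metadata, never the stored int64 values.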



> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-21375
>                 URL: https://issues.apache.org/jira/browse/SPARK-21375
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using 
> ArrowConverters.  These are common types for data analysis used in both Spark 
> and Pandas and should be supported.
> There is a discrepancy in the way that PySpark and Arrow internally store 
> timestamps without a timezone specified.  PySpark takes a UTC timestamp that 
> is adjusted to local time, while Arrow keeps it in UTC.  Hopefully there is a 
> clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days since the Unix epoch
> * *TimestampType* stored as microseconds since the Unix epoch
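The storage spec quoted above can be sketched with stdlib Python (the epoch 
reference and variable names here are assumptions for illustration, not taken 
from the Spark source):

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)

# DateType: physical value is days since the Unix epoch.
d = date(2017, 7, 25)
days = (d - EPOCH_DATE).days
print(days)  # 17372

# TimestampType: physical value is microseconds since the Unix epoch.
# Storing the instant in UTC (the Arrow convention) keeps this int64
# value independent of the machine's locale.
ts = datetime(2017, 7, 25, 17, 10, tzinfo=timezone.utc)
micros = int(ts.timestamp()) * 1_000_000
print(micros)  # 1501002600000000
```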



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
