[ https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100434#comment-16100434 ]

Bryan Cutler edited comment on SPARK-21375 at 7/25/17 5:49 PM:
---------------------------------------------------------------

Thanks for the details [~wesmckinn].  The approach that Arrow uses makes sense 
to me, but as far as I know there is no way for Spark to create time zone naive 
timestamps; please correct me if I'm wrong [~cloud_fan] [~ueshin].  When 
creating a {{Dataset}} with a {{TimestampType}} that does not specify a time 
zone, Spark will always assume it is in {{DateTimeUtils.defaultTimeZone()}}, 
which corresponds to the system time zone.  In the PR for this we are 
discussing which time zone to use for the Arrow data; the options are below, 
with a rough sketch of the difference after the list.
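
For reference, a minimal PySpark sketch of the current behavior (assuming a 
plain local session; the stored value depends on the machine's system time 
zone, so this is an illustration, not the converter code itself):

{code:python}
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A time zone naive Python datetime -- no tzinfo attached.
naive = datetime(2017, 7, 25, 10, 49, 0)

# Spark converts it to micros since the epoch by interpreting the naive value
# in DateTimeUtils.defaultTimeZone(), i.e. the system time zone, so the stored
# instant changes if the same code runs on a machine in a different zone.
schema = StructType([StructField("ts", TimestampType())])
df = spark.createDataFrame([(naive,)], schema)
df.show(truncate=False)
{code}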

1. Force "UTC"
Spark SQL internally stores a timestamp value as the number of microseconds 
since 1970-01-01 00:00:00.0 UTC.

2. {{SQLConf.SESSION_LOCAL_TIMEZONE}}
Spark SQL uses this time zone to represent timestamps and to evaluate 
timezone-related operations. If the config value is not set, it falls back to 
{{DateTimeUtils.defaultTimeZone()}}.

3. {{DateTimeUtils.defaultTimeZone()}}
The system default time zone.
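
To make the difference concrete, a rough sketch in terms of the Arrow 
timestamp type each option would produce for the same internal micros values 
(pyarrow is shown only for illustration; the actual converters run on the JVM 
side, and the zone names here are hypothetical):

{code:python}
import pyarrow as pa

# The internal value is always micros since 1970-01-01 00:00:00.0 UTC; the
# three options only change which time zone the Arrow timestamp field carries.

# 1. Force "UTC"
t_forced_utc = pa.timestamp("us", tz="UTC")

# 2. SQLConf.SESSION_LOCAL_TIMEZONE (the session value shown is hypothetical)
session_tz = "America/Los_Angeles"
t_session = pa.timestamp("us", tz=session_tz)

# 3. DateTimeUtils.defaultTimeZone() -- whatever the JVM reports as the system
#    default on that machine (also hypothetical here)
t_system = pa.timestamp("us", tz="America/New_York")
{code}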



> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-21375
>                 URL: https://issues.apache.org/jira/browse/SPARK-21375
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using 
> ArrowConverters.  These are common types for data analysis, used in both 
> Spark and Pandas, and should be supported.
> There is a discrepancy in the way that PySpark and Arrow internally store 
> timestamps that have no time zone specified: PySpark presents a UTC timestamp 
> adjusted to local time, while Arrow keeps it in UTC.  Hopefully there is a 
> clean way to resolve this.
> Spark internal storage spec (a short sketch follows below):
> * *DateType* stored as days since the epoch
> * *TimestampType* stored as microseconds since the epoch (UTC)
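
A small plain-Python sketch of the storage spec quoted above and the 
discrepancy it describes (the dates and values are arbitrary examples):

{code:python}
from datetime import date, datetime, timezone

# DateType: days since 1970-01-01
days = (date(2017, 7, 25) - date(1970, 1, 1)).days        # 17372

# TimestampType: microseconds since 1970-01-01 00:00:00.0 UTC
ts = datetime(2017, 7, 25, 17, 49, tzinfo=timezone.utc)
micros = int(ts.timestamp()) * 1000000

# The discrepancy: for a timestamp with no time zone, PySpark renders the value
# adjusted to local time while Arrow keeps it as UTC, so the same micros value
# can show two different wall-clock times on the Python side.
{code}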


