[ https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580275#comment-16580275 ]
Eric Wohlstadter commented on SPARK-21375:
------------------------------------------

[~bryanc] Hi Bryan,

I'm using the Spark-Arrow conversion support inside a DataSourceV2 {{SupportsColumnBatchScan}} DataReader. It uses {{ArrowStreamReader}} to read from the external data source and converts the input from the stream to Spark's {{ArrowColumnVector}}.

I'm having trouble when the original input comes from a Hive Timestamp (without time zone). It looks like {{ArrowColumnVector}} requires {{TimeStampMicroTZVector}}, so I need to fill in a time zone when creating the {{TimeStampMicroTZVector}} on the writer side of the Arrow stream. This creates an inconsistency when the two ends of the Arrow stream are in different time zones.

I'm wondering if I might be missing some other way of handling this correctly. Would you happen to know a better way to handle conversion of Timestamp (without time zone) using the Spark-Arrow conversion support?

/cc [~dongjoon] [~hyukjin.kwon]

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-21375
>                 URL: https://issues.apache.org/jira/browse/SPARK-21375
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>             Fix For: 2.3.0
>
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using
> ArrowConverters. These are common types for data analysis used in both Spark
> and Pandas and should be supported.
> There is a discrepancy with the way that PySpark and Arrow store timestamps,
> without timezone specified, internally. PySpark takes a UTC timestamp that
> is adjusted to local time and Arrow is in UTC time. Hopefully there is a
> clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds
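
To make the time zone concern concrete, here is a minimal sketch in plain Python (standard library only, no Spark or Arrow calls). It encodes a date as days and a zone-less wall-clock timestamp as microseconds since the epoch, following the storage layout quoted above. The helper names, the two fixed-offset zones, and the sample value are all hypothetical. Pinning both ends to UTC is shown only as one possible convention, not as what Spark or Arrow actually do and not as a fix confirmed in this thread.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_days(d: date) -> int:
    """Days since 1970-01-01 (the DateType layout in the quoted spec)."""
    return (d - date(1970, 1, 1)).days


def wall_clock_to_micros(naive: datetime, zone: timezone) -> int:
    """Interpret a zone-less wall-clock value in `zone` and return
    microseconds since the UTC epoch (hypothetical helper)."""
    delta = naive.replace(tzinfo=zone) - EPOCH
    return delta // timedelta(microseconds=1)


def micros_to_wall_clock(micros: int, zone: timezone) -> datetime:
    """Decode microseconds since the UTC epoch back to a zone-less
    wall-clock value as seen in `zone` (hypothetical helper)."""
    instant = EPOCH + timedelta(microseconds=micros)
    return instant.astimezone(zone).replace(tzinfo=None)


# Assumed zones for the two ends of the stream (fixed offsets, no DST).
WRITER_ZONE = timezone(timedelta(hours=-8))
READER_ZONE = timezone(timedelta(hours=1))

# A timestamp without time zone, as it might come out of Hive.
hive_value = datetime(2018, 8, 14, 12, 0, 0)

# Each end uses its own local zone: the wall-clock value shifts in transit.
micros_local = wall_clock_to_micros(hive_value, WRITER_ZONE)
shifted = micros_to_wall_clock(micros_local, READER_ZONE)

# Both ends agree on UTC: the wall-clock value survives the round trip.
micros_utc = wall_clock_to_micros(hive_value, timezone.utc)
round_tripped = micros_to_wall_clock(micros_utc, timezone.utc)

print("shifted by:", shifted - hive_value)
print("round trip intact:", round_tripped == hive_value)
```

With these assumed offsets the mismatch equals the difference between the two zones, which is the inconsistency described in the comment. Whether a fixed zone can be agreed on at both ends depends on how the reader-side conversion treats the zone attached to the vector, which this sketch does not model.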