[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...
Github user javierluraschi commented on the issue: https://github.com/apache/spark/pull/22913

Yes, I believe that's the case. The change in https://github.com/apache/arrow/pull/2887 maps `POSIXct` in the R bindings to `timestamp` instead of `date`, so it looks like there is no need for this one. I'll close it and send a new PR if we find out something else is missing. Thanks everyone for your feedback; while this did not end up being a change in Spark, it helped the Arrow project. Best.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user javierluraschi closed the pull request at: https://github.com/apache/spark/pull/22913
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user javierluraschi commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230833894

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala

```diff
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```

End diff

Right... let me keep just the original PR commit for now. So yes, we could map to date, but we would lose the time component, so the best mapping we have is to `timestamp`. Let me check back in the `arrow` project as to why `POSIXct` is being mapped to `Date` rather than `Timestamp` (see [arrow/r/src/array.cpp#L461](https://github.com/apache/arrow/blob/3a1dd3feb9ca09f92168f46fcfc06c01305df3ec/r/src/array.cpp#L461)); if we can change that, then I would agree we don't need this change.
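[Editor's note: a minimal sketch of why mapping a millisecond-resolution value to a day-resolution date "would lose time". This is plain Scala with an arbitrary example instant; it does not use the real Arrow or Spark APIs.]

```scala
// Truncating a millisecond epoch value to whole days (what a DAY-unit
// date keeps) discards the time of day, so the value cannot round-trip.
// The instant below is an arbitrary illustrative value.
val millisPerDay = 86400000L
val instantMillis = 1541030400000L + 13 * 3600000L + 123L // midnight + 13:00:00.123

val days = instantMillis / millisPerDay  // all a day-resolution date retains
val restored = days * millisPerDay       // best reconstruction from that date

assert(restored != instantMillis)        // the 13:00:00.123 portion is lost
```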
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
Github user javierluraschi commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230607581

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala

```diff
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```

End diff

Good catch, thanks. Yes, this should be mapped to `Date` in the Arrow schema, not `Timestamp`. To give more background, Arrow dates can have a unit of either `DateUnit.DAY` or `DateUnit.MILLISECOND` (see [arrow/vector/types/DateUnit.java#L21-L22](https://github.com/apache/arrow/blob/73d379f4631cd3013371f60876a52615171e6c3b/java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java#L21-L22)). Currently, if a date with milliseconds is passed, this simply fails; therefore, this change does not affect other type conversions, and it is fine to map all Arrow dates to Spark dates since all cases are now properly handled.
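[Editor's note: the dispatch the diff extends can be sketched with simplified stand-in types. `ArrowDate` and the string results below are hypothetical placeholders, not the real `org.apache.arrow` or Spark classes; this only models the pattern-match shape.]

```scala
// Simplified model of the fromArrowType pattern match: DAY-unit dates
// already had a mapping; the PR adds a case for MILLISECOND-unit dates,
// which previously fell through to the "Unsupported data type" error.
object DateUnit extends Enumeration { val DAY, MILLISECOND = Value }

sealed trait ArrowType
case class ArrowDate(unit: DateUnit.Value) extends ArrowType

def fromArrowType(dt: ArrowType): String = dt match {
  case ArrowDate(DateUnit.DAY)         => "DateType"
  // The case added by this PR:
  case ArrowDate(DateUnit.MILLISECOND) => "TimestampType"
  case other =>
    throw new UnsupportedOperationException(s"Unsupported data type: $other")
}
```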
[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...
GitHub user javierluraschi opened a pull request: https://github.com/apache/spark/pull/22913

[SPARK-25902][SQL] Add support for dates with milliseconds in Apache Arrow bindings

Currently, the Apache Arrow bindings for Java only support `Date` with the unit set to `DateUnit.DAY`, see [ArrowUtils.scala#L72](https://github.com/apache/spark/blob/8c2edf46d0f89e5ec54968218d89f30a3f8190bc/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala#L72). However, the Spark Arrow bindings for `R` are adding support to map `POSIXct` to `Date` using `DateUnit.MILLISECOND`, see https://github.com/rstudio/sparklyr/issues/1733.

This PR does not change existing functionality; rather, it adds support for the `DateUnit.MILLISECOND` unit in `Date` to avoid hitting a `java.lang.UnsupportedOperationException: Unsupported data type: Date(MILLISECOND)` exception while loading an Arrow stream into Spark.

This patch was tested with https://github.com/rstudio/sparklyr/issues/1733, by building Spark from source and then:

```r
devtools::install_github("apache/arrow", subdir = "r")
devtools::install_github("rstudio/sparklyr", ref = "feature/arrow")

Sys.setenv("SPARK_HOME_VERSION" = "2.3.2")

library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local", spark_home = "")

dates <- data.frame(dates = c(
  as.POSIXlt(Sys.time(), "GMT"),
  as.POSIXlt(Sys.time(), "EST")
))

dates_tbl <- sdf_copy_to(sc, dates, overwrite = T)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/javierluraschi/spark arrow-date-millis

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22913.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22913

commit f809942de6c241cc7b499c19d0250185ebe26122
Author: Javier Luraschi
Date: 2018-10-31T18:44:50Z

    add support for date milliseconds in arrow bindings