[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with milliseconds in Apache Arrow bindings
Github user javierluraschi closed the pull request at: https://github.com/apache/spark/pull/22913
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230953015

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
```
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```
--- End diff --

I think it should map to `Date`, and any extra time would be truncated. Looking at the Arrow format from https://github.com/apache/arrow/blob/master/format/Schema.fbs:

```
/// Date is either a 32-bit or 64-bit type representing elapsed time since UNIX
/// epoch (1970-01-01), stored in either of two units:
///
/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
///   leap seconds), where the values are evenly divisible by 86400000
/// * Days (32 bits) since the UNIX epoch
```

So it's expected to be a specific number of days without any additional milliseconds.
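For concreteness, a minimal sketch of the truncation described above (plain Scala; the object and helper names are illustrative, not Spark's code):

```scala
object DateMillisTruncation {
  private val MillisPerDay = 86400000L // 24 * 60 * 60 * 1000

  // Truncate epoch milliseconds to whole days since 1970-01-01.
  // floorDiv keeps pre-epoch (negative) values on the correct calendar day.
  def millisToDays(epochMillis: Long): Int =
    Math.floorDiv(epochMillis, MillisPerDay).toInt

  def main(args: Array[String]): Unit = {
    // 1970-01-02T06:00:00Z: one full day plus six hours; the hours are dropped.
    println(millisToDays(86400000L + 6L * 3600000L)) // prints 1
  }
}
```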
Github user javierluraschi commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230833894

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
```
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```
--- End diff --

Right... let me keep just the original PR commit for now. So yes, we could map to date, but we would lose the time component, so the best mapping we have is to `timestamp`. Let me check back in the `arrow` project as to why `POSIXct` is being mapped to `Date` rather than `Timestamp` (see [arrow/r/src/array.cpp#L461](https://github.com/apache/arrow/blob/3a1dd3feb9ca09f92168f46fcfc06c01305df3ec/r/src/array.cpp#L461)); if we can change that, then I would agree we don't need this change.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230698731

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
```
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```
--- End diff --

Yea, but a date is a date. Users who don't want truncation should use timestamp types in the first place. I don't think we should switch to a different type just to cover precision a date can't represent.
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230635196

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
```
@@ -414,6 +416,21 @@ final int getInt(int rowId) {
     }
   }

+  private static class DateMilliAccessor extends ArrowVectorAccessor {
+
+    private final DateMilliVector accessor;
+
+    DateMilliAccessor(DateMilliVector vector) {
+      super(vector);
+      this.accessor = vector;
+    }
+
+    @Override
+    final long getLong(int rowId) {
```
--- End diff --

We should use `getInt()` to represent the number of days from 1970-01-01 if we map the type to date type.
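To illustrate this suggestion, a hedged sketch (Scala against the Arrow Java API; the class name and the truncation step are assumptions, not the PR's code) of an accessor that exposes `DateMilliVector` values as days via `getInt()`:

```scala
import org.apache.arrow.vector.DateMilliVector

// Sketch: expose an Arrow DateMilliVector as Spark DateType values,
// i.e. whole days since 1970-01-01, through getInt() rather than getLong().
class DateMilliAsDaysAccessor(vector: DateMilliVector) {
  private val MillisPerDay = 86400000L

  // DateMilliVector.get(rowId) returns epoch milliseconds; truncate to days.
  def getInt(rowId: Int): Int =
    Math.floorDiv(vector.get(rowId), MillisPerDay).toInt
}
```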
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230628333

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
```
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```
--- End diff --

Notice that Spark doesn't have a date type with milliseconds, so if we want to map to date type, the hours, minutes, and so on will be lost. Otherwise we have to map to timestamp type. Which is the proper behavior? cc @BryanCutler
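To make the trade-off concrete, a small hedged illustration (plain Scala, assuming Spark's internal representations: days-since-epoch `Int` for `DateType`, microseconds-since-epoch `Long` for `TimestampType`):

```scala
// An Arrow Date(MILLISECOND) value for 1970-01-02T06:00:00Z.
val epochMillis: Long = 86400000L + 6L * 3600000L

// Candidate 1: map to DateType — truncate to whole days; the six hours are lost.
val asDateDays: Int = Math.floorDiv(epochMillis, 86400000L).toInt // 1

// Candidate 2: map to TimestampType — scale to microseconds; the time of day
// survives, but the declared Spark type is no longer a date.
val asTimestampMicros: Long = epochMillis * 1000L // 108000000000
```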
Github user javierluraschi commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230607581

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
```
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```
--- End diff --

Good catch, thanks. Yes, this should be mapped to `Date` in the Arrow schema, not `Timestamp`. To give more background: Arrow dates can have a unit of `DateUnit.DAY` or `DateUnit.MILLISECOND` (see [arrow/vector/types/DateUnit.java#L21-L22](https://github.com/apache/arrow/blob/73d379f4631cd3013371f60876a52615171e6c3b/java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java#L21-L22)). Currently, passing a date with milliseconds simply fails, so this change does not affect any other type conversion; it is fine to map all Arrow dates to Spark dates now that all cases are properly handled.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22913#discussion_r230544838

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
```
@@ -71,6 +71,7 @@ object ArrowUtils {
     case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
     case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
     case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+    case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
```
--- End diff --

Wait... is it correct to map it to `TimestampType`? It looks like this is why `Date` with `MILLISECOND` was not added in the first place.
GitHub user javierluraschi opened a pull request: https://github.com/apache/spark/pull/22913

[SPARK-25902][SQL] Add support for dates with milliseconds in Apache Arrow bindings

Currently, the Apache Arrow bindings for Java only support `Date` with the unit set to `DateUnit.DAY`, see [ArrowUtils.scala#L72](https://github.com/apache/spark/blob/8c2edf46d0f89e5ec54968218d89f30a3f8190bc/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala#L72). However, the Spark Arrow bindings for R are adding support to map `POSIXct` to `Date` using `DateUnit.MILLISECOND`, see https://github.com/rstudio/sparklyr/issues/1733.

This PR does not change existing functionality; rather, it adds support for the `DateUnit.MILLISECOND` unit in `Date`, to avoid hitting a `java.lang.UnsupportedOperationException: Unsupported data type: Date(MILLISECOND)` exception while loading an Arrow stream into Spark.

This patch was tested with https://github.com/rstudio/sparklyr/issues/1733, by building Spark from source and then:

```r
devtools::install_github("apache/arrow", subdir = "r")
devtools::install_github("rstudio/sparklyr", ref = "feature/arrow")

Sys.setenv("SPARK_HOME_VERSION" = "2.3.2")

library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local", spark_home = "")

dates <- data.frame(dates = c(
  as.POSIXlt(Sys.time(), "GMT"),
  as.POSIXlt(Sys.time(), "EST"))
)

dates_tbl <- sdf_copy_to(sc, dates, overwrite = T)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/javierluraschi/spark arrow-date-millis

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22913.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22913

commit f809942de6c241cc7b499c19d0250185ebe26122
Author: Javier Luraschi
Date: 2018-10-31T18:44:50Z

    add support for date milliseconds in arrow bindings
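For readers following along, here is a hedged sketch (plain Scala, not the PR's exact code; the function name `fromArrowTypeSketch` is made up) of the type-dispatch shape this PR extends, showing why an Arrow `Date(MILLISECOND)` previously fell through to the `UnsupportedOperationException` quoted above:

```scala
import org.apache.arrow.vector.types.{DateUnit, TimeUnit}
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types._

// Simplified sketch of the Arrow-to-Spark type dispatch in ArrowUtils.
def fromArrowTypeSketch(dt: ArrowType): DataType = dt match {
  case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
  case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
  case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
  // The PR's addition: without this case, Date(MILLISECOND) hits the fallback below.
  case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
  case _ => throw new UnsupportedOperationException(s"Unsupported data type: $dt")
}
```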