[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-09 Thread javierluraschi
Github user javierluraschi closed the pull request at:

https://github.com/apache/spark/pull/22913


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230953015
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

I think it should map to `Date` and any extra time would be truncated. 
Looking at the Arrow format from 
https://github.com/apache/arrow/blob/master/format/Schema.fbs
```
/// Date is either a 32-bit or 64-bit type representing elapsed time since UNIX
/// epoch (1970-01-01), stored in either of two units:
///
/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
///   leap seconds), where the values are evenly divisible by 86400000
/// * Days (32 bits) since the UNIX epoch
```
So it's expected to be a specific number of days without any additional 
milliseconds.
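
To make that guarantee concrete, here is a minimal Scala sketch (illustrative only; the object name, `MillisPerDay`, and the sample value are assumptions, not code from this PR) of truncating a `Date(MILLISECOND)` value to the day count that Spark's `DateType` stores:

```scala
object DateMilliTruncation {
  // Milliseconds in one day; Arrow promises that Date(MILLISECOND) values
  // are evenly divisible by this, so truncation loses nothing for valid data.
  val MillisPerDay: Long = 24L * 60 * 60 * 1000 // 86400000

  // floorDiv also handles pre-1970 (negative) values correctly.
  def millisToDays(millis: Long): Int =
    Math.floorDiv(millis, MillisPerDay).toInt

  def main(args: Array[String]): Unit = {
    val millis = 1541376000000L // 2018-11-05T00:00:00Z
    println(millisToDays(millis)) // 17840 days since 1970-01-01
  }
}
```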


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-05 Thread javierluraschi
Github user javierluraschi commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230833894
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Right... let me keep just the original PR commit for now. So yes, we could 
map to date, but we would lose the time component, so the best mapping we 
have is to `timestamp`.

Let me check back in the `arrow` project as to why `POSIXct` is being 
mapped to `Date` rather than `Timestamp` (see 
[arrow/r/src/array.cpp#L461](https://github.com/apache/arrow/blob/3a1dd3feb9ca09f92168f46fcfc06c01305df3ec/r/src/array.cpp#L461));
 if we can change that, then I would agree we don't need this change.


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230698731
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Yea, but a date is a date; users should use a timestamp type if they don't 
want truncation. I don't think we should switch the type just to cover 
extra capability it doesn't have.


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230635196
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -414,6 +416,21 @@ final int getInt(int rowId) {
     }
   }
 
+  private static class DateMilliAccessor extends ArrowVectorAccessor {
+
+    private final DateMilliVector accessor;
+
+    DateMilliAccessor(DateMilliVector vector) {
+      super(vector);
+      this.accessor = vector;
+    }
+
+    @Override
+    final long getLong(int rowId) {
--- End diff --

We should use `getInt()` to represent the number of days from 1970-01-01 if 
we map the type to date type.
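
For illustration, a hypothetical Scala sketch of that day-based read path — the class name and `MillisPerDay` constant are assumptions here, not code from the PR:

```scala
import org.apache.arrow.vector.DateMilliVector

// Hypothetical accessor shape: if Date(MILLISECOND) maps to DateType, the
// vector's long millisecond values should be exposed as int day counts via
// getInt(), matching DateType's internal day-based representation.
class DateMilliAsDaysAccessor(vector: DateMilliVector) {
  private val MillisPerDay = 86400000L

  def getInt(rowId: Int): Int =
    Math.floorDiv(vector.get(rowId), MillisPerDay).toInt
}
```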


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230628333
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Notice that Spark doesn't have a date type with milliseconds, so if we want 
to map to the date type, the hours, minutes, and so on will be lost; 
otherwise we have to map to the timestamp type.
Which is the proper behavior? cc @BryanCutler 
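
For comparison, a small sketch of what each candidate mapping does to a sample value (the numbers are illustrative; Spark's `TimestampType` stores microseconds and `DateType` stores days since the epoch):

```scala
val millis = 1541430000000L                 // 2018-11-05T15:00:00Z
val asTimestampMicros = millis * 1000L      // keeps the time of day: 1541430000000000L
val asDateDays = (millis / 86400000L).toInt // truncates to 17840, i.e. 2018-11-05
```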


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-04 Thread javierluraschi
Github user javierluraschi commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230607581
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Good catch, thanks. Yes, this should be mapped to `Date` in the Arrow 
schema, not `Timestamp`.

To give more background, Arrow dates can have a unit of `DateUnit.DAY` or 
`DateUnit.MILLISECOND` (see 
[arrow/vector/types/DateUnit.java#L21-L22](https://github.com/apache/arrow/blob/73d379f4631cd3013371f60876a52615171e6c3b/java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java#L21-L22)).
 Currently, if a date with milliseconds is passed, loading simply fails, so 
this change does not affect other type conversions; it is fine to map all 
Arrow dates to Spark dates since all cases are now properly handled.


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230544838
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Wait .. is it correct to map it to `TimestampType`? Looks like this is why 
`Date` with `MILLISECOND` was not added.


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-10-31 Thread javierluraschi
GitHub user javierluraschi opened a pull request:

https://github.com/apache/spark/pull/22913

[SPARK-25902][SQL] Add support for dates with milliseconds in Apache Arrow bindings

Currently, the Apache Arrow bindings for Java only support `Date` with the 
unit set to `DateUnit.DAY`, see 
[ArrowUtils.scala#L72](https://github.com/apache/spark/blob/8c2edf46d0f89e5ec54968218d89f30a3f8190bc/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala#L72).

However, the Spark Arrow bindings for `R` are adding support for mapping 
`POSIXct` to `Date` using `DateUnit.MILLISECOND`; see 
https://github.com/rstudio/sparklyr/issues/1733.

This PR does not change existing functionality but rather adds support for 
the `DateUnit.MILLISECOND` unit in `Date`, to avoid hitting a 
`java.lang.UnsupportedOperationException: Unsupported data type: 
Date(MILLISECOND)` exception while loading an Arrow stream into Spark.
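
For context, here is a condensed sketch of the pattern match in `ArrowUtils.fromArrowType` where that exception originates (abbreviated from the diff above; the fallback case shown is an assumption about the method's shape, consistent with the exception message):

```scala
def fromArrowType(dt: ArrowType): DataType = dt match {
  case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
  case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
  // The case added by this PR:
  case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
  case _ => throw new UnsupportedOperationException(s"Unsupported data type: $dt")
}
```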

This patch was tested against https://github.com/rstudio/sparklyr/issues/1733 
by building Spark from source and then running:

```r
devtools::install_github("apache/arrow", subdir = "r")
devtools::install_github("rstudio/sparklyr", ref = "feature/arrow")
Sys.setenv("SPARK_HOME_VERSION" = "2.3.2")

library(sparklyr)
library(arrow)
sc <- spark_connect(master = "local", spark_home = "")

dates <- data.frame(dates = c(
  as.POSIXlt(Sys.time(), "GMT"),
  as.POSIXlt(Sys.time(), "EST"))
)

dates_tbl <- sdf_copy_to(sc, dates, overwrite = T)
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/javierluraschi/spark arrow-date-millis

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22913.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22913


commit f809942de6c241cc7b499c19d0250185ebe26122
Author: Javier Luraschi 
Date:   2018-10-31T18:44:50Z

add support for date milliseconds in arrow bindings




---
