[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164971509
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java
 ---
@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
 if (columns[ordinal].isNullAt(rowId)) return null;
--- End diff --

Let me try to prepare a PR tonight.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20438


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164944493
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java
 ---
@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
 if (columns[ordinal].isNullAt(rowId)) return null;
--- End diff --

see https://github.com/apache/spark/pull/20438#discussion_r164719793 . In 
this PR I just fixed the returning-null issue for `getStruct` and 
`getInterval`, because they are non-abstract. There should be a follow-up to 
clearly document that `ColumnVector.getBinary/getUTF8String/...` should return 
null if this slot is null. Then we can remove these null checks here. I'd 
appreciate it if you have time to take this :)


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164943981
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java ---
@@ -139,9 +139,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
 if (data.getChild(ordinal).isNullAt(rowId)) return null;
--- End diff --

Can we remove this null check now?


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164943754
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java
 ---
@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
 if (columns[ordinal].isNullAt(rowId)) return null;
--- End diff --

Do we still need this null check?


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164942783
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);
 
   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. An interval type vector is the same as a struct type vector 
with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
+   * child vectors: the first child vector is an int type vector, 
containing all the month values of
+   * all the interval values in this vector. The second child vector is a 
long type vector,
+   * containing all the microsecond values of all the interval values in 
this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+if (isNullAt(rowId)) return null;
+final int months = getChild(0).getInt(rowId);
+final long microseconds = getChild(1).getLong(rowId);
+return new CalendarInterval(months, microseconds);
+  }
+
+  /**
+   * @return child [[ColumnVector]] at the given ordinal.
*/
-  public abstract ColumnVector getChild(int ordinal);
+  protected abstract ColumnVector getChild(int ordinal);
--- End diff --

added


---

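The two-child-vector layout described in the javadoc above can be sketched with self-contained stand-ins. These are simplified mock classes for illustration only, not the actual Spark `ColumnVector` or `CalendarInterval` implementations: the interval vector is backed by an int child (months), a long child (microseconds), and a null mask, and `getInterval` performs the null check before reading both children.

```java
public class IntervalVectorSketch {
    // Simplified stand-in for org.apache.spark.unsafe.types.CalendarInterval.
    static final class CalendarInterval {
        final int months;
        final long microseconds;
        CalendarInterval(int months, long microseconds) {
            this.months = months;
            this.microseconds = microseconds;
        }
        @Override public String toString() {
            return months + " months, " + microseconds + " us";
        }
    }

    // Simplified stand-in for an interval-typed ColumnVector: two "child"
    // arrays (months, microseconds) plus a per-row null mask.
    static final class IntervalVector {
        private final int[] months;
        private final long[] microseconds;
        private final boolean[] isNull;

        IntervalVector(int[] months, long[] microseconds, boolean[] isNull) {
            this.months = months;
            this.microseconds = microseconds;
            this.isNull = isNull;
        }

        boolean isNullAt(int rowId) { return isNull[rowId]; }

        // Mirrors the getInterval contract above: null check first,
        // then assemble the value from both children.
        CalendarInterval getInterval(int rowId) {
            if (isNullAt(rowId)) return null;
            return new CalendarInterval(months[rowId], microseconds[rowId]);
        }
    }

    public static void main(String[] args) {
        IntervalVector v = new IntervalVector(
            new int[] {1, 0, 14},
            new long[] {0L, 5_000_000L, 123L},
            new boolean[] {false, true, false});
        System.out.println(v.getInterval(0)); // 1 months, 0 us
        System.out.println(v.getInterval(1)); // null
        System.out.println(v.getInterval(2)); // 14 months, 123 us
    }
}
```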



[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164807808
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);
 
   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. An interval type vector is the same as a struct type vector 
with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
+   * child vectors: the first child vector is an int type vector, 
containing all the month values of
+   * all the interval values in this vector. The second child vector is a 
long type vector,
+   * containing all the microsecond values of all the interval values in 
this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+if (isNullAt(rowId)) return null;
+final int months = getChild(0).getInt(rowId);
+final long microseconds = getChild(1).getLong(rowId);
+return new CalendarInterval(months, microseconds);
+  }
+
+  /**
+   * @return child [[ColumnVector]] at the given ordinal.
*/
-  public abstract ColumnVector getChild(int ordinal);
+  protected abstract ColumnVector getChild(int ordinal);
--- End diff --

Since `ColumnVector` is public, could you add some description to the PR 
description about this kind of visibility change?


---

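The visibility change discussed here (a public final accessor composed from a `protected abstract getChild`) can be sketched with simplified stand-ins. These are illustrative mock classes, not Spark's actual `ColumnVector` hierarchy: subclasses in other packages must still implement `getChild`, while external callers only see the final `getIntervalParts` accessor.

```java
public class VisibilitySketch {
    static abstract class VectorBase {
        public abstract boolean isNullAt(int rowId);
        public abstract int getInt(int rowId);
        public abstract long getLong(int rowId);

        // protected: part of the implementation contract for subclasses,
        // hidden from callers in other packages.
        protected abstract VectorBase getChild(int ordinal);

        // final: the two-child interval layout is fixed by the base class.
        public final long[] getIntervalParts(int rowId) {
            if (isNullAt(rowId)) return null;
            return new long[] { getChild(0).getInt(rowId), getChild(1).getLong(rowId) };
        }
    }

    // Concrete interval vector backed by two flat child arrays.
    static final class IntervalVector extends VectorBase {
        private final int[] months;
        private final long[] micros;
        private final boolean[] nulls;

        IntervalVector(int[] months, long[] micros, boolean[] nulls) {
            this.months = months; this.micros = micros; this.nulls = nulls;
        }

        @Override public boolean isNullAt(int rowId) { return nulls[rowId]; }
        // The top-level interval vector is not a leaf; primitives live in children.
        @Override public int getInt(int rowId) { throw new UnsupportedOperationException(); }
        @Override public long getLong(int rowId) { throw new UnsupportedOperationException(); }

        @Override protected VectorBase getChild(int ordinal) {
            return new VectorBase() {
                @Override public boolean isNullAt(int r) { return nulls[r]; }
                @Override public int getInt(int r) { return months[r]; }   // child 0
                @Override public long getLong(int r) { return micros[r]; } // child 1
                @Override protected VectorBase getChild(int o) {
                    throw new UnsupportedOperationException("leaf vector");
                }
            };
        }
    }

    public static void main(String[] args) {
        IntervalVector v = new IntervalVector(
            new int[] {12}, new long[] {34L}, new boolean[] {false});
        long[] parts = v.getIntervalParts(0);
        System.out.println(parts[0] + " months, " + parts[1] + " us"); // 12 months, 34 us
    }
}
```

Note that Java's `protected` still grants same-package access, so the restriction only bites callers outside the package, which is the case that matters for a public API class.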



[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164806777
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);
 
   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. An interval type vector is the same as a struct type vector 
with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
+   * child vectors: the first child vector is an int type vector, 
containing all the month values of
+   * all the interval values in this vector. The second child vector is a 
long type vector,
+   * containing all the microsecond values of all the interval values in 
this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+if (isNullAt(rowId)) return null;
+final int months = getChild(0).getInt(rowId);
+final long microseconds = getChild(1).getLong(rowId);
+return new CalendarInterval(months, microseconds);
+  }
+
+  /**
+   * @return child [[ColumnVector]] at the given ordinal.
*/
-  public abstract ColumnVector getChild(int ordinal);
+  protected abstract ColumnVector getChild(int ordinal);
--- End diff --

Oh, I see. Now, it became `protected`.


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164802008
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);
 
   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. An interval type vector is the same as a struct type vector 
with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
--- End diff --

It's a little annoying to type `calendar interval type` all the time...


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164801042
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
*/
   public abstract byte[] getBinary(int rowId);
 
+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 
2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, 
containing all the month values of
+   * all the interval values in this vector. The second child vector is a 
long type vector,
+   * containing all the microsecond values of all the interval values in 
this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+if (isNullAt(rowId)) return null;
+final int months = getChild(0).getInt(rowId);
--- End diff --

It's from the previous code; probably it tries to make the JVM happy and 
run the code faster.


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164769822
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);
 
   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. An interval type vector is the same as a struct type vector 
with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
--- End diff --

nit: `interval type` -> `calendar interval type`


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164730300
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java ---
@@ -135,9 +135,7 @@ public UTF8String getUTF8String(int ordinal) {
 
   @Override
   public CalendarInterval getInterval(int ordinal) {
-int month = data.getChild(0).getInt(offset + ordinal);
-long microseconds = data.getChild(1).getLong(offset + ordinal);
-return new CalendarInterval(month, microseconds);
+return data.getInterval(offset + ordinal);
--- End diff --

Good catch! I'm going to require `ColumnVector.getXXX` to return null if 
the value is null, but I'll do it in another PR, to update all the documents 
and define the behavior of batched `getXXX` methods.


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164728359
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java 
---
@@ -87,7 +87,7 @@ public static CalendarInterval fromString(String s) {
 }
   }
 
-  public static long toLongWithRange(String fieldName,
+  private static long toLongWithRange(String fieldName,
--- End diff --

Why?! It's much harder (if at all possible) to test `private` methods (I've 
been bitten a few times this week and remember the pain).


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164729684
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
*/
   public abstract byte[] getBinary(int rowId);
 
+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 
2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, 
containing all the month values of
+   * all the interval values in this vector. The second child vector is a 
long type vector,
+   * containing all the microsecond values of all the interval values in 
this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+if (isNullAt(rowId)) return null;
+final int months = getChild(0).getInt(rowId);
+final long microseconds = getChild(1).getLong(rowId);
+return new CalendarInterval(months, microseconds);
+  }
+
   /**
* Returns the ordinal's child column vector.
--- End diff --

`@return [[ColumnVector]] at the ordinal`


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164729086
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
*/
   public abstract byte[] getBinary(int rowId);
 
+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 
2 fields: `months` and
--- End diff --

"**An** interval" + "is **the** same as"


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164729429
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
*/
   public abstract byte[] getBinary(int rowId);
 
+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 
2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, 
containing all the month values of
+   * all the interval values in this vector. The second child vector is a 
long type vector,
+   * containing all the microsecond values of all the interval values in 
this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+if (isNullAt(rowId)) return null;
+final int months = getChild(0).getInt(rowId);
--- End diff --

What's the purpose of `final` keyword here?


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164729250
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
*/
   public abstract byte[] getBinary(int rowId);
 
+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value 
representing the number of
+   * months in this interval, and a long value representing the number of 
microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 
2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link 
#getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, 
containing all the month values of
--- End diff --

is **an** int type vector


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164717772
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -195,6 +196,7 @@
* struct field.
*/
   public final ColumnarRow getStruct(int rowId) {
+if (isNullAt(rowId)) return null;
--- End diff --

Good catch!


---




[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20438#discussion_r164719793
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java ---
@@ -135,9 +135,7 @@ public UTF8String getUTF8String(int ordinal) {
 
   @Override
   public CalendarInterval getInterval(int ordinal) {
-int month = data.getChild(0).getInt(offset + ordinal);
-long microseconds = data.getChild(1).getLong(offset + ordinal);
-return new CalendarInterval(month, microseconds);
+return data.getInterval(offset + ordinal);
--- End diff --

We should insert `if (data.isNullAt(offset + ordinal)) return null;` to be 
consistent with the other `ColumnarXxx`s?
Or I guess we can remove these null checks from all the other `ColumnarXxx`s 
and leave it to `ColumnVector.getInterval()`?


---

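The second option ueshin describes, doing the null check once in `ColumnVector.getInterval` so array views can delegate unconditionally, can be sketched with simplified stand-ins (illustrative mocks, not Spark's actual `ColumnVector`/`ColumnarArray`; the interval value is modeled here as a `long[]` pair for brevity):

```java
public class ArrayViewSketch {
    // Simplified interval vector: parallel month/microsecond arrays plus a null mask.
    static final class IntervalVector {
        final int[] months;
        final long[] micros;
        final boolean[] isNull;

        IntervalVector(int[] months, long[] micros, boolean[] isNull) {
            this.months = months; this.micros = micros; this.isNull = isNull;
        }

        boolean isNullAt(int rowId) { return isNull[rowId]; }

        // The null check lives here, in exactly one place.
        long[] getInterval(int rowId) {
            if (isNullAt(rowId)) return null;
            return new long[] { months[rowId], micros[rowId] };
        }
    }

    // Array view over the slice [offset, offset + length), analogous to
    // ColumnarArray: it delegates rather than re-reading the child vectors.
    static final class ArrayView {
        final IntervalVector data;
        final int offset;
        final int length;

        ArrayView(IntervalVector data, int offset, int length) {
            this.data = data; this.offset = offset; this.length = length;
        }

        long[] getInterval(int ordinal) {
            // No local null check needed: data.getInterval already returns null.
            return data.getInterval(offset + ordinal);
        }
    }

    public static void main(String[] args) {
        IntervalVector v = new IntervalVector(
            new int[] {1, 2, 3}, new long[] {10L, 20L, 30L},
            new boolean[] {false, true, false});
        ArrayView view = new ArrayView(v, 1, 2);
        System.out.println(view.getInterval(0));    // null (underlying row 1 is null)
        System.out.println(view.getInterval(1)[0]); // 3
    }
}
```

The trade-off in the thread is exactly this: either every `ColumnarXxx` view repeats the null check, or the check is centralized in the vector accessor and the views stay one-liners.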



[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...

2018-01-30 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20438

[SPARK-23272][SQL] add calendar interval type support to ColumnVector

## What changes were proposed in this pull request?

`ColumnVector` is meant to support all the data types, but 
`CalendarIntervalType` is missing. Actually we already support interval type 
for inner fields, e.g. `ColumnarRow` and `ColumnarArray` both support interval 
type, so it's odd that we don't support it at the top level.

This PR adds the interval type support.

## How was this patch tested?

A new test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark interval

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20438.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20438


commit 4e538913901de6cb439e9919e958a60ab698fe24
Author: Wenchen Fan 
Date:   2018-01-30T11:30:34Z

add calendar interval type support to ColumnVector




---
