[GitHub] spark pull request #20438: [SPARK-23272][SQL] add calendar interval type sup...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164971509

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java ---
@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
     if (columns[ordinal].isNullAt(rowId)) return null;
--- End diff --

Let me try to prepare a PR tonight.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20438
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164944493

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java ---
@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
     if (columns[ordinal].isNullAt(rowId)) return null;
--- End diff --

See https://github.com/apache/spark/pull/20438#discussion_r164719793. In this PR I just fixed the returning-null issue for `getStruct` and `getInterval`, because they are non-abstract. There should be a follow-up to clearly document that `ColumnVector.getBinary/getUTF8String/...` should return null if this slot is null. Then we can remove these null checks here. I'd appreciate it if you have time to take this :)
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164943981

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java ---
@@ -139,9 +139,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
     if (data.getChild(ordinal).isNullAt(rowId)) return null;
--- End diff --

Can we remove this null check now?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164943754

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnarRow.java ---
@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
     if (columns[ordinal].isNullAt(rowId)) return null;
--- End diff --

Do we still need this null check?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164942783

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);

   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
+   * child vectors: the first child vector is an int type vector, containing all the month values of
+   * all the interval values in this vector. The second child vector is a long type vector,
+   * containing all the microsecond values of all the interval values in this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+    if (isNullAt(rowId)) return null;
+    final int months = getChild(0).getInt(rowId);
+    final long microseconds = getChild(1).getLong(rowId);
+    return new CalendarInterval(months, microseconds);
+  }
+
+  /**
+   * @return child [[ColumnVector]] at the given ordinal.
    */
-  public abstract ColumnVector getChild(int ordinal);
+  protected abstract ColumnVector getChild(int ordinal);
--- End diff --

added
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164807808

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);

   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
+   * child vectors: the first child vector is an int type vector, containing all the month values of
+   * all the interval values in this vector. The second child vector is a long type vector,
+   * containing all the microsecond values of all the interval values in this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+    if (isNullAt(rowId)) return null;
+    final int months = getChild(0).getInt(rowId);
+    final long microseconds = getChild(1).getLong(rowId);
+    return new CalendarInterval(months, microseconds);
+  }
+
+  /**
+   * @return child [[ColumnVector]] at the given ordinal.
    */
-  public abstract ColumnVector getChild(int ordinal);
+  protected abstract ColumnVector getChild(int ordinal);
--- End diff --

Since `ColumnVector` is public, could you add some description in the PR description for this kind of visibility change?
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164806777

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);

   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
+   * child vectors: the first child vector is an int type vector, containing all the month values of
+   * all the interval values in this vector. The second child vector is a long type vector,
+   * containing all the microsecond values of all the interval values in this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+    if (isNullAt(rowId)) return null;
+    final int months = getChild(0).getInt(rowId);
+    final long microseconds = getChild(1).getLong(rowId);
+    return new CalendarInterval(months, microseconds);
+  }
+
+  /**
+   * @return child [[ColumnVector]] at the given ordinal.
    */
-  public abstract ColumnVector getChild(int ordinal);
+  protected abstract ColumnVector getChild(int ordinal);
--- End diff --

Oh, I see. Now it became `protected`.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164802008

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);

   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
--- End diff --

It's a little annoying to type `calendar interval type` all the time...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164801042

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
    */
   public abstract byte[] getBinary(int rowId);

+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, containing all the month values of
+   * all the interval values in this vector. The second child vector is a long type vector,
+   * containing all the microsecond values of all the interval values in this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+    if (isNullAt(rowId)) return null;
+    final int months = getChild(0).getInt(rowId);
--- End diff --

It's from the previous code; probably it tries to make the JVM happy and run the code faster.
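The `getInterval` method quoted above assembles an interval from two child vectors: child 0 holds the month counts and child 1 the microsecond counts. A minimal self-contained sketch of that layout, using simplified stand-in classes (`IntervalValue` and `IntervalVector` are illustrative names, not Spark's actual API):

```java
// Stand-in for Spark's CalendarInterval: a month count plus a microsecond count.
class IntervalValue {
    final int months;
    final long microseconds;

    IntervalValue(int months, long microseconds) {
        this.months = months;
        this.microseconds = microseconds;
    }
}

// An interval vector laid out like a struct vector with two children:
// an int column for the months and a long column for the microseconds.
class IntervalVector {
    private final int[] months;
    private final long[] microseconds;
    private final boolean[] isNull;

    IntervalVector(int[] months, long[] microseconds, boolean[] isNull) {
        this.months = months;
        this.microseconds = microseconds;
        this.isNull = isNull;
    }

    // Mirrors the null-check-then-assemble pattern in the diff above.
    IntervalValue getInterval(int rowId) {
        if (isNull[rowId]) return null;
        return new IntervalValue(months[rowId], microseconds[rowId]);
    }
}
```

Reading a populated slot yields an interval assembled from the two columns, while a null slot yields `null`, matching the contract discussed in this thread.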
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164769822

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
   public abstract byte[] getBinary(int rowId);

   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
--- End diff --

nit: `interval type` -> `calendar interval type`
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164730300

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java ---
@@ -135,9 +135,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
-    int month = data.getChild(0).getInt(offset + ordinal);
-    long microseconds = data.getChild(1).getLong(offset + ordinal);
-    return new CalendarInterval(month, microseconds);
+    return data.getInterval(offset + ordinal);
--- End diff --

Good catch! I'm going to require `ColumnVector.getXXX` to return null if the value is null, but I'll do it in another PR, to update all the documents and define the behavior of batched `getXXX` methods.
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164728359

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java ---
@@ -87,7 +87,7 @@ public static CalendarInterval fromString(String s) {
     }
   }

-  public static long toLongWithRange(String fieldName,
+  private static long toLongWithRange(String fieldName,
--- End diff --

Why?! It's much harder (if at all possible) to test `private` methods (been bitten a few times this week and remember the pain).
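For context, `toLongWithRange` parses one textual interval field and validates it against a range. A hedged sketch of such a helper (the class name and body are illustrative, not Spark's actual implementation); keeping it package-private rather than `private` is one way to keep it directly testable from the same package, which is the concern raised above:

```java
class IntervalFieldParser {
    // Package-private (no access modifier) instead of private: tests that live
    // in the same package can call it directly, addressing the testability concern.
    static long toLongWithRange(String fieldName, String s, long minValue, long maxValue) {
        long result = 0;
        if (s != null) {
            result = Long.parseLong(s);
            if (result < minValue || result > maxValue) {
                throw new IllegalArgumentException(
                    fieldName + " " + result + " outside range [" + minValue + ", " + maxValue + "]");
            }
        }
        return result;
    }
}
```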
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164729684

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
    */
   public abstract byte[] getBinary(int rowId);

+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, containing all the month values of
+   * all the interval values in this vector. The second child vector is a long type vector,
+   * containing all the microsecond values of all the interval values in this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+    if (isNullAt(rowId)) return null;
+    final int months = getChild(0).getInt(rowId);
+    final long microseconds = getChild(1).getLong(rowId);
+    return new CalendarInterval(months, microseconds);
+  }
+
   /**
    * Returns the ordinal's child column vector.
--- End diff --

`@return [[ColumnVector]] at the ordinal`
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164729086

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
    */
   public abstract byte[] getBinary(int rowId);

+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 2 fields: `months` and
--- End diff --

"**An** interval" + "is **the** same as"
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164729429

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
    */
   public abstract byte[] getBinary(int rowId);

+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, containing all the month values of
+   * all the interval values in this vector. The second child vector is a long type vector,
+   * containing all the microsecond values of all the interval values in this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+    if (isNullAt(rowId)) return null;
+    final int months = getChild(0).getInt(rowId);
--- End diff --

What's the purpose of `final` keyword here?
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164729250

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -235,10 +237,30 @@ public MapData getMap(int ordinal) {
    */
   public abstract byte[] getBinary(int rowId);

+  /**
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, calendar interval type value is basically an integer value representing the number of
+   * months in this interval, and a long value representing the number of microseconds in this
+   * interval. A interval type vector is same as a struct type vector with 2 fields: `months` and
+   * `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2
+   * child vectors: the first child vector is a int type vector, containing all the month values of
--- End diff --

is **an** int type vector
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164717772

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -195,6 +196,7 @@
    * struct field.
    */
   public final ColumnarRow getStruct(int rowId) {
+    if (isNullAt(rowId)) return null;
--- End diff --

Good catch!
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20438#discussion_r164719793

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java ---
@@ -135,9 +135,7 @@ public UTF8String getUTF8String(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
-    int month = data.getChild(0).getInt(offset + ordinal);
-    long microseconds = data.getChild(1).getLong(offset + ordinal);
-    return new CalendarInterval(month, microseconds);
+    return data.getInterval(offset + ordinal);
--- End diff --

We should insert `if (data.isNullAt(offset + ordinal)) return null;` to be consistent with other `ColumnarXxx`s? Or I guess we can remove these null-checks from all other `ColumnarXxx`s and leave it to `ColumnVector.getInterval()`?
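The question above is whether a wrapper such as `ColumnarArray` still needs its own null check before delegating, given that the new `ColumnVector.getInterval` already returns null for null slots. A small stand-in sketch (hypothetical `BaseVector`/`ArrayView` classes, not Spark's real ones) illustrating why the delegation alone suffices:

```java
// Hypothetical stand-ins; Spark's real classes are ColumnVector and ColumnarArray.
class Interval {
    final int months;
    final long microseconds;

    Interval(int months, long microseconds) {
        this.months = months;
        this.microseconds = microseconds;
    }
}

class BaseVector {
    private final int[] months = {1, 7};
    private final long[] micros = {10L, 20L};
    private final boolean[] nulls = {true, false};

    // The base accessor already returns null for null slots...
    Interval getInterval(int rowId) {
        if (nulls[rowId]) return null;
        return new Interval(months[rowId], micros[rowId]);
    }
}

class ArrayView {
    private final BaseVector data;
    private final int offset;

    ArrayView(BaseVector data, int offset) {
        this.data = data;
        this.offset = offset;
    }

    // ...so the view can delegate without repeating the null check,
    // which is the second option suggested in the comment above.
    Interval getInterval(int ordinal) {
        return data.getInterval(offset + ordinal);
    }
}
```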
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20438

[SPARK-23272][SQL] add calendar interval type support to ColumnVector

## What changes were proposed in this pull request?

`ColumnVector` is aimed to support all the data types, but `CalendarIntervalType` is missing. We do support interval type for inner fields, e.g. `ColumnarRow` and `ColumnarArray` both support interval type, so it's weird not to support interval type at the top level. This PR adds the interval type support.

## How was this patch tested?

A new test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark interval

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20438.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20438

commit 4e538913901de6cb439e9919e958a60ab698fe24
Author: Wenchen Fan
Date: 2018-01-30T11:30:34Z

    add calendar interval type support to ColumnVector