[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray
GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20395

[SPARK-23218][SQL] simplify ColumnVector.getArray

## What changes were proposed in this pull request?

`ColumnVector` is very flexible about how to implement array type. As a result `ColumnVector` has 3 abstract methods for array type: `arrayData`, `getArrayOffset`, `getArrayLength`. For example, in `WritableColumnVector` we use the first child vector as the array data vector, and store offsets and lengths in 2 arrays in the parent vector. `ArrowColumnVector` has a different implementation.

This PR simplifies `ColumnVector` by using only one abstract method for array type: `getArray`.

## How was this patch tested?

existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark vector

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20395.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20395

commit f3ca6c73e86928d0a087fbfd36de968ae873bbe3
Author: Wenchen Fan
Date: 2018-01-25T14:30:57Z

    simplify ColumnVector.getArray

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
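The three-methods-to-one change described above can be sketched with a toy model. This is not Spark's real API: `ToyArrayVector`, the flat `int[]` backing, and the simplified `ColumnarArray` below are made up for illustration. The point it shows is the one the PR makes: once `getArray(rowId)` bundles the data vector, offset, and length into a single returned view, callers no longer need the three separate abstract methods.

```java
import java.util.Arrays;

public class GetArraySketch {
  // Toy stand-in for ColumnarArray: a (data, offset, length) view over a data vector.
  static final class ColumnarArray {
    final int[] data;   // stand-in for the child data ColumnVector
    final int offset;
    final int length;
    ColumnarArray(int[] data, int offset, int length) {
      this.data = data; this.offset = offset; this.length = length;
    }
    int[] toIntArray() {
      return Arrays.copyOfRange(data, offset, offset + length);
    }
  }

  // Toy array-type vector mirroring the WritableColumnVector layout described in the
  // PR text: element data in one flat child array, per-row offsets and lengths in the parent.
  static final class ToyArrayVector {
    final int[] data;
    final int[] offsets;
    final int[] lengths;
    ToyArrayVector(int[] data, int[] offsets, int[] lengths) {
      this.data = data; this.offsets = offsets; this.lengths = lengths;
    }
    // The single entry point the PR moves to: one method instead of
    // arrayData() + getArrayOffset() + getArrayLength().
    ColumnarArray getArray(int rowId) {
      return new ColumnarArray(data, offsets[rowId], lengths[rowId]);
    }
  }

  public static void main(String[] args) {
    // rows: [1, 2], [3], [4, 5, 6]
    ToyArrayVector v = new ToyArrayVector(
        new int[]{1, 2, 3, 4, 5, 6},
        new int[]{0, 2, 3},
        new int[]{2, 1, 3});
    System.out.println(Arrays.toString(v.getArray(2).toIntArray()));  // prints [4, 5, 6]
  }
}
```

An implementation with a different physical layout (such as `ArrowColumnVector`) can build the same view from its own buffers, which is why one abstract method suffices.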
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r163915680

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) {
     }

     @Override
-    final int getArrayLength(int rowId) {
-      return accessor.getInnerValueCountAt(rowId);
-    }
-
-    @Override
-    final int getArrayOffset(int rowId) {
-      return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+    final ColumnarArray getArray(int rowId) {
+      int index = rowId * accessor.OFFSET_WIDTH;
+      int start = offsets.getInt(index);
+      int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

Does this code work when `rowId == arrayLength`? In other words, is the length of `offsetBuffer` equal to `(rowId + 1) * accessor.OFFSET_WIDTH`? cc: @ueshin

---
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r163965864

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -479,10 +475,10 @@ object ColumnarBatchBenchmark {
     Array Vector Read:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-    On Heap Read Size Only                   416 /  423        393.5           2.5       1.0X
-    Off Heap Read Size Only                  396 /  404        413.6           2.4       1.1X
-    On Heap Read Elements                   2569 / 2590         63.8          15.7       0.2X
-    Off Heap Read Elements                  3302 /              49.6          20.2       0.1X
+    On Heap Read Size Only                   426 /  437        384.9           2.6       1.0X
+    Off Heap Read Size Only                  406 /  421        404.0           2.5       1.0X
+    On Heap Read Elements                   2636 / 2642         62.2          16.1       0.2X
+    Off Heap Read Elements                  3770 / 3774         43.5          23.0       0.1X
--- End diff --

Thank you for updating this in the PR. It seems to cause a little performance degradation. Could you update `OrcReadBenchmark` and `FilterPushdownBenchmark`, too?

---
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164030610

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -479,10 +475,10 @@ object ColumnarBatchBenchmark {
     Array Vector Read:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-    On Heap Read Size Only                   416 /  423        393.5           2.5       1.0X
-    Off Heap Read Size Only                  396 /  404        413.6           2.4       1.1X
-    On Heap Read Elements                   2569 / 2590         63.8          15.7       0.2X
-    Off Heap Read Elements                  3302 /              49.6          20.2       0.1X
+    On Heap Read Size Only                   426 /  437        384.9           2.6       1.0X
+    Off Heap Read Size Only                  406 /  421        404.0           2.5       1.0X
+    On Heap Read Elements                   2636 / 2642         62.2          16.1       0.2X
+    Off Heap Read Elements                  3770 / 3774         43.5          23.0       0.1X
--- End diff --

orc and parquet vectorized readers don't support array type yet.

---
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164030712

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) {
     }

     @Override
-    final int getArrayLength(int rowId) {
-      return accessor.getInnerValueCountAt(rowId);
-    }
-
-    @Override
-    final int getArrayOffset(int rowId) {
-      return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+    final ColumnarArray getArray(int rowId) {
+      int index = rowId * accessor.OFFSET_WIDTH;
+      int start = offsets.getInt(index);
+      int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

I believe it is, by looking at `accessor.getInnerValueCountAt`. cc @BryanCutler for confirmation.

---
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164047732

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -433,10 +418,14 @@ final long getLong(int rowId) {
   private static class ArrayAccessor extends ArrowVectorAccessor {

     private final ListVector accessor;
+    private final ArrowBuf offsets;
+    private final ArrowColumnVector arrayData;

     ArrayAccessor(ListVector vector) {
       super(vector);
       this.accessor = vector;
+      this.offsets = vector.getOffsetBuffer();
--- End diff --

Maybe `vector.getOffsetBuffer()` will change when loading the next batch so we can't reuse the `ArrowBuf`.

---
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164049147

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -433,10 +418,14 @@ final long getLong(int rowId) {
   private static class ArrayAccessor extends ArrowVectorAccessor {

     private final ListVector accessor;
+    private final ArrowBuf offsets;
+    private final ArrowColumnVector arrayData;

     ArrayAccessor(ListVector vector) {
       super(vector);
       this.accessor = vector;
+      this.offsets = vector.getOffsetBuffer();
--- End diff --

good catch!

---
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164065557

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -182,57 +187,54 @@
   }

   /**
-   * Returns the length of the array for rowId.
-   */
-  public abstract int getArrayLength(int rowId);
-
-  /**
-   * Returns the offset of the array for rowId.
-   */
-  public abstract int getArrayOffset(int rowId);
-
-  /**
-   * Returns the struct for rowId.
+   * Returns the struct type value for rowId.
+   *
+   * To support struct type, implementations must implement {@link #getChild(int)} and make this
+   * vector a tree structure. The number of child vectors must be same as the number of fields of
+   * the struct type, and each child vector is responsible to store the data for its corresponding
+   * struct field.
    */
   public final ColumnarRow getStruct(int rowId) {
     return new ColumnarRow(this, rowId);
   }

   /**
-   * Returns the array for rowId.
+   * Returns the array type value for rowId.
+   *
+   * To support array type, implementations must construct an {@link ColumnarArray} and return it in
+   * this method. {@link ColumnarArray} requires a {@link ColumnVector} that stores the data of all
--- End diff --

Shall we move the details of `ColumnarArray` implementation to the `ColumnarArray` class?

---
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164066651

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -450,13 +437,12 @@ final boolean isNullAt(int rowId) {
     }

     @Override
-    final int getArrayLength(int rowId) {
-      return accessor.getInnerValueCountAt(rowId);
-    }
-
-    @Override
-    final int getArrayOffset(int rowId) {
-      return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+    final ColumnarArray getArray(int rowId) {
+      ArrowBuf offsets = accessor.getOffsetBuffer();
+      int index = rowId * accessor.OFFSET_WIDTH;
+      int start = offsets.getInt(index);
+      int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

At the last rowId, what does `offsets.getInt(index + accessor.OFFSET_WIDTH)` return?

---
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164067505

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -602,7 +603,17 @@ public final int appendStruct(boolean isNull) {
   // `WritableColumnVector` puts the data of array in the first child column vector, and puts the
   // array offsets and lengths in the current column vector.
   @Override
-  public WritableColumnVector arrayData() { return childColumns[0]; }
+  public final ColumnarArray getArray(int rowId) {
+    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+  }
+
+  public WritableColumnVector arrayData() {
+    return childColumns[0];
+  }
+
+  public abstract int getArrayLength(int rowId);
+
+  public abstract int getArrayOffset(int rowId);
--- End diff --

Shall we make these two methods private/protected?

---
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164086245

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -602,7 +603,17 @@ public final int appendStruct(boolean isNull) {
   // `WritableColumnVector` puts the data of array in the first child column vector, and puts the
   // array offsets and lengths in the current column vector.
   @Override
-  public WritableColumnVector arrayData() { return childColumns[0]; }
+  public final ColumnarArray getArray(int rowId) {
+    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+  }
+
+  public WritableColumnVector arrayData() {
+    return childColumns[0];
+  }
+
+  public abstract int getArrayLength(int rowId);
+
+  public abstract int getArrayOffset(int rowId);
--- End diff --

`WritableColumnVector` is an internal class, so I want to keep the visibility wider for the ease of use.

---
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164086579

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -450,13 +437,12 @@ final boolean isNullAt(int rowId) {
     }

     @Override
-    final int getArrayLength(int rowId) {
-      return accessor.getInnerValueCountAt(rowId);
-    }
-
-    @Override
-    final int getArrayOffset(int rowId) {
-      return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+    final ColumnarArray getArray(int rowId) {
+      ArrowBuf offsets = accessor.getOffsetBuffer();
+      int index = rowId * accessor.OFFSET_WIDTH;
+      int start = offsets.getInt(index);
+      int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

the end offset of the last record. see https://arrow.apache.org/docs/memory_layout.html#list-type

---
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164086846

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -182,57 +187,54 @@
   }

   /**
-   * Returns the length of the array for rowId.
-   */
-  public abstract int getArrayLength(int rowId);
-
-  /**
-   * Returns the offset of the array for rowId.
-   */
-  public abstract int getArrayOffset(int rowId);
-
-  /**
-   * Returns the struct for rowId.
+   * Returns the struct type value for rowId.
+   *
+   * To support struct type, implementations must implement {@link #getChild(int)} and make this
+   * vector a tree structure. The number of child vectors must be same as the number of fields of
+   * the struct type, and each child vector is responsible to store the data for its corresponding
+   * struct field.
    */
   public final ColumnarRow getStruct(int rowId) {
     return new ColumnarRow(this, rowId);
   }

   /**
-   * Returns the array for rowId.
+   * Returns the array type value for rowId.
+   *
+   * To support array type, implementations must construct an {@link ColumnarArray} and return it in
+   * this method. {@link ColumnarArray} requires a {@link ColumnVector} that stores the data of all
--- End diff --

The comments here are mostly about how to construct the `ColumnarArray`, not about `ColumnarArray` itself, so I'd like to keep them here.

---
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164088592

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -602,7 +603,17 @@ public final int appendStruct(boolean isNull) {
   // `WritableColumnVector` puts the data of array in the first child column vector, and puts the
   // array offsets and lengths in the current column vector.
   @Override
-  public WritableColumnVector arrayData() { return childColumns[0]; }
+  public final ColumnarArray getArray(int rowId) {
+    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+  }
+
+  public WritableColumnVector arrayData() {
+    return childColumns[0];
+  }
+
+  public abstract int getArrayLength(int rowId);
+
+  public abstract int getArrayOffset(int rowId);
--- End diff --

Ok. Makes sense.

---
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164160869

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -479,10 +475,10 @@ object ColumnarBatchBenchmark {
     Array Vector Read:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-    On Heap Read Size Only                   416 /  423        393.5           2.5       1.0X
-    Off Heap Read Size Only                  396 /  404        413.6           2.4       1.1X
-    On Heap Read Elements                   2569 / 2590         63.8          15.7       0.2X
-    Off Heap Read Elements                  3302 /              49.6          20.2       0.1X
+    On Heap Read Size Only                   426 /  437        384.9           2.6       1.0X
+    Off Heap Read Size Only                  406 /  421        404.0           2.5       1.0X
+    On Heap Read Elements                   2636 / 2642         62.2          16.1       0.2X
+    Off Heap Read Elements                  3770 / 3774         43.5          23.0       0.1X
--- End diff --

Right. I see.

---
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20395

---
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20395#discussion_r164249345

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) {
     }

     @Override
-    final int getArrayLength(int rowId) {
-      return accessor.getInnerValueCountAt(rowId);
-    }
-
-    @Override
-    final int getArrayOffset(int rowId) {
-      return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+    final ColumnarArray getArray(int rowId) {
+      int index = rowId * accessor.OFFSET_WIDTH;
+      int start = offsets.getInt(index);
+      int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

Yes, the code is correct. `offsetBuffer` will be sized to at least `(numRows + 1) * accessor.OFFSET_WIDTH`.

---