[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-25 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20395

[SPARK-23218][SQL] simplify ColumnVector.getArray

## What changes were proposed in this pull request?

`ColumnVector` is very flexible about how to implement array type. As a 
result `ColumnVector` has 3 abstract methods for array type: `arrayData`, 
`getArrayOffset`, `getArrayLength`. For example, in `WritableColumnVector` we 
use the first child vector as the array data vector, and store offsets and 
lengths in 2 arrays in the parent vector. `ArrowColumnVector` has a different 
implementation.

This PR simplifies `ColumnVector` by using only one abstract method for array 
type: `getArray`.

## How was this patch tested?

Existing tests.
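For illustration, here is a minimal, self-contained sketch of the shape the PR moves to. The names (`ToyArrayVector`, `GetArrayDemo`) are hypothetical, not Spark's real classes: the vector keeps its offsets and lengths internally and exposes a single `getArray` per-row view, instead of the three abstract methods `arrayData`, `getArrayOffset`, and `getArrayLength`.

```java
import java.util.Arrays;

// Toy model of the simplification. A WritableColumnVector-style layout:
// all elements live back-to-back in a child data array, and the parent
// keeps per-row offsets and lengths.
final class ToyArrayVector {
    private final int[] data;     // child "array data" vector
    private final int[] offsets;  // per-row start positions
    private final int[] lengths;  // per-row element counts

    ToyArrayVector(int[] data, int[] offsets, int[] lengths) {
        this.data = data;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    // The single method the PR keeps: one call returns the row's view,
    // standing in for Spark's ColumnarArray.
    int[] getArray(int rowId) {
        int start = offsets[rowId];
        return Arrays.copyOfRange(data, start, start + lengths[rowId]);
    }
}

public class GetArrayDemo {
    public static void main(String[] args) {
        // two rows: [1, 2, 3] and [4, 5]
        ToyArrayVector v = new ToyArrayVector(
            new int[] {1, 2, 3, 4, 5},
            new int[] {0, 3},
            new int[] {3, 2});
        System.out.println(Arrays.toString(v.getArray(0))); // [1, 2, 3]
        System.out.println(Arrays.toString(v.getArray(1))); // [4, 5]
    }
}
```

The point of the change is that each implementation (writable vectors, Arrow vectors) can keep whatever internal layout it likes, as long as it can hand back the per-row view.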

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark vector

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20395


commit f3ca6c73e86928d0a087fbfd36de968ae873bbe3
Author: Wenchen Fan 
Date:   2018-01-25T14:30:57Z

simplify ColumnVector.getArray




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-25 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r163915680
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) {
 }
 
 @Override
-final int getArrayLength(int rowId) {
-  return accessor.getInnerValueCountAt(rowId);
-}
-
-@Override
-final int getArrayOffset(int rowId) {
-  return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+final ColumnarArray getArray(int rowId) {
+  int index = rowId * accessor.OFFSET_WIDTH;
+  int start = offsets.getInt(index);
+  int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

Does this code work when `rowId == arrayLength`? In other words, is the 
length of `offsetBuffer` equal to `(rowId + 1) * accessor.OFFSET_WIDTH`?
cc: @ueshin 


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r163965864
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -479,10 +475,10 @@ object ColumnarBatchBenchmark {
 
 Array Vector Read:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------
-On Heap Read Size Only                        416 /  423        393.5           2.5       1.0X
-Off Heap Read Size Only                       396 /  404        413.6           2.4       1.1X
-On Heap Read Elements                        2569 / 2590         63.8          15.7       0.2X
-Off Heap Read Elements                       3302 /              49.6          20.2       0.1X
+On Heap Read Size Only                        426 /  437        384.9           2.6       1.0X
+Off Heap Read Size Only                       406 /  421        404.0           2.5       1.0X
+On Heap Read Elements                        2636 / 2642         62.2          16.1       0.2X
+Off Heap Read Elements                       3770 / 3774         43.5          23.0       0.1X
--- End diff --

Thank you for updating this in the PR. It seems to cause a little 
performance degradation.
Could you update `OrcReadBenchmark` and `FilterPushdownBenchmark`, too?


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164030610
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -479,10 +475,10 @@ object ColumnarBatchBenchmark {
 
 Array Vector Read:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------
-On Heap Read Size Only                        416 /  423        393.5           2.5       1.0X
-Off Heap Read Size Only                       396 /  404        413.6           2.4       1.1X
-On Heap Read Elements                        2569 / 2590         63.8          15.7       0.2X
-Off Heap Read Elements                       3302 /              49.6          20.2       0.1X
+On Heap Read Size Only                        426 /  437        384.9           2.6       1.0X
+Off Heap Read Size Only                       406 /  421        404.0           2.5       1.0X
+On Heap Read Elements                        2636 / 2642         62.2          16.1       0.2X
+Off Heap Read Elements                       3770 / 3774         43.5          23.0       0.1X
--- End diff --

The ORC and Parquet vectorized readers don't support array type yet.


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164030712
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) {
 }
 
 @Override
-final int getArrayLength(int rowId) {
-  return accessor.getInnerValueCountAt(rowId);
-}
-
-@Override
-final int getArrayOffset(int rowId) {
-  return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+final ColumnarArray getArray(int rowId) {
+  int index = rowId * accessor.OFFSET_WIDTH;
+  int start = offsets.getInt(index);
+  int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

I believe it is, by looking at `accessor.getInnerValueCountAt`. cc 
@BryanCutler for confirmation.


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-25 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164047732
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -433,10 +418,14 @@ final long getLong(int rowId) {
   private static class ArrayAccessor extends ArrowVectorAccessor {
 
 private final ListVector accessor;
+private final ArrowBuf offsets;
+private final ArrowColumnVector arrayData;
 
 ArrayAccessor(ListVector vector) {
   super(vector);
   this.accessor = vector;
+  this.offsets = vector.getOffsetBuffer();
--- End diff --

Maybe `vector.getOffsetBuffer()` will change when loading the next batch, so 
we can't reuse the `ArrowBuf`.
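A toy illustration of the hazard (hypothetical names, not Arrow's real classes): if the accessor caches the offset buffer at construction time and the underlying vector swaps buffers when the next batch loads, the cached reference silently keeps serving the old batch's offsets.

```java
// Sketch of why caching the offset buffer in the accessor's constructor
// is unsafe when the vector replaces its buffer per batch.
public class StaleBufferDemo {
    static class ToyListVector {
        private int[] offsetBuffer = {0, 2, 5};      // first batch
        int[] getOffsetBuffer() { return offsetBuffer; }
        void loadNextBatch() { offsetBuffer = new int[] {0, 4, 6}; }
    }

    public static void main(String[] args) {
        ToyListVector vector = new ToyListVector();
        int[] cached = vector.getOffsetBuffer();     // cached once: unsafe
        vector.loadNextBatch();
        // stale: the cached buffer still holds the first batch's offsets
        System.out.println(cached[1]);                   // 2
        System.out.println(vector.getOffsetBuffer()[1]); // 4
    }
}
```

Fetching the buffer on each `getArray` call, as the final patch does, avoids the stale reference.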


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164049147
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -433,10 +418,14 @@ final long getLong(int rowId) {
   private static class ArrayAccessor extends ArrowVectorAccessor {
 
 private final ListVector accessor;
+private final ArrowBuf offsets;
+private final ArrowColumnVector arrayData;
 
 ArrayAccessor(ListVector vector) {
   super(vector);
   this.accessor = vector;
+  this.offsets = vector.getOffsetBuffer();
--- End diff --

good catch!


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164065557
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -182,57 +187,54 @@
   }
 
   /**
-   * Returns the length of the array for rowId.
-   */
-  public abstract int getArrayLength(int rowId);
-
-  /**
-   * Returns the offset of the array for rowId.
-   */
-  public abstract int getArrayOffset(int rowId);
-
-  /**
-   * Returns the struct for rowId.
+   * Returns the struct type value for rowId.
+   *
+   * To support struct type, implementations must implement {@link #getChild(int)} and make this
+   * vector a tree structure. The number of child vectors must be same as the number of fields of
+   * the struct type, and each child vector is responsible to store the data for its corresponding
+   * struct field.
*/
   public final ColumnarRow getStruct(int rowId) {
 return new ColumnarRow(this, rowId);
   }
 
   /**
-   * Returns the array for rowId.
+   * Returns the array type value for rowId.
+   *
+   * To support array type, implementations must construct an {@link ColumnarArray} and return it in
+   * this method. {@link ColumnarArray} requires a {@link ColumnVector} that stores the data of all
--- End diff --

Shall we move the details of `ColumnarArray` implementation to 
`ColumnarArray` class?


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164066651
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -450,13 +437,12 @@ final boolean isNullAt(int rowId) {
 }
 
 @Override
-final int getArrayLength(int rowId) {
-  return accessor.getInnerValueCountAt(rowId);
-}
-
-@Override
-final int getArrayOffset(int rowId) {
-  return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+final ColumnarArray getArray(int rowId) {
+  ArrowBuf offsets = accessor.getOffsetBuffer();
+  int index = rowId * accessor.OFFSET_WIDTH;
+  int start = offsets.getInt(index);
+  int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

At the last rowId, what does `offsets.getInt(index + accessor.OFFSET_WIDTH)` return?


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164067505
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -602,7 +603,17 @@ public final int appendStruct(boolean isNull) {
  // `WritableColumnVector` puts the data of array in the first child column vector, and puts the
  // array offsets and lengths in the current column vector.
   @Override
-  public WritableColumnVector arrayData() { return childColumns[0]; }
+  public final ColumnarArray getArray(int rowId) {
+return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+  }
+
+  public WritableColumnVector arrayData() {
+return childColumns[0];
+  }
+
+  public abstract int getArrayLength(int rowId);
+
+  public abstract int getArrayOffset(int rowId);
--- End diff --

Shall we make these two methods private/protected?


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164086245
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -602,7 +603,17 @@ public final int appendStruct(boolean isNull) {
  // `WritableColumnVector` puts the data of array in the first child column vector, and puts the
  // array offsets and lengths in the current column vector.
   @Override
-  public WritableColumnVector arrayData() { return childColumns[0]; }
+  public final ColumnarArray getArray(int rowId) {
+return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+  }
+
+  public WritableColumnVector arrayData() {
+return childColumns[0];
+  }
+
+  public abstract int getArrayLength(int rowId);
+
+  public abstract int getArrayOffset(int rowId);
--- End diff --

`WritableColumnVector` is an internal class, so I want to keep the 
visibility wider for ease of use.


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164086579
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -450,13 +437,12 @@ final boolean isNullAt(int rowId) {
 }
 
 @Override
-final int getArrayLength(int rowId) {
-  return accessor.getInnerValueCountAt(rowId);
-}
-
-@Override
-final int getArrayOffset(int rowId) {
-  return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+final ColumnarArray getArray(int rowId) {
+  ArrowBuf offsets = accessor.getOffsetBuffer();
+  int index = rowId * accessor.OFFSET_WIDTH;
+  int start = offsets.getInt(index);
+  int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

It's the end offset of the last record. See 
https://arrow.apache.org/docs/memory_layout.html#list-type
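A small sketch of that layout, using plain arrays rather than `ArrowBuf`: for `n` rows the offsets buffer holds `n + 1` entries, so `offsets[rowId + 1]` is valid even for the last row, and each row's length falls out of adjacent offsets.

```java
// Arrow-style list layout: n rows => n + 1 offsets. The extra trailing
// entry is the end offset of the last record.
public class ListOffsetsDemo {
    public static void main(String[] args) {
        // values buffer for the lists [[7, 8], [], [9, 10, 11]]
        int[] values = {7, 8, 9, 10, 11};
        // 3 rows => 4 offsets; offsets[3] marks the end of the last row
        int[] offsets = {0, 2, 2, 5};
        for (int rowId = 0; rowId < 3; rowId++) {
            int start = offsets[rowId];
            int end = offsets[rowId + 1]; // safe even when rowId == n - 1
            System.out.println("row " + rowId + " length " + (end - start));
        }
    }
}
```

This is why `offsets.getInt(index + accessor.OFFSET_WIDTH)` in the patch never reads past the buffer for the last row.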


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164086846
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -182,57 +187,54 @@
   }
 
   /**
-   * Returns the length of the array for rowId.
-   */
-  public abstract int getArrayLength(int rowId);
-
-  /**
-   * Returns the offset of the array for rowId.
-   */
-  public abstract int getArrayOffset(int rowId);
-
-  /**
-   * Returns the struct for rowId.
+   * Returns the struct type value for rowId.
+   *
+   * To support struct type, implementations must implement {@link #getChild(int)} and make this
+   * vector a tree structure. The number of child vectors must be same as the number of fields of
+   * the struct type, and each child vector is responsible to store the data for its corresponding
+   * struct field.
*/
   public final ColumnarRow getStruct(int rowId) {
 return new ColumnarRow(this, rowId);
   }
 
   /**
-   * Returns the array for rowId.
+   * Returns the array type value for rowId.
+   *
+   * To support array type, implementations must construct an {@link ColumnarArray} and return it in
+   * this method. {@link ColumnarArray} requires a {@link ColumnVector} that stores the data of all
--- End diff --

The comments here are mostly about how to construct the `ColumnarArray`, 
not about `ColumnarArray` itself, so I'd like to keep them here.


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164088592
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -602,7 +603,17 @@ public final int appendStruct(boolean isNull) {
  // `WritableColumnVector` puts the data of array in the first child column vector, and puts the
  // array offsets and lengths in the current column vector.
   @Override
-  public WritableColumnVector arrayData() { return childColumns[0]; }
+  public final ColumnarArray getArray(int rowId) {
+return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+  }
+
+  public WritableColumnVector arrayData() {
+return childColumns[0];
+  }
+
+  public abstract int getArrayLength(int rowId);
+
+  public abstract int getArrayOffset(int rowId);
--- End diff --

OK, makes sense.


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164160869
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -479,10 +475,10 @@ object ColumnarBatchBenchmark {
 
 Array Vector Read:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------
-On Heap Read Size Only                        416 /  423        393.5           2.5       1.0X
-Off Heap Read Size Only                       396 /  404        413.6           2.4       1.1X
-On Heap Read Elements                        2569 / 2590         63.8          15.7       0.2X
-Off Heap Read Elements                       3302 /              49.6          20.2       0.1X
+On Heap Read Size Only                        426 /  437        384.9           2.6       1.0X
+Off Heap Read Size Only                       406 /  421        404.0           2.5       1.0X
+On Heap Read Elements                        2636 / 2642         62.2          16.1       0.2X
+Off Heap Read Elements                       3770 / 3774         43.5          23.0       0.1X
--- End diff --

Right. I see.


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20395


---




[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray

2018-01-26 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20395#discussion_r164249345
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) {
 }
 
 @Override
-final int getArrayLength(int rowId) {
-  return accessor.getInnerValueCountAt(rowId);
-}
-
-@Override
-final int getArrayOffset(int rowId) {
-  return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH);
+final ColumnarArray getArray(int rowId) {
+  int index = rowId * accessor.OFFSET_WIDTH;
+  int start = offsets.getInt(index);
+  int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
--- End diff --

Yes, the code is correct. `offsetBuffer` will be sized to at least 
`(numRows + 1) * accessor.OFFSET_WIDTH`.
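A minimal demonstration of that sizing invariant, using a plain `ByteBuffer` in place of `ArrowBuf` (the constant name mirrors `accessor.OFFSET_WIDTH`; everything else is assumed for illustration): with `(numRows + 1) * OFFSET_WIDTH` bytes, reading `getInt(index + OFFSET_WIDTH)` for the last row stays in bounds.

```java
import java.nio.ByteBuffer;

// Boundary check for the last row: the offset buffer holds one more
// int than there are rows, so the "end" read never goes out of bounds.
public class OffsetWidthDemo {
    static final int OFFSET_WIDTH = 4; // bytes per int offset

    public static void main(String[] args) {
        int numRows = 3;
        // sized (numRows + 1) * OFFSET_WIDTH, holding offsets 0, 2, 2, 5
        ByteBuffer offsets = ByteBuffer.allocate((numRows + 1) * OFFSET_WIDTH);
        for (int o : new int[] {0, 2, 2, 5}) {
            offsets.putInt(o);
        }

        int rowId = numRows - 1;                        // last row
        int index = rowId * OFFSET_WIDTH;
        int start = offsets.getInt(index);              // 2
        int end = offsets.getInt(index + OFFSET_WIDTH); // 5, still in bounds
        System.out.println(start + ".." + end);         // 2..5
    }
}
```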


---
