[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20116


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159472368
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
@@ -248,7 +248,10 @@ public void enableReturningBatches() {
* Advances to the next batch of rows. Returns false if there are no 
more.
*/
   public boolean nextBatch() throws IOException {
-columnarBatch.reset();
+for (WritableColumnVector vector : columnVectors) {
--- End diff --

This is the standard Java foreach code style.
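For illustration, the enhanced-for style being discussed can be sketched with a simplified stand-in for `WritableColumnVector` (the class below is hypothetical, for illustration only, not Spark's real API):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal hypothetical stand-in for Spark's WritableColumnVector.
class SimpleWritableVector {
    private final List<Integer> values = new ArrayList<>();

    void appendInt(int v) { values.add(v); }
    int size() { return values.size(); }

    // reset() clears buffered values so the vector can be refilled for the next batch.
    void reset() { values.clear(); }
}

public class ForeachResetSketch {
    public static void main(String[] args) {
        SimpleWritableVector[] columnVectors =
            { new SimpleWritableVector(), new SimpleWritableVector() };
        columnVectors[0].appendInt(1);
        columnVectors[1].appendInt(2);

        // The enhanced-for style from the review, replacing columnarBatch.reset():
        for (SimpleWritableVector vector : columnVectors) {
            vector.reset();
        }

        for (SimpleWritableVector vector : columnVectors) {
            if (vector.size() != 0) throw new AssertionError("vector not reset");
        }
        System.out.println("all vectors reset");
    }
}
```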


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159469632
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java ---
@@ -14,26 +14,18 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import java.util.*;
 
 import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.execution.vectorized.MutableColumnarRow;
 import org.apache.spark.sql.types.StructType;
 
 /**
- * This class is the in memory representation of rows as they are streamed 
through operators. It
- * is designed to maximize CPU efficiency and not storage footprint. Since 
it is expected that
- * each operator allocates one of these objects, the storage footprint on 
the task is negligible.
- *
- * The layout is a columnar with values encoded in their native format. 
Each RowBatch contains
- * a horizontal partitioning of the data, split into columns.
- *
- * The ColumnarBatch supports either on heap or offheap modes with 
(mostly) the identical API.
- *
- * TODO:
- *  - There are many TODOs for the existing APIs. They should throw a not 
implemented exception.
- *  - Compaction: The batch and columns should be able to compact based on 
a selection vector.
+ * This class is a wrapper of multiple ColumnVectors and represents a 
logical table-like data
--- End diff --

How about?
> This class wraps multiple ColumnVectors as a row-wise table


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159469970
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java ---
@@ -87,19 +79,7 @@ public void remove() {
   }
 
   /**
-   * Resets the batch for writing.
-   */
-  public void reset() {
-for (int i = 0; i < numCols(); ++i) {
-  if (columns[i] instanceof WritableColumnVector) {
-((WritableColumnVector) columns[i]).reset();
-  }
-}
-this.numRows = 0;
-  }
-
-  /**
-   * Sets the number of rows that are valid.
+   * Sets the number of rows that are valid in this batch.
--- End diff --

How about?
> Sets the number of rows


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159470181
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -14,32 +14,39 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
- * This class represents in-memory values of a column and provides the 
main APIs to access the data.
- * It supports all the types and contains get APIs as well as their 
batched versions. The batched
- * versions are considered to be faster and preferable whenever possible.
+ * An interface representing in-memory columnar data in Spark. This 
interface defines the main APIs
+ * to access the data, as well as their batched versions. The batched 
versions are considered to be
+ * faster and preferable whenever possible.
  *
- * To handle nested schemas, ColumnVector has two types: Arrays and 
Structs. In both cases these
- * columns have child columns. All of the data are stored in the child 
columns and the parent column
- * only contains nullability. In the case of Arrays, the lengths and 
offsets are saved in the child
- * column and are encoded identically to INTs.
+ * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
+ * in this ColumnVector.
  *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To 
handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct 
type, it stores the actual
+ * data of each field in the corresponding child ColumnVector, and only 
store null information in
+ * the parent ColumnVector. For array type, it stores the actual array 
elements in the child
+ * ColumnVector, and store null information, array offsets and lengths in 
the parent ColumnVector.
  *
- * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
- * in the current batch.
+ * ColumnVector is expected to be reused during the entire data loading 
process, to avoid allocating
+ * memory again and again.
+ *
+ * ColumnVector is meant to maximize CPU efficiency but not to minimize 
storage footprint,
+ * implementations should prefer computing efficiency over storage 
efficiency when design the
--- End diff --

`implementations ` -> `Implementations `


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159465600
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
@@ -248,7 +248,10 @@ public void enableReturningBatches() {
* Advances to the next batch of rows. Returns false if there are no 
more.
*/
   public boolean nextBatch() throws IOException {
-columnarBatch.reset();
+for (WritableColumnVector vector : columnVectors) {
--- End diff --

Remove the space before `:`


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159470325
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -14,32 +14,39 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
- * This class represents in-memory values of a column and provides the 
main APIs to access the data.
- * It supports all the types and contains get APIs as well as their 
batched versions. The batched
- * versions are considered to be faster and preferable whenever possible.
+ * An interface representing in-memory columnar data in Spark. This 
interface defines the main APIs
+ * to access the data, as well as their batched versions. The batched 
versions are considered to be
+ * faster and preferable whenever possible.
  *
- * To handle nested schemas, ColumnVector has two types: Arrays and 
Structs. In both cases these
- * columns have child columns. All of the data are stored in the child 
columns and the parent column
- * only contains nullability. In the case of Arrays, the lengths and 
offsets are saved in the child
- * column and are encoded identically to INTs.
+ * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
+ * in this ColumnVector.
  *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To 
handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct 
type, it stores the actual
+ * data of each field in the corresponding child ColumnVector, and only 
store null information in
--- End diff --

`store ` -> `stores`


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159470383
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -14,32 +14,39 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
- * This class represents in-memory values of a column and provides the 
main APIs to access the data.
- * It supports all the types and contains get APIs as well as their 
batched versions. The batched
- * versions are considered to be faster and preferable whenever possible.
+ * An interface representing in-memory columnar data in Spark. This 
interface defines the main APIs
+ * to access the data, as well as their batched versions. The batched 
versions are considered to be
+ * faster and preferable whenever possible.
  *
- * To handle nested schemas, ColumnVector has two types: Arrays and 
Structs. In both cases these
- * columns have child columns. All of the data are stored in the child 
columns and the parent column
- * only contains nullability. In the case of Arrays, the lengths and 
offsets are saved in the child
- * column and are encoded identically to INTs.
+ * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
+ * in this ColumnVector.
  *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To 
handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct 
type, it stores the actual
+ * data of each field in the corresponding child ColumnVector, and only 
store null information in
+ * the parent ColumnVector. For array type, it stores the actual array 
elements in the child
+ * ColumnVector, and store null information, array offsets and lengths in 
the parent ColumnVector.
--- End diff --

`store ` -> `stores`
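For context, the array-type layout described in the quoted javadoc (parent stores nulls, offsets and lengths; child stores the actual elements) can be sketched with a hypothetical, self-contained model (not Spark's real ColumnVector):

```java
import java.util.Arrays;

// Hypothetical sketch of the array-type layout: the parent vector stores null
// information, array offsets and lengths, while the child stores the elements.
public class ArrayVectorSketch {
    private final boolean[] isNull;   // parent: null information per row
    private final int[] offsets;      // parent: start offset into child data
    private final int[] lengths;      // parent: element count per row
    private final int[] childData;    // child: the actual array elements

    ArrayVectorSketch(boolean[] isNull, int[] offsets, int[] lengths, int[] childData) {
        this.isNull = isNull;
        this.offsets = offsets;
        this.lengths = lengths;
        this.childData = childData;
    }

    // Returns the array at rowId, or null if that row is null.
    int[] getArray(int rowId) {
        if (isNull[rowId]) return null;
        int start = offsets[rowId];
        return Arrays.copyOfRange(childData, start, start + lengths[rowId]);
    }

    public static void main(String[] args) {
        // Three rows: row 0 = [10, 20], row 1 = null, row 2 = [30]
        ArrayVectorSketch v = new ArrayVectorSketch(
            new boolean[] {false, true, false},
            new int[] {0, 0, 2},
            new int[] {2, 0, 1},
            new int[] {10, 20, 30});
        if (!Arrays.equals(v.getArray(0), new int[] {10, 20})) throw new AssertionError();
        if (v.getArray(1) != null) throw new AssertionError();
        if (!Arrays.equals(v.getArray(2), new int[] {30})) throw new AssertionError();
        System.out.println("array layout sketch OK");
    }
}
```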
  


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159467403
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java ---
@@ -14,26 +14,18 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import java.util.*;
 
 import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.execution.vectorized.MutableColumnarRow;
 import org.apache.spark.sql.types.StructType;
 
 /**
- * This class is the in memory representation of rows as they are streamed 
through operators. It
- * is designed to maximize CPU efficiency and not storage footprint. Since 
it is expected that
- * each operator allocates one of these objects, the storage footprint on 
the task is negligible.
- *
- * The layout is a columnar with values encoded in their native format. 
Each RowBatch contains
- * a horizontal partitioning of the data, split into columns.
- *
- * The ColumnarBatch supports either on heap or offheap modes with 
(mostly) the identical API.
- *
- * TODO:
- *  - There are many TODOs for the existing APIs. They should throw a not 
implemented exception.
- *  - Compaction: The batch and columns should be able to compact based on 
a selection vector.
+ * This class is a wrapper of multiple ColumnVectors and represents a 
logical table-like data
+ * structure. It provides a row-view of this batch so that Spark can 
access the data row by row.
--- End diff --

`row-view` -> `row view`


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-03 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159392782
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -14,32 +14,38 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
- * This class represents in-memory values of a column and provides the 
main APIs to access the data.
- * It supports all the types and contains get APIs as well as their 
batched versions. The batched
- * versions are considered to be faster and preferable whenever possible.
+ * An interface representing in-memory columnar data in Spark. This 
interface defines the main APIs
+ * to access the data, as well as their batched versions. The batched 
versions are considered to be
+ * faster and preferable whenever possible.
  *
- * To handle nested schemas, ColumnVector has two types: Arrays and 
Structs. In both cases these
- * columns have child columns. All of the data are stored in the child 
columns and the parent column
- * only contains nullability. In the case of Arrays, the lengths and 
offsets are saved in the child
- * column and are encoded identically to INTs.
+ * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
+ * in this ColumnVector.
  *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To 
handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct 
type, it stores the actual
--- End diff --

Sorry for my mistake.


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159363145
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -14,32 +14,38 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
- * This class represents in-memory values of a column and provides the 
main APIs to access the data.
- * It supports all the types and contains get APIs as well as their 
batched versions. The batched
- * versions are considered to be faster and preferable whenever possible.
+ * An interface representing in-memory columnar data in Spark. This 
interface defines the main APIs
+ * to access the data, as well as their batched versions. The batched 
versions are considered to be
+ * faster and preferable whenever possible.
  *
- * To handle nested schemas, ColumnVector has two types: Arrays and 
Structs. In both cases these
- * columns have child columns. All of the data are stored in the child 
columns and the parent column
- * only contains nullability. In the case of Arrays, the lengths and 
offsets are saved in the child
- * column and are encoded identically to INTs.
+ * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
+ * in this ColumnVector.
  *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To 
handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct 
type, it stores the actual
--- End diff --

it's already `children`


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159260224
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -586,7 +587,7 @@ public final int appendStruct(boolean isNull) {
 if (isNull) {
   appendNull();
   for (ColumnVector c: childColumns) {
-if (c.type instanceof StructType) {
+if (c.dataType() instanceof StructType) {
--- End diff --

nit: Which accessor should we use for `ColumnVector.type`: `dataType()` or `type`?
For example, `OnHeapColumnVector.reserveInternal()` uses `type` while this line uses `dataType()`.
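The consistency question can be sketched with a tiny hypothetical model (these classes are illustrative, not Spark's real ones): a protected field and its public accessor observe the same state, so the nit is purely about picking one style:

```java
// Hypothetical model of the ColumnVector.type accessor question.
class VectorModel {
    protected final String type;              // direct field access, as in reserveInternal()

    VectorModel(String type) { this.type = type; }

    public String dataType() { return type; } // public accessor, as in appendStruct()
}

public class AccessorStyleSketch {
    public static void main(String[] args) {
        VectorModel v = new VectorModel("struct");
        // Both forms return the same value; using one consistently aids readability.
        if (!v.dataType().equals(v.type)) throw new AssertionError();
        System.out.println("field and accessor agree: " + v.dataType());
    }
}
```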


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159248950
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -14,32 +14,38 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
- * This class represents in-memory values of a column and provides the 
main APIs to access the data.
- * It supports all the types and contains get APIs as well as their 
batched versions. The batched
- * versions are considered to be faster and preferable whenever possible.
+ * An interface representing in-memory columnar data in Spark. This 
interface defines the main APIs
+ * to access the data, as well as their batched versions. The batched 
versions are considered to be
+ * faster and preferable whenever possible.
  *
- * To handle nested schemas, ColumnVector has two types: Arrays and 
Structs. In both cases these
- * columns have child columns. All of the data are stored in the child 
columns and the parent column
- * only contains nullability. In the case of Arrays, the lengths and 
offsets are saved in the child
- * column and are encoded identically to INTs.
+ * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
+ * in this ColumnVector.
  *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To 
handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct 
type, it stores the actual
+ * data of each field in the corresponding child ColumnVector, and only 
store null information in
+ * the parent ColumnVector. For array type, it stores the actual array 
elements in the child
+ * ColumnVector, and store null information, array offsets and lengths in 
the parent ColumnVector.
  *
- * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
- * in the current batch.
+ * ColumnVector is expected to be reused during the entire data loading 
process, to avoid allocating
+ * memory again and again.
+ *
+ * ColumnVector is meant to maximize CPU efficiency and not storage 
footprint, implementations
--- End diff --

nit: not -> not to minimize


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159248670
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -14,32 +14,38 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.vectorized;
 
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
- * This class represents in-memory values of a column and provides the 
main APIs to access the data.
- * It supports all the types and contains get APIs as well as their 
batched versions. The batched
- * versions are considered to be faster and preferable whenever possible.
+ * An interface representing in-memory columnar data in Spark. This 
interface defines the main APIs
+ * to access the data, as well as their batched versions. The batched 
versions are considered to be
+ * faster and preferable whenever possible.
  *
- * To handle nested schemas, ColumnVector has two types: Arrays and 
Structs. In both cases these
- * columns have child columns. All of the data are stored in the child 
columns and the parent column
- * only contains nullability. In the case of Arrays, the lengths and 
offsets are saved in the child
- * column and are encoded identically to INTs.
+ * Most of the APIs take the rowId as a parameter. This is the batch local 
0-based row id for values
+ * in this ColumnVector.
  *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To 
handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct 
type, it stores the actual
--- End diff --

nit: child -> children


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2018-01-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159227467
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java
 ---
@@ -87,19 +79,7 @@ public void remove() {
   }
 
   /**
-   * Resets the batch for writing.
-   */
-  public void reset() {
-for (int i = 0; i < numCols(); ++i) {
-  if (columns[i] instanceof WritableColumnVector) {
-((WritableColumnVector) columns[i]).reset();
-  }
-}
-this.numRows = 0;
--- End diff --

This doesn't matter. `numRows` is only used when calling `rowsIterator`, and we always call `setNumRows` before calling `rowsIterator`.
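A minimal sketch of why this is safe (a hypothetical batch model, not Spark's `ColumnarBatch`): as long as `setNumRows()` runs before `rowsIterator()`, a stale `numRows` from a previous batch is never observed:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of ColumnarBatch's row iteration.
class BatchModel {
    private int numRows;

    void setNumRows(int numRows) { this.numRows = numRows; }

    // rowsIterator() only reads numRows at call time, so a stale value from a
    // previous batch is harmless if setNumRows() is always called first.
    Iterator<Integer> rowsIterator() {
        List<Integer> rowIds = new ArrayList<>();
        for (int i = 0; i < numRows; i++) rowIds.add(i);
        return rowIds.iterator();
    }
}

public class NumRowsSketch {
    public static void main(String[] args) {
        BatchModel batch = new BatchModel();
        batch.setNumRows(3);            // first batch
        int count = 0;
        for (Iterator<Integer> it = batch.rowsIterator(); it.hasNext(); it.next()) count++;
        if (count != 3) throw new AssertionError();

        batch.setNumRows(2);            // next batch: no reset of numRows needed first
        count = 0;
        for (Iterator<Integer> it = batch.rowsIterator(); it.hasNext(); it.next()) count++;
        if (count != 2) throw new AssertionError();
        System.out.println("setNumRows before rowsIterator yields correct counts");
    }
}
```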


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2017-12-30 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159119348
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnVector.java
 ---
@@ -14,32 +14,38 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.sources.v2.vectorized;
--- End diff --

+1


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2017-12-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159112727
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java
 ---
@@ -87,19 +79,7 @@ public void remove() {
   }
 
   /**
-   * Resets the batch for writing.
-   */
-  public void reset() {
-for (int i = 0; i < numCols(); ++i) {
-  if (columns[i] instanceof WritableColumnVector) {
-((WritableColumnVector) columns[i]).reset();
-  }
-}
-this.numRows = 0;
--- End diff --

This can result in an incorrect `numRows`. Could it be a potential error?


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2017-12-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159112274
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnVector.java
 ---
@@ -14,32 +14,38 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.sources.v2.vectorized;
--- End diff --

Is the v2 package proper for them? Do ColumnVector and related classes belong
to the data source v2 API?


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2017-12-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159112087
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java
 ---
@@ -14,26 +14,18 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.sources.v2.vectorized;
 
 import java.util.*;
 
 import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.execution.vectorized.MutableColumnarRow;
 import org.apache.spark.sql.types.StructType;
 
 /**
- * This class is the in memory representation of rows as they are streamed 
through operators. It
- * is designed to maximize CPU efficiency and not storage footprint. Since 
it is expected that
- * each operator allocates one of these objects, the storage footprint on 
the task is negligible.
- *
- * The layout is a columnar with values encoded in their native format. 
Each RowBatch contains
- * a horizontal partitioning of the data, split into columns.
- *
- * The ColumnarBatch supports either on heap or offheap modes with 
(mostly) the identical API.
- *
- * TODO:
- *  - There are many TODOs for the existing APIs. They should throw a not 
implemented exception.
- *  - Compaction: The batch and columns should be able to compact based on 
a selection vector.
+ * This class is a wrapper of multiple ColumnVectors and represents a 
table. It provides a row-view
--- End diff --

a table -> a portion of a table?


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2017-12-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159081495
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java
 ---
@@ -87,19 +79,7 @@ public void remove() {
   }
 
   /**
-   * Resets the batch for writing.
-   */
-  public void reset() {
--- End diff --

+1


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2017-12-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20116#discussion_r159079805
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java
 ---
@@ -87,19 +79,7 @@ public void remove() {
   }
 
   /**
-   * Resets the batch for writing.
-   */
-  public void reset() {
--- End diff --

remove this as it's for `WritableColumnVector` only


---




[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public

2017-12-29 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20116

[SPARK-20960][SQL] make ColumnVector public

## What changes were proposed in this pull request?

Move `ColumnVector` and related classes to
`org.apache.spark.sql.sources.v2.vectorized`, and improve the documentation.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark column-vector

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20116


commit 7dc4496a00ae97c2f686403092883184a65e6bfb
Author: Wenchen Fan 
Date:   2017-12-29T07:27:25Z

make ColumnVector public




---
