[GitHub] spark pull request #20116: [SPARK-20960][SQL] make ColumnVector public
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20116 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159472368 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java --- @@ -248,7 +248,10 @@ public void enableReturningBatches() { * Advances to the next batch of rows. Returns false if there are no more. */ public boolean nextBatch() throws IOException { -columnarBatch.reset(); +for (WritableColumnVector vector : columnVectors) { --- End diff -- This is the standard java foreach code style --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159469632 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java --- @@ -14,26 +14,18 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import java.util.*; import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.execution.vectorized.MutableColumnarRow; import org.apache.spark.sql.types.StructType; /** - * This class is the in memory representation of rows as they are streamed through operators. It - * is designed to maximize CPU efficiency and not storage footprint. Since it is expected that - * each operator allocates one of these objects, the storage footprint on the task is negligible. - * - * The layout is a columnar with values encoded in their native format. Each RowBatch contains - * a horizontal partitioning of the data, split into columns. - * - * The ColumnarBatch supports either on heap or offheap modes with (mostly) the identical API. - * - * TODO: - * - There are many TODOs for the existing APIs. They should throw a not implemented exception. - * - Compaction: The batch and columns should be able to compact based on a selection vector. + * This class is a wrapper of multiple ColumnVectors and represents a logical table-like data --- End diff -- How about? > This class wraps multiple ColumnVectors as a row-wise table
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159469970 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java --- @@ -87,19 +79,7 @@ public void remove() { } /** - * Resets the batch for writing. - */ - public void reset() { -for (int i = 0; i < numCols(); ++i) { - if (columns[i] instanceof WritableColumnVector) { -((WritableColumnVector) columns[i]).reset(); - } -} -this.numRows = 0; - } - - /** - * Sets the number of rows that are valid. + * Sets the number of rows that are valid in this batch. --- End diff -- How about? > Sets the number of rows
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159470181 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java --- @@ -14,32 +14,39 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataType; import org.apache.spark.sql.types.Decimal; import org.apache.spark.unsafe.types.UTF8String; /** - * This class represents in-memory values of a column and provides the main APIs to access the data. - * It supports all the types and contains get APIs as well as their batched versions. The batched - * versions are considered to be faster and preferable whenever possible. + * An interface representing in-memory columnar data in Spark. This interface defines the main APIs + * to access the data, as well as their batched versions. The batched versions are considered to be + * faster and preferable whenever possible. * - * To handle nested schemas, ColumnVector has two types: Arrays and Structs. In both cases these - * columns have child columns. All of the data are stored in the child columns and the parent column - * only contains nullability. In the case of Arrays, the lengths and offsets are saved in the child - * column and are encoded identically to INTs. + * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values + * in this ColumnVector. * - * Maps are just a special case of a two field struct. + * ColumnVector supports all the data types including nested types. To handle nested types, + * ColumnVector can have children and is a tree structure. For struct type, it stores the actual + * data of each field in the corresponding child ColumnVector, and only store null information in + * the parent ColumnVector. For array type, it stores the actual array elements in the child + * ColumnVector, and store null information, array offsets and lengths in the parent ColumnVector. * - * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values - * in the current batch. + * ColumnVector is expected to be reused during the entire data loading process, to avoid allocating + * memory again and again. + * + * ColumnVector is meant to maximize CPU efficiency but not to minimize storage footprint, + * implementations should prefer computing efficiency over storage efficiency when design the --- End diff -- `implementations ` -> `Implementations `
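The nested layout described in this Javadoc (null info, offsets, and lengths in the parent vector; element data in the child vector) can be illustrated with a toy model. `ToyArrayVector` and its fields are hypothetical names for illustration only, not Spark's actual `ColumnVector` API:

```java
import java.util.Arrays;

// Toy model of the array-type layout described above: the parent vector
// keeps per-row null flags, offsets, and lengths, while the child vector
// stores the flattened element data back to back.
public class ToyArrayVector {
    private final boolean[] isNull;   // parent: null info per row
    private final int[] offsets;      // parent: start of each row's array in the child
    private final int[] lengths;      // parent: element count per row
    private final int[] childData;    // child: all elements, concatenated

    ToyArrayVector(boolean[] isNull, int[] offsets, int[] lengths, int[] childData) {
        this.isNull = isNull;
        this.offsets = offsets;
        this.lengths = lengths;
        this.childData = childData;
    }

    boolean isNullAt(int rowId) { return isNull[rowId]; }

    // Reads row `rowId`'s array by slicing the child data with the
    // parent's offset and length, mirroring the layout in the Javadoc.
    int[] getArray(int rowId) {
        if (isNull[rowId]) return null;
        int start = offsets[rowId];
        return Arrays.copyOfRange(childData, start, start + lengths[rowId]);
    }

    // Three logical rows: [1, 2], null, [3]
    static ToyArrayVector example() {
        return new ToyArrayVector(
            new boolean[] {false, true, false},
            new int[] {0, 0, 2},
            new int[] {2, 0, 1},
            new int[] {1, 2, 3});
    }
}
```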
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159465600 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java --- @@ -248,7 +248,10 @@ public void enableReturningBatches() { * Advances to the next batch of rows. Returns false if there are no more. */ public boolean nextBatch() throws IOException { -columnarBatch.reset(); +for (WritableColumnVector vector : columnVectors) { --- End diff -- Remove the space before `:`
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159470325 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java --- @@ -14,32 +14,39 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataType; import org.apache.spark.sql.types.Decimal; import org.apache.spark.unsafe.types.UTF8String; /** - * This class represents in-memory values of a column and provides the main APIs to access the data. - * It supports all the types and contains get APIs as well as their batched versions. The batched - * versions are considered to be faster and preferable whenever possible. + * An interface representing in-memory columnar data in Spark. This interface defines the main APIs + * to access the data, as well as their batched versions. The batched versions are considered to be + * faster and preferable whenever possible. * - * To handle nested schemas, ColumnVector has two types: Arrays and Structs. In both cases these - * columns have child columns. All of the data are stored in the child columns and the parent column - * only contains nullability. In the case of Arrays, the lengths and offsets are saved in the child - * column and are encoded identically to INTs. + * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values + * in this ColumnVector. * - * Maps are just a special case of a two field struct. + * ColumnVector supports all the data types including nested types. To handle nested types, + * ColumnVector can have children and is a tree structure. For struct type, it stores the actual + * data of each field in the corresponding child ColumnVector, and only store null information in --- End diff -- `store ` -> `stores`
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159470383 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java --- @@ -14,32 +14,39 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataType; import org.apache.spark.sql.types.Decimal; import org.apache.spark.unsafe.types.UTF8String; /** - * This class represents in-memory values of a column and provides the main APIs to access the data. - * It supports all the types and contains get APIs as well as their batched versions. The batched - * versions are considered to be faster and preferable whenever possible. + * An interface representing in-memory columnar data in Spark. This interface defines the main APIs + * to access the data, as well as their batched versions. The batched versions are considered to be + * faster and preferable whenever possible. * - * To handle nested schemas, ColumnVector has two types: Arrays and Structs. In both cases these - * columns have child columns. All of the data are stored in the child columns and the parent column - * only contains nullability. In the case of Arrays, the lengths and offsets are saved in the child - * column and are encoded identically to INTs. + * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values + * in this ColumnVector. * - * Maps are just a special case of a two field struct. + * ColumnVector supports all the data types including nested types. To handle nested types, + * ColumnVector can have children and is a tree structure. For struct type, it stores the actual + * data of each field in the corresponding child ColumnVector, and only store null information in + * the parent ColumnVector. For array type, it stores the actual array elements in the child + * ColumnVector, and store null information, array offsets and lengths in the parent ColumnVector. --- End diff -- `store ` -> `stores`
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159467403 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java --- @@ -14,26 +14,18 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import java.util.*; import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.execution.vectorized.MutableColumnarRow; import org.apache.spark.sql.types.StructType; /** - * This class is the in memory representation of rows as they are streamed through operators. It - * is designed to maximize CPU efficiency and not storage footprint. Since it is expected that - * each operator allocates one of these objects, the storage footprint on the task is negligible. - * - * The layout is a columnar with values encoded in their native format. Each RowBatch contains - * a horizontal partitioning of the data, split into columns. - * - * The ColumnarBatch supports either on heap or offheap modes with (mostly) the identical API. - * - * TODO: - * - There are many TODOs for the existing APIs. They should throw a not implemented exception. - * - Compaction: The batch and columns should be able to compact based on a selection vector. + * This class is a wrapper of multiple ColumnVectors and represents a logical table-like data + * structure. It provides a row-view of this batch so that Spark can access the data row by row. --- End diff -- `row-view` -> `row view`
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159392782 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java --- @@ -14,32 +14,38 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataType; import org.apache.spark.sql.types.Decimal; import org.apache.spark.unsafe.types.UTF8String; /** - * This class represents in-memory values of a column and provides the main APIs to access the data. - * It supports all the types and contains get APIs as well as their batched versions. The batched - * versions are considered to be faster and preferable whenever possible. + * An interface representing in-memory columnar data in Spark. This interface defines the main APIs + * to access the data, as well as their batched versions. The batched versions are considered to be + * faster and preferable whenever possible. * - * To handle nested schemas, ColumnVector has two types: Arrays and Structs. In both cases these - * columns have child columns. All of the data are stored in the child columns and the parent column - * only contains nullability. In the case of Arrays, the lengths and offsets are saved in the child - * column and are encoded identically to INTs. + * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values + * in this ColumnVector. * - * Maps are just a special case of a two field struct. + * ColumnVector supports all the data types including nested types. To handle nested types, + * ColumnVector can have children and is a tree structure. For struct type, it stores the actual --- End diff -- Sorry for my mistake.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159363145 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java --- @@ -14,32 +14,38 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataType; import org.apache.spark.sql.types.Decimal; import org.apache.spark.unsafe.types.UTF8String; /** - * This class represents in-memory values of a column and provides the main APIs to access the data. - * It supports all the types and contains get APIs as well as their batched versions. The batched - * versions are considered to be faster and preferable whenever possible. + * An interface representing in-memory columnar data in Spark. This interface defines the main APIs + * to access the data, as well as their batched versions. The batched versions are considered to be + * faster and preferable whenever possible. * - * To handle nested schemas, ColumnVector has two types: Arrays and Structs. In both cases these - * columns have child columns. All of the data are stored in the child columns and the parent column - * only contains nullability. In the case of Arrays, the lengths and offsets are saved in the child - * column and are encoded identically to INTs. + * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values + * in this ColumnVector. * - * Maps are just a special case of a two field struct. + * ColumnVector supports all the data types including nested types. To handle nested types, + * ColumnVector can have children and is a tree structure. For struct type, it stores the actual --- End diff -- it's already `children`
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159260224 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java --- @@ -586,7 +587,7 @@ public final int appendStruct(boolean isNull) { if (isNull) { appendNull(); for (ColumnVector c: childColumns) { -if (c.type instanceof StructType) { +if (c.dataType() instanceof StructType) { --- End diff -- nit: Which access type will we use for `ColumnVector.type`? `dataType()` or `type`? For example, `OnHeapColumnVector.reserveInternal()` uses `type` while this line uses `dataType()`.
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159248950 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java --- @@ -14,32 +14,38 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataType; import org.apache.spark.sql.types.Decimal; import org.apache.spark.unsafe.types.UTF8String; /** - * This class represents in-memory values of a column and provides the main APIs to access the data. - * It supports all the types and contains get APIs as well as their batched versions. The batched - * versions are considered to be faster and preferable whenever possible. + * An interface representing in-memory columnar data in Spark. This interface defines the main APIs + * to access the data, as well as their batched versions. The batched versions are considered to be + * faster and preferable whenever possible. * - * To handle nested schemas, ColumnVector has two types: Arrays and Structs. In both cases these - * columns have child columns. All of the data are stored in the child columns and the parent column - * only contains nullability. In the case of Arrays, the lengths and offsets are saved in the child - * column and are encoded identically to INTs. + * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values + * in this ColumnVector. * - * Maps are just a special case of a two field struct. + * ColumnVector supports all the data types including nested types. To handle nested types, + * ColumnVector can have children and is a tree structure. For struct type, it stores the actual + * data of each field in the corresponding child ColumnVector, and only store null information in + * the parent ColumnVector. For array type, it stores the actual array elements in the child + * ColumnVector, and store null information, array offsets and lengths in the parent ColumnVector. * - * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values - * in the current batch. + * ColumnVector is expected to be reused during the entire data loading process, to avoid allocating + * memory again and again. + * + * ColumnVector is meant to maximize CPU efficiency and not storage footprint, implementations --- End diff -- nit: not -> not to minimize
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159248670 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java --- @@ -14,32 +14,38 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.vectorized; import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataType; import org.apache.spark.sql.types.Decimal; import org.apache.spark.unsafe.types.UTF8String; /** - * This class represents in-memory values of a column and provides the main APIs to access the data. - * It supports all the types and contains get APIs as well as their batched versions. The batched - * versions are considered to be faster and preferable whenever possible. + * An interface representing in-memory columnar data in Spark. This interface defines the main APIs + * to access the data, as well as their batched versions. The batched versions are considered to be + * faster and preferable whenever possible. * - * To handle nested schemas, ColumnVector has two types: Arrays and Structs. In both cases these - * columns have child columns. All of the data are stored in the child columns and the parent column - * only contains nullability. In the case of Arrays, the lengths and offsets are saved in the child - * column and are encoded identically to INTs. + * Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values + * in this ColumnVector. * - * Maps are just a special case of a two field struct. + * ColumnVector supports all the data types including nested types. To handle nested types, + * ColumnVector can have children and is a tree structure. For struct type, it stores the actual --- End diff -- nit: child -> children
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159227467 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java --- @@ -87,19 +79,7 @@ public void remove() { } /** - * Resets the batch for writing. - */ - public void reset() { -for (int i = 0; i < numCols(); ++i) { - if (columns[i] instanceof WritableColumnVector) { -((WritableColumnVector) columns[i]).reset(); - } -} -this.numRows = 0; --- End diff -- This doesn't matter. The `numRows` is only used when calling `rowsIterator`, and we always call `setNumRows` before calling `rowsIterator`
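cloud-fan's point (that `numRows` is only consulted at iteration time, and `setNumRows` is always called before iterating) can be sketched with a toy batch. `ToyBatch` and its methods are hypothetical illustrations of the contract being discussed, not Spark's actual `ColumnarBatch` API:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Toy batch illustrating the discussion: numRows is never cleared between
// batches; callers are expected to call setNumRows(n) before iterating,
// so the row view always reflects the most recent call.
public class ToyBatch {
    private final int[] column;  // a single toy column of values
    private int numRows = 0;

    ToyBatch(int[] column) { this.column = column; }

    void setNumRows(int numRows) { this.numRows = numRows; }

    // Row-wise view over the columnar data, bounded by numRows.
    Iterator<Integer> rowIterator() {
        return new Iterator<Integer>() {
            private int rowId = 0;
            public boolean hasNext() { return rowId < numRows; }
            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                return column[rowId++];
            }
        };
    }

    static int demo() {
        ToyBatch batch = new ToyBatch(new int[] {10, 20, 30, 40});
        batch.setNumRows(4);  // first load: 4 valid rows
        batch.setNumRows(2);  // next load reuses the vectors: 2 valid rows
        int count = 0;
        for (Iterator<Integer> it = batch.rowIterator(); it.hasNext(); it.next()) {
            count++;
        }
        return count; // 2: a stale numRows never leaks into iteration
    }
}
```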
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159119348 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnVector.java --- @@ -14,32 +14,38 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.sources.v2.vectorized; --- End diff -- +1
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159112727 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java --- @@ -87,19 +79,7 @@ public void remove() { } /** - * Resets the batch for writing. - */ - public void reset() { -for (int i = 0; i < numCols(); ++i) { - if (columns[i] instanceof WritableColumnVector) { -((WritableColumnVector) columns[i]).reset(); - } -} -this.numRows = 0; --- End diff -- This can result in an incorrect `numRows`. Could it be a potential error?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159112274 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnVector.java --- @@ -14,32 +14,38 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.sources.v2.vectorized; --- End diff -- Is the v2 package proper for them? Do ColumnVector and related classes belong to the data source v2 API?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159112087 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java --- @@ -14,26 +14,18 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.spark.sql.execution.vectorized; +package org.apache.spark.sql.sources.v2.vectorized; import java.util.*; import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.execution.vectorized.MutableColumnarRow; import org.apache.spark.sql.types.StructType; /** - * This class is the in memory representation of rows as they are streamed through operators. It - * is designed to maximize CPU efficiency and not storage footprint. Since it is expected that - * each operator allocates one of these objects, the storage footprint on the task is negligible. - * - * The layout is a columnar with values encoded in their native format. Each RowBatch contains - * a horizontal partitioning of the data, split into columns. - * - * The ColumnarBatch supports either on heap or offheap modes with (mostly) the identical API. - * - * TODO: - * - There are many TODOs for the existing APIs. They should throw a not implemented exception. - * - Compaction: The batch and columns should be able to compact based on a selection vector. + * This class is a wrapper of multiple ColumnVectors and represents a table. It provides a row-view --- End diff -- a table -> a portion of a table?
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159081495 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java --- @@ -87,19 +79,7 @@ public void remove() { } /** - * Resets the batch for writing. - */ - public void reset() { --- End diff -- +1
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20116#discussion_r159079805 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java --- @@ -87,19 +79,7 @@ public void remove() { } /** - * Resets the batch for writing. - */ - public void reset() { --- End diff -- remove this as it's for `WritableColumnVector` only
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20116 [SPARK-20960][SQL] make ColumnVector public ## What changes were proposed in this pull request? move `ColumnVector` and related classes to `org.apache.spark.sql.sources.v2.vectorized`, and improve the document. ## How was this patch tested? existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark column-vector Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20116.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20116 commit 7dc4496a00ae97c2f686403092883184a65e6bfb Author: Wenchen Fan Date: 2017-12-29T07:27:25Z make ColumnVector public