[GitHub] [iceberg] mayursrivastava commented on a change in pull request #2286: Add Arrow vectorized reader

GitBox Wed, 04 Aug 2021 05:39:46 -0700


mayursrivastava commented on a change in pull request #2286:
URL: https://github.com/apache/iceberg/pull/2286#discussion_r682575840




##########
File path: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ColumnVector.java
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.arrow.vectorized;
+
+import org.apache.arrow.vector.FieldVector;
+import org.apache.iceberg.types.Types;
+
+/**
+ * This class is inspired by Spark's {@code ColumnVector}.
+ * This class represents the column data for an Iceberg table query.
+ * It wraps an arrow {@link FieldVector} and provides simple
+ * accessors for the row values. Advanced users can access
+ * the {@link FieldVector}.
+ * <p>
+ *   Supported Iceberg data types:
+ *   <ul>
+ *     <li>{@link Types.BooleanType}</li>
+ *     <li>{@link Types.IntegerType}</li>
+ *     <li>{@link Types.LongType}</li>
+ *     <li>{@link Types.FloatType}</li>
+ *     <li>{@link Types.DoubleType}</li>
+ *     <li>{@link Types.StringType}</li>
+ *     <li>{@link Types.BinaryType}</li>
+ *     <li>{@link Types.TimestampType} (with and without timezone)</li>
+ *     <li>{@link Types.DateType}</li>
+ *   </ul>
+ */
+public class ColumnVector implements AutoCloseable {

Review comment:
   My use case is to use the Arrow VectorSchemaRoot directly, but I agree with
   @rymurr's suggestion to wrap the Arrow data structures.

   I see the following benefits of the wrapper interface:
   1. Lifecycle management is better (the wrapper implements AutoCloseable).
   2. The current Parquet reader returns the physical representation of the
      data as Arrow vectors. This means that dictionary-encoded columns are
      returned as int32, and columns that were widened are returned at the
      physical file column width (e.g. if int32 was widened to int64 and the
      data file contains int32, the Arrow vector is int32). The wrapper
      classes can handle dictionary encoding and type widening correctly.
      Note that this is also done in the Spark version. This is, however, a
      limitation of the current implementation, and it would be better for
      the Arrow reader to return Arrow vectors with logical types.
   3. The wrapper interface is easier to use for most users.

   The cons are as follows:
   1. A new API is introduced that has to be learned and maintained.

   I also don't feel strongly about the wrapper interface; if the community
   doesn't find it a useful abstraction, I am fine with dropping it.
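
   To make benefit 2 concrete, here is a minimal sketch of how a wrapper could
   hide dictionary encoding and int32-to-int64 widening behind a single
   logical accessor. It is deliberately not the code in this PR and does not
   use the Arrow API: `PhysicalIntVector` and `LogicalLongColumn` are
   hypothetical names, and the int array stands in for an Arrow int32 vector.

```java
// Hypothetical stand-in for a physical int32 vector, as a file reader might return it.
// This is NOT an Arrow class; it only models "values plus a lifecycle".
final class PhysicalIntVector implements AutoCloseable {
  private int[] values;

  PhysicalIntVector(int... values) {
    this.values = values.clone();
  }

  int get(int row) {
    if (values == null) {
      throw new IllegalStateException("vector is closed");
    }
    return values[row];
  }

  boolean isClosed() {
    return values == null;
  }

  @Override
  public void close() {
    values = null; // models releasing the underlying buffer
  }
}

// Hypothetical wrapper that exposes the logical (int64) values of a column
// whose physical storage is int32, either plain or dictionary-encoded.
final class LogicalLongColumn implements AutoCloseable {
  private final PhysicalIntVector physical;
  private final long[] dictionary; // null when the column is not dictionary-encoded

  private LogicalLongColumn(PhysicalIntVector physical, long[] dictionary) {
    this.physical = physical;
    this.dictionary = dictionary;
  }

  // The column was widened from int32 to int64 in the table schema,
  // but the data file still stores int32 values.
  static LogicalLongColumn widened(PhysicalIntVector physical) {
    return new LogicalLongColumn(physical, null);
  }

  // The physical vector holds int32 dictionary ids; the dictionary holds the values.
  static LogicalLongColumn dictionaryEncoded(PhysicalIntVector ids, long[] dictionary) {
    return new LogicalLongColumn(ids, dictionary.clone());
  }

  long getLong(int row) {
    int raw = physical.get(row);
    // Plain column: widen the int32 value. Encoded column: look up the id.
    return dictionary == null ? raw : dictionary[raw];
  }

  @Override
  public void close() {
    physical.close(); // the wrapper owns the physical vector's lifecycle
  }
}
```

   With this shape, a caller reads `getLong(row)` inside a try-with-resources
   block and never needs to know whether the file stored int32 values or
   dictionary ids, which also illustrates the lifecycle point in benefit 1.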



