paul-rogers commented on a change in pull request #1618: DRILL-6950: Row 
set-based scan framework
URL: https://github.com/apache/drill/pull/1618#discussion_r251219282
 
 

 ##########
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/framework/SchemaNegotiator.java
 ##########
 @@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.scan.framework;
+
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+
+/**
+ * Negotiates the table schema with the scanner framework and provides
+ * context information for the reader. In a typical scan, the physical
+ * plan provides the project list: the set of columns that the query
+ * expects. Readers provide a table schema: the set of columns actually
+ * available. The scan framework combines the two lists to determine
+ * the available table columns that must be read, along with any additional
+ * columns to be added. Additional columns can be file metadata (if the storage
+ * plugin requests them), or can be null columns added for projected
+ * columns that don't actually exist in the table.
+ * <p>
+ * The reader provides the table schema in one of two ways:
+ * <ul>
+ * <li>If the reader is of "early schema" type, then the reader calls
+ * {@link #setTableSchema(TupleMetadata)} to provide that schema.</li>
+ * <li>If the reader is of "late schema" type, then the reader discovers
+ * the schema as the data is read, calling the
+ * {@link RowSetLoader#addColumn()} method to add each column as it is
+ * discovered.</li>
+ * </ul>
+ * <p>
+ * Either way, the project list from the physical plan determines which
+ * table columns are materialized and which are not. Writers are provided
+ * for all table columns, for readers that must read sequentially, but
+ * only the materialized columns are actually written to value vectors.
+ * <p>
+ * Regardless of the schema type, the result of building the schema is a
+ * result set loader used to prepare batches for use in the query. The reader
+ * can simply read all columns, allowing the framework to discard unwanted
+ * values. Or, for efficiency, the reader can check the column metadata to
+ * determine whether a column is projected and, if not, skip reading that
+ * column from the input source.
+ */
+
+public interface SchemaNegotiator {
+
+  OperatorContext context();
+
+  /**
+   * Specify the type of table schema. Required only in the obscure
+   * case of an early-schema table with an empty schema, else inferred.
+   * (Set to {@link TableSchemaType#EARLY} if no columns provided, or
+   * to {@link TableSchemaType#LATE} if at least one column is provided.)
+   * @param type the table schema type
+   */
+
+  void setTableSchema(TupleMetadata schema);
 
 Review comment:
   Argh... The comment is left over from a now-deleted method.
   
   The semantics are that an early-schema reader can specify the schema on 
open. If that occurs, then the scan operator can return that schema without 
having to actually read any rows for the "fast schema" path. (Do we still have 
that path? Wasn't entirely clear after Jinfeng's empty batches changes.)
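   
   To illustrate, the early-schema path from the reader's side looks 
roughly like this (a sketch only: the reader class and CSV column names 
are invented, and I'm assuming the negotiator's `build()` hands back the 
`ResultSetLoader`, as the import in this file suggests):
   
```java
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.common.types.Types;
import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.record.metadata.TupleSchema;

// Hypothetical early-schema reader: the column set is known at open
// time (say, from a CSV header line), before any rows are read.
public class CsvReaderSketch {
  private ResultSetLoader loader;

  public boolean open(SchemaNegotiator negotiator) {
    TupleSchema schema = new TupleSchema();
    schema.add(MaterializedField.create("name",
        Types.required(MinorType.VARCHAR)));
    schema.add(MaterializedField.create("age",
        Types.optional(MinorType.INT)));

    // Declare the schema up front so the scan operator can return it
    // without reading any rows (the "fast schema" path).
    negotiator.setTableSchema(schema);

    // Assumed: build() returns the ResultSetLoader used to fill batches.
    loader = negotiator.build();
    return true;
  }
}
```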
   
   The classic examples of early schema are Parquet, CSV (with or without 
headers), and JDBC.
   
   If a table does not know its schema, then it is late-schema and can simply 
add columns as needed later. The classic example is JSON.
   
   Columns are added simply by creating them directly in the tuple writer handed 
to the reader. The framework merges those new columns with any declared up 
front. More precisely, the schema here is used to create the initial set of 
vectors in the tuple writer; the reader can add more later.
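   
   In code, the late-schema flow is roughly this (again a sketch: the 
reader class and column name are invented; `addColumn()`, `start()`, 
`save()`, and `isFull()` are the row-set writer calls as I understand 
them):
   
```java
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.common.types.Types;
import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
import org.apache.drill.exec.physical.rowSet.RowSetLoader;
import org.apache.drill.exec.record.MaterializedField;

// Hypothetical late-schema reader (the JSON case): columns appear as
// the data is read.
public class LateSchemaSketch {

  public void readBatch(ResultSetLoader loader) {
    RowSetLoader writer = loader.writer();

    // A field we've never seen before: create it directly in the tuple
    // writer. The framework merges it with any columns declared up front.
    int commentIdx = writer.addColumn(MaterializedField.create(
        "comment", Types.optional(MinorType.VARCHAR)));

    while (!writer.isFull() && hasNext()) {
      writer.start();
      writer.scalar(commentIdx).setString(nextComment());
      writer.save();
    }
  }

  // Stand-ins for the actual input source.
  private boolean hasNext() { return false; }
  private String nextComment() { return ""; }
}
```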
   
   The mechanism allows a hybrid: declare a partial schema up front, fill in 
details later. Can't think of an example, but it fell out of the 
implementation, so might as well support and test the hybrid case.
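   
   For completeness, the hybrid would look something like this (same 
assumptions and invented names as the sketches above):
   
```java
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.common.types.Types;
import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
import org.apache.drill.exec.physical.rowSet.RowSetLoader;
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.record.metadata.TupleSchema;

// Hypothetical hybrid reader: part of the schema is declared at open
// time, the rest is filled in during the read.
public class HybridReaderSketch {
  private ResultSetLoader loader;

  public boolean open(SchemaNegotiator negotiator) {
    // Declare the columns known up front.
    TupleSchema partial = new TupleSchema();
    partial.add(MaterializedField.create("id",
        Types.required(MinorType.BIGINT)));
    negotiator.setTableSchema(partial);
    loader = negotiator.build();
    return true;
  }

  public void readRow() {
    RowSetLoader writer = loader.writer();
    writer.start();
    // Fill in a detail discovered mid-read, just as in the pure
    // late-schema case.
    int extraIdx = writer.addColumn(MaterializedField.create(
        "extra", Types.optional(MinorType.VARCHAR)));
    writer.scalar("id").setLong(1L);
    writer.scalar(extraIdx).setString("discovered later");
    writer.save();
  }
}
```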
