[GitHub] [drill] paul-rogers opened a new pull request #2051: DRILL-7696: EVF v2 scan schema resolution

GitBox Thu, 09 Apr 2020 21:35:45 -0700

paul-rogers opened a new pull request #2051: DRILL-7696: EVF v2 scan schema 
resolution
URL: https://github.com/apache/drill/pull/2051
 
 
   # [DRILL-7696](https://issues.apache.org/jira/browse/DRILL-7696): EVF v2 
scan schema resolution
   
   ## Description
   
   Provides the mechanism to resolve the scan schema from a projection list, 
provided schema, early reader schema and actual reader schema.
   
   This is the first of several PRs to implement the "third generation" of the 
scan framework. The first was based on `ScanBatch` and primarily focused on 
creating a set of vectors. EVF is the second generation: vector management 
shifted to the `ResultSetLoader`, and we introduced the idea of a provided 
schema and an "early" reader schema (one known before any data is read). EVF 
evolved to include an ad-hoc schema resolution mechanism in which half the code 
worked with schemas while the other half still worked with vectors. It worked, 
but was overly complex.
   
   ### Scan Schema Resolution
   
   This third generation provides a clean split between schema resolution and 
reading data into the schema. This PR focuses on schema resolution. Drill is 
unique among query engines in that it uses a _dynamic schema_: column names are 
known, but types are "to be named later" by the readers. Over the last couple 
of years we've added additional tools to help resolve schema ambiguities that 
arise in a dynamic schema. This PR brings all that work together to derive the 
scan output schema from a variety of inputs:
   
   * The _projection list_: a set of names sometimes with extra hints such as 
map members or array indexes. Evolves the prior 
`RequestedTuple`/`RequestedColumn` mechanisms. A column based only on 
projection is said to be _dynamic_ and is represented by a new `DynamicColumn` 
implementation of `ColumnMetadata`.
   * The _provided schema_, which adds type information to the items in the 
projection list, resulting in columns becoming _resolved_ (both name and type 
are known). As before, provided schemas can be "lenient" or "strict." A wildcard 
projection with a strict schema causes the set of columns to become resolved and 
fixed before we read any data.
   * The _implicit columns_, which resolve a subset of columns independent of 
any reader.
   * The _early reader schema_ for readers such as JDBC or Parquet where the 
reader schema is known before reading data. The early reader schema works much 
like the provided schema.
   * The _reader output schema_: the actual set of columns read by readers such 
as CSV (with headers), JSON, etc.
   * The _missing column schema_. The provided schema and early reader schema 
both indicate what the reader is capable of reading. The projection list and 
the reader output schema represent a (usually proper) subset of those columns 
which the user actually wants to read. Drill allows the user to project columns 
which do not actually exist in the reader: these become _missing columns_. 
Drill makes up a type (and we cross our fingers that it is correct.) As before, 
each plugin can define the missing column type it prefers.
   
   The result of the above is the _scan output schema_: the set of columns which 
the scan will produce. Once we've worked out the schema, filling the vectors is 
just an "implementation detail" handled by the individual readers and the 
`ResultSetLoader`.
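
   As a rough illustration of how the inputs listed above might combine, here 
is a minimal sketch. It is not the code in this PR: the class 
`ScanSchemaSketch`, the `resolve` method, the use of plain strings for types, 
and the priority order (provided schema, then reader schema, then the missing 
column type) are all assumptions made for the example. Implicit columns are 
left out for brevity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names are Drill classes.
class ScanSchemaSketch {

  // Resolve each projected column name to a type. The lookup order
  // (provided schema, reader schema, missing-column type) is an
  // assumption made for this illustration.
  static Map<String, String> resolve(
      List<String> projection,
      Map<String, String> providedSchema,
      Map<String, String> readerSchema,
      String missingColumnType) {
    // LinkedHashMap preserves the projection order in the output schema.
    Map<String, String> output = new LinkedHashMap<>();
    for (String name : projection) {
      if (providedSchema.containsKey(name)) {
        output.put(name, providedSchema.get(name));
      } else if (readerSchema.containsKey(name)) {
        output.put(name, readerSchema.get(name));
      } else {
        // Projected, but unknown to every source: a missing column.
        output.put(name, missingColumnType);
      }
    }
    return output;
  }

  public static void main(String[] args) {
    Map<String, String> provided = new HashMap<>();
    provided.put("a", "BIGINT");
    Map<String, String> reader = new HashMap<>();
    reader.put("a", "INT");
    reader.put("b", "VARCHAR");
    // Prints {a=BIGINT, b=VARCHAR, c=NULLABLE_INT}
    System.out.println(
        resolve(Arrays.asList("a", "b", "c"), provided, reader, "NULLABLE_INT"));
  }
}
```

   In this sketch the projection list alone decides which columns appear in the 
output; the other inputs only supply types.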
   
   In EVF1 we had a number of classes which all modelled columns in some 
way. A key improvement in this version is that, by adding a dynamic column 
type, the `ColumnMetadata`/`TupleMetadata` classes can represent columns 
throughout the schema resolution process. (The one exception is handling 
top-level columns in a row, which adds a layer of "wrapper" around a column 
metadata.)
   
   ### Defined Schema
   
   Notice that if we have a strict provided schema, we can actually do the 
schema resolution independent of the readers. We end up with a _defined 
schema_: one which defines the name and type of each column which the reader is 
to produce. In fact, we could compute the defined schema in the planner, like 
traditional query engines. This code anticipates this idea: it allows the 
client to specify a defined schema in place of a projection list. Nothing 
uses that functionality yet, but it is a step toward a solution.
   
   This PR focuses on schema resolution, which is enough for one PR. Nothing 
calls this code yet: the next PR will add V3 of the scan framework based on 
this work. After that, each of the existing clients of EVF will migrate to EVF2 
as a series of PRs. Finally, we'll remove the EVF1 code which, after all the 
conversions, will no longer be used.
   
   Since Drill wants to use a dynamic schema whenever possible, the 
implementation allows a defined schema in which some columns are concrete, some 
dynamic. This models the idea of a query using a non-strict provided schema 
where the schema describes a few "problem child" columns, but does not bother 
with those that are unambiguous.
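
   The mix of concrete and dynamic columns can be pictured with a small sketch. 
This is an illustration only: `DefinedSchemaSketch`, `Col`, `define` and 
`isFullyResolved` are hypothetical names (the PR itself uses a `DynamicColumn` 
implementation of `ColumnMetadata`), and a `null` type stands in for "to be 
supplied by the reader."

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these are not the classes added by this PR.
class DefinedSchemaSketch {

  static final class Col {
    final String name;
    final String type; // null marks a dynamic column (type not yet known)

    Col(String name, String type) {
      this.name = name;
      this.type = type;
    }

    boolean isDynamic() {
      return type == null;
    }

    @Override
    public String toString() {
      return name + ":" + (isDynamic() ? "DYNAMIC" : type);
    }
  }

  // Columns described by the provided schema become concrete;
  // all other projected columns stay dynamic.
  static List<Col> define(List<String> projection, Map<String, String> provided) {
    List<Col> cols = new ArrayList<>();
    for (String name : projection) {
      cols.add(new Col(name, provided.get(name)));
    }
    return cols;
  }

  // True only when every column has a type, that is, when the schema
  // could be worked out without consulting any reader.
  static boolean isFullyResolved(List<Col> cols) {
    for (Col col : cols) {
      if (col.isDynamic()) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> provided = new HashMap<>();
    provided.put("ts", "TIMESTAMP"); // the one "problem child" column
    List<Col> cols = define(Arrays.asList("id", "ts", "note"), provided);
    System.out.println(cols);                  // [id:DYNAMIC, ts:TIMESTAMP, note:DYNAMIC]
    System.out.println(isFullyResolved(cols)); // false
  }
}
```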
   
   ### Implicit Columns
   
   One convenience of this formalized resolution mechanism is that we can add 
functionality to the provided schema which can now include implicit columns by 
using a new column property. This lets the user, say, project `dir0` as `year`, 
`dir1` as `month` and so on. This PR only works for reading; the aliases don't 
yet work for filter push-down. To make this work, the implicit column 
definitions are extended with a fixed name for each implicit column. Doing so 
ensures that the provided schema produces the same results independent of the 
implicit column system/session options. Conversely, a column not marked as 
implicit will never be implicit in a provided schema. This means that a table 
can use, say, `filename` as a column name without worry of conflict with the 
implicit column of the same name.
   
   ### Projection Filter
   
   Prior PRs introduced the _projection filter_ mechanism for the 
`ResultSetLoader`. To ensure consistent behavior, the schema resolution 
mechanism uses the same filter mechanism, which was enhanced to provide the 
extra information required by schema resolution.
   
   ## Documentation
   
   The provided schema can now include implicit columns. Implicit columns can 
have any name (not just those used by Drill). Indicate an implicit column by 
adding the `drill.implicit` property to a column. The property takes one of 
these values: "fqn", "filepath", "filename", "suffix" or "dirx", where x is a 
number starting from 0. See the implicit column definitions for a description 
of these columns.
   
   If you include a column that does not have the `drill.implicit` property 
set, then it will not be an implicit column, even if it happens to have the 
same name as a Drill-defined implicit column.
   
   The combination of these two features means that your provided schema is 
completely isolated from implicit column names defined in system/session 
properties: your provided schema columns can never collide with an implicit 
column name.
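
   The rule can be summarized in a short sketch. Only the property name 
`drill.implicit` and its listed values come from the description above; the 
class `ImplicitPropertySketch` and its methods are hypothetical, and the 
choices to match case-sensitively and to treat an unrecognized value as "not 
implicit" are assumptions of the example, not documented behavior.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the implementation in this PR.
class ImplicitPropertySketch {

  static final String IMPLICIT_PROP = "drill.implicit";

  // "dirx", where x is a number starting from 0.
  private static final Pattern DIR = Pattern.compile("dir(\\d{1,4})");

  // True if the value is one of the values listed in the documentation.
  static boolean isValidImplicitValue(String value) {
    if (value == null) {
      return false;
    }
    switch (value) {
      case "fqn":
      case "filepath":
      case "filename":
      case "suffix":
        return true;
      default:
        return DIR.matcher(value).matches();
    }
  }

  // Directory level for a "dirx" value, or -1 for anything else.
  static int dirLevel(String value) {
    if (value == null) {
      return -1;
    }
    Matcher m = DIR.matcher(value);
    return m.matches() ? Integer.parseInt(m.group(1)) : -1;
  }

  // Only the property decides; the column's own name plays no part.
  static boolean isImplicit(Map<String, String> columnProperties) {
    return isValidImplicitValue(columnProperties.get(IMPLICIT_PROP));
  }

  public static void main(String[] args) {
    // A column named "year" marked as the first partition directory.
    Map<String, String> yearProps = new HashMap<>();
    yearProps.put(IMPLICIT_PROP, "dir0");
    System.out.println(isImplicit(yearProps));          // true
    System.out.println(dirLevel("dir0"));               // 0

    // A column named "filename" with no property is an ordinary column.
    Map<String, String> filenameProps = new HashMap<>();
    System.out.println(isImplicit(filenameProps));      // false
  }
}
```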
    
   ## Testing
   
   Includes tests for all new functionality. Migrates (copies of) the relevant 
EVF1 tests to ensure all existing functionality continues to work.


