paul-rogers opened a new pull request #2051: DRILL-7696: EVF v2 scan schema resolution
URL: https://github.com/apache/drill/pull/2051

# [DRILL-7696](https://issues.apache.org/jira/browse/DRILL-7696): EVF v2 scan schema resolution

## Description

Provides the mechanism to resolve the scan schema from a projection list, provided schema, early reader schema and actual reader schema.

This is the first of several PRs to implement the "third generation" of the scan framework. The first generation was based on `ScanBatch` and primarily focused on creating a set of vectors. EVF is the second generation: vector management shifted to the `ResultSetLoader`, and we introduced the idea of a provided schema and an "early" reader schema (one known before any data is read). EVF evolved to include an ad-hoc schema resolution mechanism, half of which worked with schemas while the other half still worked with vectors. It worked, but was overly complex.

### Scan Schema Resolution

This third generation provides a clean split between schema resolution and reading data into the schema. This PR focuses on schema resolution.

Drill is unique among query engines in that it uses a _dynamic schema_: column names are known, but types are "to be named later" by the readers. Over the last couple of years we've added additional tools to help resolve schema ambiguities that arise in a dynamic schema. This PR brings all that work together to derive the scan output schema from a variety of inputs:

* The _projection list_: a set of names, sometimes with extra hints such as map members or array indexes. Evolves the prior `RequestedTuple`/`RequestedColumn` mechanisms. A column based only on projection is said to be _dynamic_ and is represented by a new `DynamicColumn` implementation of `ColumnMetadata`.
* The _provided schema_, which adds type information to the items in the projection list, resulting in columns becoming _resolved_ (both name and type are known). As before, provided schemas can be "lenient" or "strict."
  A wildcard projection with a strict schema causes the set of columns to become resolved and fixed before we read any data.
* The _implicit columns_, which resolve a subset of columns independent of any reader.
* The _early reader schema_, for readers such as JDBC or Parquet where the reader schema is known before reading data. The early reader schema works much like the provided schema.
* The _reader output schema_: the actual set of columns read by readers such as CSV (with headers), JSON, etc.
* The _missing column schema_. The provided schema and early reader schema both indicate what the reader is capable of reading. The projection list and the reader output schema represent a (usually proper) subset of those columns which the user actually wants to read. Drill allows the user to project columns which do not actually exist in the reader: these become _missing columns_. Drill makes up a type (and we cross our fingers that it is correct). As before, each plugin can define the missing column type it prefers.

The result of the above is the _scan output schema_: the set of columns which the scan will produce. Once we've worked out the schema, filling the vectors is just an "implementation detail" handled by the individual readers and the `ResultSetLoader`.

In EVF1 we had a number of classes which all modelled columns in some way. A key improvement in this version is that, by adding a dynamic column type, the `ColumnMetadata`/`TupleMetadata` classes can represent columns throughout the schema resolution process. (The one exception is handling top-level columns in a row, which adds a layer of "wrapper" around a column metadata.)

### Defined Schema

Notice that if we have a strict provided schema, we can actually do the schema resolution independent of the readers. We end up with a _defined schema_: one which defines the name and type of each column which the reader is to produce.
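As a rough illustration of the point above, the following Python sketch models how a projection list and a provided schema could combine into a defined schema without consulting any reader. It is a simplified model, not Drill's Java API: the `Column` class, the `resolve` helper, the type strings and the `"nullable INT"` missing-column default are all hypothetical names chosen for the example. It also assumes that, under a strict schema, a projected column absent from the provided schema is treated as a missing column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """A scan schema column. col_type=None models a dynamic column."""
    name: str
    col_type: Optional[str] = None

    @property
    def is_resolved(self) -> bool:
        # Resolved means both the name and the type are known.
        return self.col_type is not None


def resolve(projection, provided, strict, missing_type="nullable INT"):
    """Hypothetical helper: resolve a projection list against a provided schema.

    Returns (columns, defined). `defined` is True when every column is
    resolved and the column set is fixed, so no reader input is needed.
    """
    if projection == ["*"]:
        # Wildcard: take the provided columns. Only a strict schema fixes
        # the column set; a lenient one lets the reader add more columns.
        columns = [Column(name, t) for name, t in provided.items()]
        return columns, strict
    columns = []
    for name in projection:
        if name in provided:
            # The provided schema supplies the type: column is resolved.
            columns.append(Column(name, provided[name]))
        elif strict:
            # Assumption: a strict schema says the reader cannot supply
            # this column, so it is missing and gets a made-up type.
            columns.append(Column(name, missing_type))
        else:
            # Lenient schema: stay dynamic until a reader supplies a type.
            columns.append(Column(name))
    return columns, all(c.is_resolved for c in columns)


provided = {"a": "INT", "b": "VARCHAR"}
strict_cols, strict_defined = resolve(["a", "c"], provided, strict=True)
lenient_cols, lenient_defined = resolve(["a", "c"], provided, strict=False)
```

With the strict schema, every projected column ends up resolved and the result is a defined schema. With the lenient schema, column `c` stays dynamic and must wait for a reader.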
In fact, we could compute the defined schema in the planner like traditional query engines. This code anticipates this idea: it allows the client to specify a defined schema in place of a projection list. Nothing uses that functionality yet, but it is a step toward a solution.

This PR focuses on schema resolution, which is enough for one PR. Nothing calls this code yet: the next PR will add V3 of the scan framework based on this work. After that, each of the existing clients of EVF will migrate to EVF2 as a series of PRs. Finally, we'll remove the EVF1 code which, after all the conversions, will no longer be used.

Since Drill wants to use a dynamic schema whenever possible, the implementation allows a defined schema in which some columns are concrete and some are dynamic. This models the idea of a query using a non-strict provided schema where the schema describes a few "problem child" columns, but does not bother with those that are unambiguous.

### Implicit Columns

One convenience of this formalized resolution mechanism is that we can add functionality to the provided schema, which can now include implicit columns by using a new column property. This lets the user, say, project `dir0` as `year`, `dir1` as `month` and so on. This PR only works for reading; the aliases don't yet work for filter push-down.

To make this work, the implicit column definitions are extended with a fixed name for each implicit column. Doing so ensures that the provided schema produces the same results independent of the implicit column system/session options.

Conversely, a column not marked as implicit will never be implicit in a provided schema. This means that a table can use, say, `filename` as a column name without worry of conflict with the implicit column of the same name.

### Projection Filter

Prior PRs introduced the _projection filter_ mechanism for the `ResultSetLoader`.
To ensure consistent behavior, the schema resolution mechanism uses the same filter mechanism, which was enhanced to provide the extra information required by schema resolution.

## Documentation

The provided schema can now include implicit columns. Implicit columns can have any name (not just those used by Drill). Indicate an implicit column by adding the `drill.implicit` property to a column. The property takes one of these values: "fqn", "filepath", "filename", "suffix" or "dirx", where x is a number starting from 0. See the implicit column definitions for a description of these columns.

If you include a column that does not have the `drill.implicit` property set, then it will not be an implicit column, even if it happens to have the same name as a Drill-defined implicit column.

The combination of these two features means that your provided schema is completely isolated from implicit column names defined in system/session properties: your provided schema columns can never collide with an implicit column name.

## Testing

Includes tests for all new functionality. Migrates (copies of) the relevant EVF1 tests to ensure all existing functionality continues to work.
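To make the `drill.implicit` rule from the Documentation section concrete, here is a minimal Python sketch of the decision it describes. It models the rule only and is not Drill's implementation: the `implicit_kind` function and the plain dict used for column properties are hypothetical. Only the property name and its allowed values come from the description above.

```python
import re
from typing import Optional

# Property name and allowed values as listed in the Documentation section.
IMPLICIT_PROP = "drill.implicit"
_FIXED_KINDS = {"fqn", "filepath", "filename", "suffix"}
_DIR_KIND = re.compile(r"dir\d+")  # "dirx", where x is a number from 0


def implicit_kind(column_name: str, properties: dict) -> Optional[str]:
    """Hypothetical helper: return the implicit column kind, or None.

    The decision depends only on the property, never on the column name,
    so a provided-schema column called "filename" without the property
    is an ordinary table column.
    """
    value = properties.get(IMPLICIT_PROP)
    if value is None:
        return None
    if value in _FIXED_KINDS or _DIR_KIND.fullmatch(value):
        return value
    raise ValueError(
        f"Column {column_name!r}: unsupported {IMPLICIT_PROP} value {value!r}"
    )


# A column named "year" that maps to the first partition directory.
year_kind = implicit_kind("year", {IMPLICIT_PROP: "dir0"})
# A regular column that merely shares a name with an implicit column.
plain_kind = implicit_kind("filename", {})
```

Here `year` resolves to the `dir0` implicit column, while the unmarked `filename` column remains a regular column.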
