[
https://issues.apache.org/jira/browse/DRILL-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080219#comment-17080219
]
ASF GitHub Bot commented on DRILL-7696:
---------------------------------------
paul-rogers commented on pull request #2051: DRILL-7696: EVF v2 scan schema
resolution
URL: https://github.com/apache/drill/pull/2051
# [DRILL-7696](https://issues.apache.org/jira/browse/DRILL-7696): EVF v2
scan schema resolution
## Description
Provides the mechanism to resolve the scan schema from a projection list,
provided schema, early reader schema and actual reader schema.
This is the first of several PRs to implement the "third generation" of the
scan framework. The first was based on `ScanBatch` and primarily focused on
creating a set of vectors. EVF is the second generation: vector management
shifted to the `ResultSetLoader`, and we introduced the idea of a provided
schema and an "early" reader schema (one that is known before any data is
read). EVF evolved to include an ad-hoc schema resolution mechanism in which
half the code worked with schemas while the other half still worked with
vectors. It worked, but was overly complex.
### Scan Schema Resolution
This third generation provides a clean split between schema resolution and
reading data into the schema. This PR focuses on schema resolution. Drill is
unique among query engines in that it uses a _dynamic schema_: column names are
known, but types are "to be named later" by the readers. Over the last couple
of years we've added additional tools to help resolve schema ambiguities that
arise in a dynamic schema. This PR brings all that work together to derive the
scan output schema from a variety of inputs:
* The _projection list_: a set of names sometimes with extra hints such as
map members or array indexes. Evolves the prior
`RequestedTuple`/`RequestedColumn` mechanisms. A column based only on
projection is said to be _dynamic_ and is represented by a new `DynamicColumn`
implementation of `ColumnMetadata`.
* The _provided schema_ which adds type information to the items in the
project list, resulting in columns becoming _resolved_ (both name and type are
known.) As before, provided schemas can be "lenient" or "strict." A wildcard
projection with a strict schema causes the set of columns to become resolved and
fixed before we read any data.
* The _implicit columns_, which resolve a subset of columns independent of
any reader.
* The _early reader schema_ for readers such as JDBC or Parquet where the
reader schema is known before reading data. The early reader schema works much
like the provided schema.
* The _reader output schema_: the actual set of columns read by readers such
as CSV (with headers), JSON, etc.
* The _missing column schema_. The provided schema and early reader schema
both indicate what the reader is capable of reading. The projection list and
the reader output schema represent a (usually proper) subset of those columns
which the user actually wants to read. Drill allows the user to project columns
which do not actually exist in the reader: these become _missing columns_.
Drill makes up a type (and we cross our fingers that it is correct.) As before,
each plugin can define the missing column type it prefers.
The result of the above is the _scan output schema_: the set of columns which
the scan will produce. Once we've worked out the schema, filling the vectors is
just an "implementation detail" handled by the individual readers and the
`ResultSetLoader`.
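
The resolution order described above can be illustrated with a minimal sketch. It is not the code in this PR: `ScanSchemaSketch` and `resolve` are hypothetical names, types are plain strings, and the sketch assumes the provided schema takes precedence over the reader schema. Wildcards, implicit columns, maps and arrays are left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are Drill classes.
final class ScanSchemaSketch {

  // Resolves each projected column name to a type.
  // Assumed precedence: provided schema, then reader schema (early or actual),
  // then the missing-column type chosen by the plugin.
  static Map<String, String> resolve(
      List<String> projection,
      Map<String, String> providedSchema,
      Map<String, String> readerSchema,
      String missingColumnType) {
    Map<String, String> output = new LinkedHashMap<>();
    for (String name : projection) {
      // A projected column starts out dynamic: name known, type unknown.
      String type = providedSchema.get(name);
      if (type == null) {
        type = readerSchema.get(name);
      }
      if (type == null) {
        // Missing column: no schema knows about it, so a type is made up.
        type = missingColumnType;
      }
      output.put(name, type);
    }
    return output;
  }
}
```

With a projection of `a, b, c`, a provided schema that types only `a`, and a reader that supplies `a` and `b`, the sketch takes the type of `a` from the provided schema, the type of `b` from the reader, and gives `c` the missing-column type.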
In EVF1 we had a number of classes which all modelled columns in some
way. A key improvement in this version is that, by adding a dynamic column
type, the `ColumnMetadata`/`TupleMetadata` classes can represent columns
throughout the schema resolution process. (The one exception is handling
top-level columns in a row, which adds a layer of "wrapper" around a column
metadata.)
### Defined Schema
Notice that if we have a strict provided schema, we can actually do the
schema resolution independent of the readers. We end up with a _defined
schema_: one which defines the name and type of each column which the reader is
to produce. In fact, we could compute the defined schema in the planner like
traditional query engines. This code anticipates this idea: it allows the
client to specify a defined schema in place of a projection list. Nothing is
able to use that functionality yet, but it is a step toward a solution.
This PR focuses on schema resolution, which is enough for one PR. Nothing
calls this code yet: the next PR will add V3 of the scan framework based on
this work. After that, each of the existing clients of EVF will migrate to EVF2
as a series of PRs. Finally, we'll remove the EVF1 code which, after all the
conversions, will no longer be used.
Since Drill wants to use a dynamic schema whenever possible, the
implementation allows a defined schema in which some columns are concrete, some
dynamic. This models the idea of a query using a non-strict provided schema
where the schema describes a few "problem child" columns, but does not bother
with those that are unambiguous.
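
As a rough illustration of a defined schema that mixes concrete and dynamic columns, the sketch below models a column as a name plus an optional type. `SketchColumn` and `DefinedSchemaSketch` are hypothetical names invented for this example; the PR itself represents such columns with `ColumnMetadata` and the new `DynamicColumn`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a column; a null type marks a dynamic column.
final class SketchColumn {
  final String name;
  final String type;

  SketchColumn(String name, String type) {
    this.name = name;
    this.type = type;
  }

  boolean isDynamic() {
    return type == null;
  }
}

// Hypothetical helpers over a partially defined schema.
final class DefinedSchemaSketch {

  // True only when every column has a type, so no reader input is needed.
  static boolean isFullyResolved(List<SketchColumn> schema) {
    for (SketchColumn col : schema) {
      if (col.isDynamic()) {
        return false;
      }
    }
    return true;
  }

  // Names of the columns whose types the reader must still supply.
  static List<String> dynamicNames(List<SketchColumn> schema) {
    List<String> names = new ArrayList<>();
    for (SketchColumn col : schema) {
      if (col.isDynamic()) {
        names.add(col.name);
      }
    }
    return names;
  }
}
```

A schema that types only the "problem child" columns is not fully resolved: the remaining dynamic columns wait for the reader to supply their types.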
### Implicit Columns
One convenience of this formalized resolution mechanism is that we can add
functionality to the provided schema which can now include implicit columns by
using a new column property. This lets the user, say, project `dir0` as `year`,
`dir1` as `month` and so on. This PR only works for reading; the aliases don't
yet work for filter push-down. To make this work, the implicit column
definitions are extended with a fixed name for each implicit column. Doing so
ensures that the provided schema produces the same results independent of the
implicit column system/session options. Conversely, a column not marked as
implicit will never be implicit in a provided schema. This means that a table
can use, say, `filename` as a column name without worry of conflict with the
implicit column of the same name.
### Projection Filter
Prior PRs introduced the _projection filter_ mechanism for the
`ResultSetLoader`. To ensure consistent behavior, the schema resolution
mechanism uses the same filter mechanism which was enhanced to provide the
extra information required by schema resolution.
## Documentation
The provided schema can now include implicit columns. Implicit columns can
be any name (not just those used by Drill.) Indicate an implicit column by
adding the `drill.implicit` property to a column. The property takes one of
these values: "fqn", "filepath", "filename", "suffix" or "dirx" where x is a
number starting from 0. See the implicit column definitions for a description
of these columns.
If you include a column that does not have the `drill.implicit` property
set, then it will not be an implicit column, even if it happens to have the
same name as a Drill-defined implicit column.
The combination of these two features means that your provided schema is
completely isolated from implicit column names defined in system/session
properties: your provided schema columns can never collide with an implicit
column name.
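
The two rules above can be sketched as a small check on a column's properties. Only the property name `drill.implicit` and its allowed values come from this description. The class and method names are hypothetical, and both the exact-match comparison and the exception for an unrecognized value are assumptions, not confirmed behavior.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch; not the Drill implementation.
final class ImplicitColumnSketch {
  static final String IMPLICIT_PROP = "drill.implicit";

  // "dirx" where x is a number starting from 0.
  private static final Pattern DIR = Pattern.compile("dir\\d+");

  // Returns the implicit column kind if the column is marked implicit,
  // or empty if the property is absent, whatever the column is named.
  static Optional<String> implicitKind(Map<String, String> columnProps) {
    String value = columnProps.get(IMPLICIT_PROP);
    if (value == null) {
      return Optional.empty();
    }
    switch (value) {
      case "fqn":
      case "filepath":
      case "filename":
      case "suffix":
        return Optional.of(value);
      default:
        if (DIR.matcher(value).matches()) {
          return Optional.of(value);
        }
        // Assumption: an unrecognized value is treated as an error.
        throw new IllegalArgumentException(
            "Unknown implicit column type: " + value);
    }
  }
}
```

Under this sketch, a column named `year` with `drill.implicit` set to `dir0` is implicit, while a column named `filename` with no such property is an ordinary table column.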
## Testing
Includes tests for all new functionality. Migrates (copies of) the relevant
EVF1 tests to ensure all existing functionality continues to work.
> EVF v2 Scan Schema Resolution
> -----------------------------
>
> Key: DRILL-7696
> URL: https://issues.apache.org/jira/browse/DRILL-7696
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.18.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
> Fix For: 1.18.0
>
>
> Revises the mechanism EVF uses to resolve the schema for a scan. See PR for
> details.