GitHub user paul-rogers opened a pull request:
https://github.com/apache/drill/pull/914
DRILL-5657: Size-aware vector writer structure
This large PR provides another two levels of foundation for size-aware
vector writers in the Drill record readers. It combines code from two previous
PRs:
* PR 866 - DRILL-5657: Implement size-aware result set loader
* PR 887 - DRILL-5688: Add repeated map support to column accessors
The PR then goes on to integrate the two prior PRs and provide additional
functionality.
Like the two previous PRs, this one is divided into commits to group the
work.
1. Accessor layer
2. Row Set layer
3. Tuple and Column Model layer
4. Row Set Loader layer
5. Secondary changes
Much of the material below appears in Javadoc throughout the code. The
material here is not meant to replace that documentation. Instead, it is meant
to provide the "big picture": placing the bits and pieces in context and
pointing out interesting functionality to explore in each layer.
## Commit 1: The Accessor Layer
The first commit provides the core of the mechanism: the writers that put
data into vectors, and the readers that retrieve that data. The version here is
an evolution of the version provided in an earlier PR a few months ago.
### Overview of the Drill Vector Data Model
The code will make much more sense if we start with a review of Drill's
complex vector data model. Drill has 38+ data ("minor") types as defined in
the [protobuf](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/Types.proto)
definition. Drill also has three cardinalities ("modes") defined in the
same file. The result is over 120 different vector types. Then, when you add
maps, repeated maps, lists and repeated lists, you rapidly get an explosion of
types that the writer code must handle.
Vectors can be categorized along multiple dimensions:
* By data (minor) type
* By cardinality (mode)
* By fixed or variable width
A repeated map, a list, a repeated list, and any array (repeated) scalar
are all array-like. Nullable and required modes are identical (single values),
but a nullable type has an additional is-set ("bit") vector.
A key contribution of this PR is the data model used to organize vectors.
* Both the top-level row and a Drill map are "tuples" and are treated
similarly in the model.
* All non-map, non-list (that is, scalar) data types are treated uniformly.
* All arrays (whether a list, a repeated list, a repeated map, or a
repeated scalar) are treated uniformly.
### Accessor Data Model
The above leads to a very simple, JSON-like data model, introduced in this
PR.
* A tuple reader or writer models a row. (Usually via a subclass.) Columns
are accessible by name or position.
* Every column is modeled as an object.
* The object can have an object type: scalar, tuple or array.
* An array has a single element type (but many run-time elements).
* A scalar can be nullable or not, and provides a uniform get/set interface.
This data model is similar to, but has important differences from, the
prior, generated readers and writers.
The object layer is new: it is the simplest way to model the three
"object types." An app using this code would use just the leaf scalar
readers and writers.
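For example, a client might navigate the model roughly as follows. This is a
sketch only: the `scalar(int)` calls mirror the loader example later in this
description, while the by-name and map/array accessors are assumptions about
the API, not the exact method names.
```
// Sketch of the JSON-like accessor model; by-name, tuple() and array()
// accessors shown here are assumptions, not necessarily the exact API.
TupleWriter row = ...                       // a row (or a map) is a tuple
row.scalar(0).setInt(10);                   // scalar column by position
row.scalar("name").setString("fred");       // or by name
TupleWriter addr = row.tuple("address");    // a map column is a nested tuple
addr.scalar("city").setString("Santa Clara");
ArrayWriter tags = row.array("tags");       // an array has a single element type
tags.scalar().setString("x");               // write one array element
```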
Although there is quite a bit of code change here to provide the new
structure, the core functionality of reading and writing to vectors has not
changed much. And, this code has extensive unit tests, which should avoid the
need to "mentally execute" each line of code.
See the classes in `org.apache.drill.exec.vector.accessor` for details. In
particular, please see the `package-info.java` file in that package for more
information.
As before, the top package provides a set of interfaces; the inner packages
provide the implementation. The `ColumnAccessors.java` template generates the
per-vector code. Warning: this template has become quite cryptic: the best bet
for review is to generate the Java code and review that.
### Writer Performance
During previous review, we discussed ways to optimize writer performance.
This PR has two improvements:
* Completely rework the writers to minimize code steps
* Rework the "column loaders" to eliminate them: instead of two
additional method calls, the "loader" now uses the column writers directly.
Much behind-the-scenes rework was needed to accomplish the above.
Column readers, however, were left with their existing structure; we can
apply the same optimizations to the readers in a later PR.
While doing the performance optimization, it became clear we can adopt a
major simplification. Writers evolved from having three versions (nullable,
required, repeated) to having a single version to handle all three cases. Some
items of note:
* The writers bypass DrillBuf and the UDLE to do the needed writes to direct
memory.
* The writers cache the buffer address and implement a number of methods
to synchronize that address when the buffer changes (on a new batch or during
vector resize).
* Writing requires a single bounds check. In most cases, the write is within
bounds so the single check is all that is needed. (See the sketch after this
list.)
* If the write is out of bounds, then the writer determines the new vector
size and performs the needed reallocation. To avoid multiple doublings, the
writer computes the needed new size and allocates that size directly.
* Vector reallocation is improved to eliminate zeroing the new half of the
buffer; data is left "garbage-filled."
* If the vector would grow beyond 16 MB, then overflow is triggered, via a
listener, which causes the buffer to be replaced. The write then continues.
* Offset vector updates are integrated into the writers using an
`OffsetVectorWriter`. This writer caches the last write position so that each
array write needs a single offset update, rather than the read and write as in
previous code.
* The writers keep track of the "last write position" and perform
"fill-empties" work if the new write position is more than one position
behind the last write. All types now correctly support "fill-empties"
(before, only nullable types did so reliably).
* Null handling is done by an additional writer layer that wraps the
underlying data writer. This avoids the need for a special nullable writer: the
same nullable layer works for all data types.
* Array handling is done similarly: an array writer manages the offset
vector and works the same for repeated scalars, repeated maps and (eventually)
lists and repeated lists.
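As a rough illustration of the points above, the hot path of a fixed-width
"set" method looks something like the following. This is simplified pseudo-Java;
the names (`resizeOrOverflow`, `fillEmpties`, `bufAddr`, etc.) are illustrative,
not the actual code.
```
// Illustrative sketch of a fixed-width setInt(); not the actual implementation.
public void setInt(int value) {
  final int offset = writeIndex * VALUE_WIDTH;
  if (offset + VALUE_WIDTH > capacity) {
    // Slow path: compute the needed size and reallocate once (no repeated
    // doubling), or trigger the overflow listener at the 16 MB limit.
    resizeOrOverflow(offset + VALUE_WIDTH);
  }
  fillEmpties(writeIndex);            // back-fill rows skipped since the last write
  PlatformDependent.putInt(bufAddr + offset, value);   // direct memory write
  lastWriteIndex = writeIndex;
}
```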
Since vector overflow is now incorporated directly into the writers, this
PR backs out the `setScalar()` and related methods added to value vectors in a
previous commit.
### Supporting Classes
A new `TupleMetadata` class is a superset of the existing `BatchSchema`,
but also provides "classic" tuple-like access by position or name. Eventually,
this will also hold additional information such as actual size and so on
(information now laboriously rediscovered by the "record batch sizer.") Since
the accessors use a "tuple" abstraction to model both rows and maps, the tuple
metadata provides this same view. The top-most tuple models the row. Columns
within the row can be maps, which have their own tuple schema, and so on.
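A brief sketch of the intended access pattern follows; the method names here
are assumptions based on the description above, not necessarily the exact API.
```
// Illustrative only: metadata access by name or position, including a
// nested map schema. Method names are assumptions.
TupleMetadata rowSchema = ...                     // the top-most tuple models the row
ColumnMetadata idCol = rowSchema.metadata("id");  // by name
ColumnMetadata addrCol = rowSchema.metadata(2);   // or by position
if (addrCol.isMap()) {
  TupleMetadata addrSchema = addrCol.mapSchema(); // a map carries its own tuple schema
}
```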
`TupleNameSpace` moves locations (so it can be used in the vector package)
but otherwise remains unchanged.
`DrillBuf` provides an experimental `putInt()` method that does bounds
checking and sets a value, to minimize calls. This will probably move into the
writer in a later PR.
This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along
Decimal scale and precision. See `RepeatedValueVectors.java`.
`MaterializedField` changes to add an `isEquivalent()` method to compare
two fields, ignoring internal (`$offset$`, `$bits$`, etc.) vectors.
## Commit 2: Row Set Layer
The "Row Set" family of classes allows tests to quickly build and
analyze vector containers in the form of a "row set" - a set of rows. In
prior commits, the row set abstraction was the primary client of the accessors
(readers and writers). In this commit, much functionality shifts to be shared
with the new "loader" abstraction intended for production code.
Key changes in response to the accessor changes:
* The reader and writer are moved to separate files.
* Row sets now use a common "model" to describe a vector tree (more
below).
* Static factory methods were added to hide constructor complexity.
* The `RowSetBuilder` and `RowSetComparison` test tools added support for
repeated maps.
* Code to handle generic object writing moved from the `RowSetBuilder` into
the accessors.
* The old `RowSetSchema` evolved to become the `TupleMetadata` mentioned
above.
* Tests were greatly enhanced to test all modes of all supported scalar
types, as well as the new JSON-like structure.
As before, the row set unit tests exercise the row set classes themselves,
but they are also the mechanism used to exercise the accessor classes.
See the `org.apache.drill.test.rowSet` package and its sub-packages for details.
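For flavor, a typical test sequence looks roughly like this. It is a sketch
only: the builder and comparison method names are approximations; see the unit
tests for the actual usage.
```
// Illustrative test sketch; method names are approximations.
BatchSchema schema = ...                   // e.g. (id: INT, name: VARCHAR)
SingleRowSet expected = new RowSetBuilder(allocator, schema)
    .add(1, "wilma")                       // one row per add() call
    .add(2, "fred")
    .build();
new RowSetComparison(expected)
    .verifyAndClear(actual);               // compare, then release buffers
```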
## Commit 3: Tuple and Column Models
In the previous version, the row set classes had complex logic to figure
out what kind of accessor to create for each vector. This became overly
complex. In the previous PR, the row set "parses" a vector container to create
"storage" objects that represent tuples and columns. A column can, itself, be a
tuple. (Note: there is no need to model lists since lists are just vectors at
this level of abstraction, so need no special handling.)
In this PR, that concept evolves again to become a "tuple model" and
"column model": a tree-structured representation of the structure of a
Drill row. The model cleanly represents things like maps, repeated maps, lists
of repeated maps that contain arrays, etc.
Think of the model as the "backbone" used to work with a batch's vector tree.
("Tree" because maps and lists contain other vectors.) Ideally, the vectors
themselves would provide this backbone. But, vectors are shared across
operators, so structure created by one operator would conflict with
structure needed by another.
With this change, accessor creation is a simple matter of walking a tree to
assemble the JSON-like structure.
This structure is also used to create a batch's vectors from a schema using
"visitor" classes, to allocate vectors based on metadata, and to handle other
chores needed in the result set loader.
See `org.apache.drill.exec.physical.rowSet.model`. The top package contains
interfaces. The two sub-packages provide implementations for single batches and
hyper batches. (Hyper batches are used only in the test "RowSet"
abstractions at present.)
### Tuple and Column Metadata
Also in this commit is a new "extended" metadata model that replaces
earlier versions in the row set and previous "loader" commits.
For reasons that will become clear in the next PR, the scan batch ends up
doing quite a bit of semantic analysis to map from the select list and the
table schema to the result schema. Drill provides a BatchSchema class that is
useful, but limited in this context. The metadata layer helps to solve this
problem by allowing fast access by both name and position, and allows the
schema to grow dynamically.
The dynamic aspect appears in this PR as well: the "loader" API (see
below) allows adding columns as the app proceeds; and that action internally
adds new columns to the batch schema.
The metadata classes evolved from an earlier version in the row set
abstractions. That code was extracted and extended to act as the foundation to
create the metadata layer.
The metadata model is simple: it represents tuples (rows, maps) and
columns. The metadata is intended to provide additional information beyond what
exists in the `MaterializedField` class. In particular, it provides hints about
VarChar width and array size. It also provides many useful methods to simplify
working with columns.
Eventually, it would be good to combine the new metadata classes with the
existing `BatchSchema` and `MaterializedSchema`. But, that is a big change that
must wait for another day.
## Commit 4: The Result Set Loader Layer
We now finally get to the ultimate goal of this PR: to introduce a new
"result set loader." This abstraction is an evolution of the "Mutator"
class in the scan batch. (The mutator manages a batch; some readers used the
existing "writers" with the mutator, others used different techniques such
as working directly with native memory.)
A result set loader loads a set of tuples (AKA records, rows) from any
source (such as a record reader) into a set of record batches. The loader:
* Divides records into batches based on a maximum row count or a maximum
vector size, whichever occurs first. (Later revisions may limit overall batch
size.)
* Tracks the start and end of each batch.
* Tracks the start and end of each row.
* Provides column loaders to write each column value.
* Handles overflow when a vector becomes full, but the client still must
finish writing the current row.
The original Mutator class divided up responsibilities:
* The Mutator handled the entire record batch
* An optional VectorContainerWriter writes each record
The result set loader follows this same general pattern.
* The result set loader handles the entire record batch (really, a series
of batches that make up the entire result set: hence the name.)
* The TupleLoader class provides per-tuple services, which consist mostly
of access to the column loaders.
* A tuple schema defines the schema for the result set (see below.)
To hide this complexity from the client, a ResultSetLoader interface
defines the public API. Then, a `ResultSetLoaderImpl` class implements the
interface with all the gory details. Separate classes handle each column, the
result set schema, and so on.
Recall that the overall goal of this PR is to handle memory fragmentation
by limiting vectors to 16 MB in size. The loader realizes that goal by silently
handling vector overflow: the "overflow" row is moved to a new
"look-ahead" batch. The complete batch is then sent downstream, and the
next read starts with the overflow row as the first row of the next batch.
This version is a bit simpler than the previous one because this one
leverages the model, metadata and accessor layers for much of the work.
### Loader API
The actual loader API turns out to be quite simple; another important goal
is to allow third parties to more easily create custom "record readers." For
example:
```
void writeABatch() {
  RowSetLoader writer = ...
  while (writer.start()) {
    writer.scalar(0).setInt(10);
    writer.scalar(1).setString("foo");
    ...
    writer.save();
  }
}
```
Note that, in this version, the `writer` above is defined in the result set
loader, but the writer is simply a tuple writer defined in the accessor layer.
By the time we get to the `setInt()` or `setString()` methods, the code does a
single method call to get into position to write the data. This eliminates the
extra "column loader" layer of the prior version of the result set loader.
### Writer and Loader Integration
The column writers are low-level classes that interface between a consumer
and a value vector. To create the tuple loader we need a way to bind the column
writers to the result set loader. For this, we use a pair of listeners.
* A tuple writer listener handles requests to add a new column at run time.
* A column listener handles vector overflow.
In both cases, the app works with the low-level accessors, but the listener
allows events to be fed back to the loader which handles the complex details of
dynamic schema evolution and of vector overflow (in fact, it can handle both at
the same time.)
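For orientation, the listener contracts are roughly of this shape. These are
hypothetical signatures based on the description above, not the exact
interfaces in the code.
```
// Hypothetical sketch of the two listener roles; names and signatures
// are illustrative, not the actual interfaces.
interface TupleWriterListener {
  // The app asked the tuple writer for a column that does not yet exist.
  ObjectWriter addColumn(TupleWriter tuple, MaterializedField colSchema);
}

interface ColumnWriterListener {
  // The writer wants to grow its vector; the loader may veto growth past 16 MB.
  boolean canExpand(ScalarWriter writer, int deltaBytes);
  // The vector hit its size limit; the loader swaps in a new vector.
  void overflowed(ScalarWriter writer);
}
```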
To handle vector overflow, each "set" method:
* Does a single bounds check to determine if the value is in the range of
the current buffer.
* If not, relocates the buffer, if possible.
* If the vector is full (reached the size limit), calls the listener to
handle overflow.
* The listener, via the vector model described earlier, dispatches the
request to the loader.
* The loader handles overflow, creating a new vector.
* The writer then writes the value into the new or existing buffer.
### Vector Overflow Logic
The heart of this abstraction is that last point: the ability to detect
when a vector overflows, switch in a new vector, and continue writing. Several
tricks are involved.
Suppose we have a row of five columns: a through e. The code writes a and
b. Then, c overflows. The code can't rewrite a and b. To handle this, the
tuple loader:
* Creates a new, small set of vectors called the "overflow batch".
* Copies columns a and b from the current batch to the overflow batch.
* Writes column c to the overflow batch.
* Allows the client code to finish writing columns d and e (to the overflow
batch).
* Reports to the client that the batch is full.
Note that the client is completely unaware that any of the above occurred:
it just writes a row and asks if it can write another.
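In outline, the loader-side handling looks something like this. This is purely
illustrative pseudo-Java; the helper names are invented for the sketch.
```
// Pseudo-Java outline of overflow handling; helper names are invented.
void handleOverflow(int overflowRow) {
  createOverflowBatch();                        // small "look-ahead" batch
  for (ColumnState col : columnsWrittenSoFar) { // e.g. columns a and b
    copyValue(col, overflowRow, overflowBatch); // move the partial row over
  }
  bindWritersTo(overflowBatch);                 // c, d and e now write here
  batchFull = true;                             // reported when the row is saved
}
```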
### Skipping Columns
The loader must also handle a reader, such as Parquet, that skips columns
when they are null. There were bugs in Drill's vectors for this case, and
temporary patches were made in a number of places to make this work. The trick
should also work for arrays (a null array is allowed; Drill represents it as an
empty array), but this code was also broken. For good measure, the code now
also allows skipping non-null columns if a good "empty" value is available:
0 for numbers, blank for strings. This behavior is needed for the CSV reader:
if a line is missing a field, the CSV reader treats it as an empty (not null)
field.
### Result Set Schema
The tuple loader is designed to handle two kinds of tables: "early
schema" tables (such as Parquet and CSV) define the schema up front; "late
schema" tables (such as JSON) discover the schema during reading. The tuple
loader allows either form and, in fact, uses the same mechanism for both.
Consumers of batches will, of course, want to know that the schema changed.
Providing a simple flag is muddy: when should it be reset? A better solution is
to provide a schema version which is incremented each time a column is added.
(Columns cannot be removed or changed, at least not yet.)
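Illustratively, a consumer can then do something like the following (a sketch
only; the accessor name is an assumption):
```
// Sketch: detect that new columns appeared since the last batch.
int version = rsLoader.schemaVersion();   // bumps each time a column is added
if (version != lastSeenVersion) {
  handleSchemaChange();                   // rebuild downstream state once
  lastSeenVersion = version;
}
```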
### Internal Plumbing
The loader assembles the many pieces described earlier:
* The vectors themselves
* The writers (tuples, objects, arrays, scalars)
* The vector model (tree)
* Tuple and column metadata
* The loader logic for handling overflow, etc.
The solution is based on a strict separation of concerns via layering and
loose coupling.
* The vector model is the "backbone" that coordinates the other pieces.
* Metadata provides instructions for building the vector model.
* The vector model builds the corresponding vector and writer.
* The writer listener sends events (see above) to the vector model.
* The vector model dispatches events to the loader via a "coordinator"
interface.
* The loader performs batch, row and column operations via the vector model.
Please see the extensive Javadoc description in the
`org.apache.drill.exec.physical.rowSet` package, especially in the
`package-info.java` file in the `impl` sub-package.
### Result Vector Cache
Above we mentioned that the tuple loader allows schema changes on the fly.
As the next PR will make more clear, downstream operators want a fixed set of
vectors. To assist with this, the tuple loader uses a "result vector
cache". Let's say a scanner reads two JSON files with the same schema. The
first creates the schema and vectors. The second is obligated to use the same
vectors. This is a royal pain. But, the vector cache does it automatically:
when the tuple loader adds a new column, it checks whether the vector already
exists in the cache and reuses it. If it is not there, the cache adds it and
returns it so that it is there the next time around.
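The cache contract amounts to a look-up-or-create keyed by the column, roughly
as below. This is an illustrative sketch, not the actual code; it leans on the
`isEquivalent()` check described earlier.
```
// Illustrative sketch of the result vector cache: reuse a vector if an
// equivalent one exists, otherwise create and remember it.
public ValueVector addOrGet(MaterializedField colSchema) {
  ValueVector vector = vectors.get(colSchema.getName());
  if (vector == null || !vector.getField().isEquivalent(colSchema)) {
    vector = TypeHelper.getNewVector(colSchema, allocator);
    vectors.put(colSchema.getName(), vector);
  }
  return vector;      // the same vector comes back for the next file/schema
}
```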
### Unit Tests
All of the above is pretty thoroughly tested via unit tests. In fact, the
unit tests are a good place to start (now I tell you!) in order to see how
client code uses the various abstractions.
The bit of unit test structure that handled system options turned out to be
wrong. Modified it to use the defaults defined in the system option manager,
which required changing the visibility of the defaults table.
## Commit 5: Collateral Damage
The last commit contains various other changes, mostly reflecting the
changes above.
Some unit tests were updated to use new features which became available in
this PR. See `TestFillEmpties` and `TestVectorLimits`.
The `equals()` method in `BatchSchema` is badly broken. Cleaned it up some.
But, didn't want to change it too much in case anything depends on the
current, broken, semantics. So, added a new `isEquivalent()` method to provide
the correct semantics. Added an `isEquivalent()` method to
`MaterializedField` as well that will ignore the "implementation" columns
that hang off of types such as nullables, repeated, etc. That is, two repeated
columns are identical if their type is identical, regardless of whether one has
the "$offsets" child or not.
## Open Issues
This PR provides most of the required functionality. However, a few bits
and pieces are still in progress and will appear in subsequent commits and/or
PRs:
* Full testing of repeated maps
* Support for lists and repeated lists (will reuse structure from repeated
maps)
* Column projection (was in an earlier version, new version needed for this
structure)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/paul-rogers/drill DRILL-5657-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/914.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #914
----
commit 68c0520492fc053eb74a9f02e9fa3f01be53ab83
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:41:30Z
DRILL-5657: Size-aware vector writer structure
Commit 1: The vector and accessor layer
commit d2f7b2050764b8d4630158cd34dde0291377d3d7
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:43:54Z
Commit 2: Row Set layer
Includes the test-oriented row set abstractions as well as unit tests
for the accessor layer
commit 7aff7a791f786db24853752c91031ee9121c8de3
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:47:10Z
Commit 3: The tuple and column models
Also includes the implementation of the tuple and column metadata.
commit 4ef94ffd15f0570398a4ab39c9bffc77bc777473
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:48:00Z
Commit 4: The "result set loader" layer
commit cf2973278bc7aa29a5cf28fd320f84b08bd49743
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:49:13Z
Commit 5: Collateral damage
Includes tests that had to change when the row set interface changed
and a number of test cleanup items.
----