GitHub user paul-rogers opened a pull request:
https://github.com/apache/drill/pull/914
DRILL-5657: Size-aware vector writer structure
This large PR provides another two levels of foundation for size-aware
vector writers in the Drill record readers. It combines code from two previous
PRs:
* PR 866 - DRILL-5657: Implement size-aware result set loader
* PR 887 - DRILL-5688: Add repeated map support to column accessors
The PR then goes on to integrate the two prior PRs and provide additional
functionality.
Like the two previous PRs, this one is divided into commits to group the
work.
1. Accessor layer
2. Row Set layer
3. Tuple and Column Model layer
4. Row Set Loader layer
5. Secondary changes
Much of the material below appears in Javadoc throughout the code. The
material here is not meant to replace that documentation. Instead, it is meant
to provide the "big picture": placing the bits and pieces in context and
pointing out interesting functionality to explore in each layer.
## Commit 1: The Accessor Layer
The first commit provides the core of the mechanism: the writers that put
data into vectors, and the readers that retrieve that data. The version here is
an evolution of the version provided in an earlier PR a few months ago.
### Overview of the Drill Vector Data Model
The code will make much more sense if we start with a review of Drill's
complex vector data model. Drill has 38+ data ("minor") types as defined in
the [protobuf](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/Types.proto)
definition. Drill also has three cardinalities ("modes") defined in the
same file. The result is over 120 different vector types. Then, when you add
maps, repeated maps, lists and repeated lists, you rapidly get an explosion of
types that the writer code must handle.
Vectors can be categorized along multiple dimensions:
* By data (minor) type
* By cardinality (mode)
* By fixed or variable width
A repeated map, a list, a repeated list, and any array (repeated) scalar
are all array-like. Nullable and required modes are identical (single values),
but a nullable type has an additional is-set ("bit") vector.
A key contribution of this PR is the data model used to organize vectors.
* Both the top-level row and a Drill map are "tuples" and are treated
similarly in the model.
* All non-map, non-list (that is, scalar) data types are treated uniformly.
* All arrays (whether a list, a repeated list, a repeated map, or a
repeated scalar) are treated uniformly.
### Accessor Data Model
The above leads to a very simple, JSON-like data model, introduced in this
PR.
* A tuple reader or writer models a row. (Usually via a subclass.) Columns
are accessible by name or position.
* Every column is modeled as an object.
* The object can have an object type: scalar, tuple or array.
* An array has a single element type (but many run-time elements).
* A scalar can be nullable or not, and provides a uniform get/set interface.
This data model is similar to, but has important differences from, the
prior, generated readers and writers.
The object layer is new: it is the simplest way to model the three
"object types." An app using this code would use just the leaf scalar
readers and writers.
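For example, a client might navigate the model roughly as follows. This is a
sketch only: the `scalar(int)` calls mirror the loader example later in this
description, while the by-name and map/array accessors are assumptions about
the API, not the exact method names.
```
// Sketch of the JSON-like accessor model; by-name, tuple() and array()
// accessors shown here are assumptions, not necessarily the exact API.
TupleWriter row = ...                       // a row (or a map) is a tuple
row.scalar(0).setInt(10);                   // scalar column by position
row.scalar("name").setString("fred");       // or by name
TupleWriter addr = row.tuple("address");    // a map column is a nested tuple
addr.scalar("city").setString("Santa Clara");
ArrayWriter tags = row.array("tags");       // an array has a single element type
tags.scalar().setString("x");               // write one array element
```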
Although there is quite a bit of code change here to provide the new
structure, the core functionality of reading and writing to vectors has not
changed much. And, this code has extensive unit tests, which should avoid the
need to "mentally execute" each line of code.
See the classes in `org.apache.drill.exec.vector.accessor` for details. In
particular, please see the `package-info.java` file in that package for more
information.
As before, the top package provides a set of interfaces; the inner packages
provide the implementation. The `ColumnAccessors.java` template generates the
per-vector code. Warning: this template has become quite cryptic: the best bet
for review is to generate the Java code and review that.
### Writer Performance
During previous review, we discussed ways to optimize writer performance.
This PR has two improvements:
* Completely rework the writers to minimize code steps
* Rework the "column loaders" to eliminate them: instead of two
additional method calls, the "loader" now uses the column writers directly.
Much behind-the-scenes rework was needed to accomplish the above.
Column readers, however, were left with their existing structure; we can
apply the same optimizations to the readers in a later PR.
While doing the performance optimization, it became clear we can adopt a
major simplification. Writers evolved from having three versions (nullable,
required, repeated) to having a single version to handle all three cases. Some
items of note:
* The writers bypass DrillBuf and the UDLE to do the needed writes to direct
memory.
* The writers cache the buffer address and implement a number of methods
to synchronize that address when the buffer changes (on a new batch or during
vector resize).
* Writing requires a single bounds check. In most cases, the write is within
bounds so the single check is all that is needed. (See the sketch after this
list.)
* If the write is out of bounds, then the writer determines the new vector
size and performs the needed reallocation. To avoid multiple doublings, the
writer computes the needed new size and allocates that size directly.
* Vector reallocation is improved to eliminate zeroing the new half of the
buffer; data is left "garbage-filled."
* If the vector would grow beyond 16 MB, then overflow is triggered, via a
listener, which causes the buffer to be replaced. The write then continues.
* Offset vector updates are integrated into the writers using an
`OffsetVectorWriter`. This writer caches the last write position so that each
array write needs a single offset update, rather than the read and write as in
previous code.
* The writers keep track of the "last write position" and perform
"fill-empties" work if the new write position is more than one position
behind the last write. All types now correctly support "fill-empties"
(before, only nullable types did so reliably).
* Null handling is done by an additional writer layer that wraps the
underlying data writer. This avoids the need for a special nullable writer: the
same nullable layer works for all data types.
* Array handling is done similarly: an array writer manages the offset
vector and works the same for repeated scalars, repeated maps and (eventually)
lists and repeated lists.
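As a rough illustration of the points above, the hot path of a fixed-width
"set" method looks something like the following. This is simplified pseudo-Java;
the names (`resizeOrOverflow`, `fillEmpties`, `bufAddr`, etc.) are illustrative,
not the actual code.
```
// Illustrative sketch of a fixed-width setInt(); not the actual implementation.
public void setInt(int value) {
  final int offset = writeIndex * VALUE_WIDTH;
  if (offset + VALUE_WIDTH > capacity) {
    // Slow path: compute the needed size and reallocate once (no repeated
    // doubling), or trigger the overflow listener at the 16 MB limit.
    resizeOrOverflow(offset + VALUE_WIDTH);
  }
  fillEmpties(writeIndex);            // back-fill rows skipped since the last write
  PlatformDependent.putInt(bufAddr + offset, value);   // direct memory write
  lastWriteIndex = writeIndex;
}
```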
Since vector overflow is now incorporated directly into the writers, this
PR backs out the `setScalar()` and related methods added to value vectors in a
previous commit.
### Supporting Classes
A new `TupleMetadata` class is a superset of the existing `BatchSchema`,
but also provides "classic" tuple-like access by position or name. Eventually,
this will also hold additional information such as actual size and so on
(information now laboriously rediscovered by the "record batch sizer.") Since
the accessors use a "tuple" abstraction to model both rows and maps, the tuple
metadata provides this same view. The top-most tuple models the row. Columns
within the row can be maps, which have their own tuple schema, and so on.
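A brief sketch of the intended access pattern follows; the method names here
are assumptions based on the description above, not necessarily the exact API.
```
// Illustrative only: metadata access by name or position, including a
// nested map schema. Method names are assumptions.
TupleMetadata rowSchema = ...                     // the top-most tuple models the row
ColumnMetadata idCol = rowSchema.metadata("id");  // by name
ColumnMetadata addrCol = rowSchema.metadata(2);   // or by position
if (addrCol.isMap()) {
  TupleMetadata addrSchema = addrCol.mapSchema(); // a map carries its own tuple schema
}
```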
`TupleNameSpace` moves locations (so it can be used in the vector package)
but otherwise remains unchanged.
`DrillBuf` provides an experimental `putInt()` method that does bounds
checking and sets a value, to minimize calls. This will probably move into the
writer in a later PR.
This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along
Decimal scale and precision. See `RepeatedValueVectors.java`.
`MaterializedField` changes to add an `isEquivalent()` method to compare
two fields, ignoring internal (`$offset$`, `$bits$`, etc.) vectors.
## Commit 2: Row Set Layer
The "Row Set" family of classes allows tests to quickly build and
analyze vector containers in the form of a "row set" - a set of rows. In
prior commits, the row set abstraction was the primary client of the accessors
(readers and writers). In this commit, much functionality shifts to be shared
with the new "loader" abstraction intended for production code.
Key changes in response to the accessor changes:
* The reader and writer are moved to separate files.
* Row sets now use a common "model" to describe a vector tree (more
below).
* Static factory methods were added to hide constructor complexity.
* The `RowSetBuilder` and `RowSetComparison` test tools added support for
repeated maps.
* Code to handle generic object writing moved from the `RowSetBuilder` into
the accessors.
* The old `RowSetSchema` evolved to become the `TupleMetadata` mentioned
above.
* Tests were greatly enhanced to test all modes of all supported scalar
types, as well as the new JSON-like structure.
As before, the row set unit tests exercise the row set classes themselves,
but they are also the mechanism used to exercise the accessor classes.
See the `org.apache.drill.test.rowSet` package and its sub-packages for details.
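For flavor, a typical test sequence looks roughly like this. It is a sketch
only: the builder and comparison method names are approximations; see the unit
tests for the actual usage.
```
// Illustrative test sketch; method names are approximations.
BatchSchema schema = ...                   // e.g. (id: INT, name: VARCHAR)
SingleRowSet expected = new RowSetBuilder(allocator, schema)
    .add(1, "wilma")                       // one row per add() call
    .add(2, "fred")
    .build();
new RowSetComparison(expected)
    .verifyAndClear(actual);               // compare, then release buffers
```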
## Commit 3: Tuple and Column Models
In the previous version, the row set classes had complex logic to figure
out what kind of accessor to create for each vector. This became overly
complex. In the previous PR, the row set "parses" a vector container to create
"storage" objects that represent tuples and columns. A column can, itself, be a
tuple. (Note: there is no need to model lists since lists are just vectors at
this level of abstraction, so need no special handling.)
In this PR, that concept evolves again to become a "tuple model" and
"column model": a tree-structured representation of the structure of a
Drill row. The model cleanly represents things like maps, repeated maps, lists
of repeated maps that contain arrays, etc.
Think of the model as the "backbone" used to work with a batch's vector tree.
("Tree" because maps and lists contain other vectors.) Ideally, the vectors
themselves would provide this backbone. But, vectors are shared across
operators, so structure created by one operator would conflict with
structure needed by another.
With this change, accessor creation is a simple matter of walking a tree to
assemble the JSON-like structure.
This structure is also used to create a batch's vectors from a schema using
"visitor" classes, to allocate vectors based on metadata, and to handle other
chores needed in the result set loader.
See `org.apache.drill.exec.physical.rowSet.model`. The top package contains
interfaces. The two sub-packages provide implementations for single batches and
hyper batches. (Hyper batches are used only in the test "RowSet"
abstractions at present.)
### Tuple and Column Metadata
Also in this commit is a new "extended" metadata model that replaces
earlier versions in the row set and previous "loader" commits.
For reasons that will become clear in the next PR, the scan batch ends up
doing quite a bit of semantic analysis to map from the select list and the
table schema to the result schema. Drill provides a BatchSchema class that is
useful, but limited in this context. The metadata layer helps to solve this
problem by allowing fast access by both name and position, and allows the
schema to grow dynamically.
The dynamic aspect appears in this PR as well: the "loader" API (see
below) allows adding columns as the app proceeds; and that action internally
adds new columns to the batch schema.
The metadata classes evolved from an earlier version in the row set
abstractions. That code was extracted and extended to act as the foundation to
create the metadata layer.
The metadata model is simple: it represents tuples (rows, maps) and
columns. The metadata is intended to provide additional information beyond what
exists in the `MaterializedField` class. In particular, it provides hints about
VarChar width and array size. It also provides many useful methods to simplify
working with columns.
Eventually, it would be good to combine the new metadata classes with the
existing `BatchSchema` and `MaterializedSchema`. But, that is a big change that
must wait for another day.
## Commit 4: The Result Set Loader Layer
We now finally get to the ultimate goal of this PR: to introduce a new
"result set loader." This abstraction is an evolution of the "Mutator"
class in the scan batch. (The mutator manages a batch; some readers used the
existing "writers" with the mutator, others used different techniques such
as working directly with native memory.)
A result set loader loads a set of tuples (AKA records, rows) from any
source (such as a record reader) into a set of record batches. The loader:
* Divides records into batches based on a maximum row count or a maximum
vector size, whichever occurs first. (Later revisions may limit overall batch
size.)
* Tracks the start and end of each batch.
* Tracks the start and end of each row.
* Provides column loaders to write each column value.
* Handles overflow when a vector becomes full, but the client still must
finish writing the current row.
The original Mutator class divided up responsibilities:
* The Mutator handled the entire record batch
* An optional VectorContainerWriter writes each record
The result set loader follows this same general pattern.
* The result set loader handles the entire record batch (really, a series
of batches that make up the entire result set: hence the name.)
* The TupleLoader class provides per-tuple services, which consist mostly
of access to the column loaders.
* A tuple schema defines the schema for the result set (see below.)
To hide this complexity from the client, a ResultSetLoader interface
defines the public API. Then, a `ResultSetLoaderImpl` class implements the
interface with all the gory details. Separate classes handle each column, the
result set schema, and so on.
Recall that the overall goal of this PR is to handle memory fragmentation
by limiting vectors to 16 MB in size. The loader realizes that goal by silently
handling vector overflow: the "overflow" row is moved to a new
"look-ahead" batch. The complete batch is then sent downstream, and the
next read starts with the overflow row as the first row of the next batch.
This version is a bit simpler than the previous one because this one
leverages the model, metadata and accessor layers for much of the work.
### Loader API
The actual loader API turns out to be quite simple; another important goal
is to allow third parties to more easily create custom "record readers." For
example:
```
void writeABatch() {
  RowSetLoader writer = ...
  while (writer.start()) {
    writer.scalar(0).setInt(10);
    writer.scalar(1).setString("foo");
    ...
    writer.save();
  }
}
```
Note that, in this version, the `writer` above is defined in the result set
loader, but the writer is simply a tuple writer defined in the accessor layer.
By the time we get to the `setInt()` or `setString()` methods, the code does a
single method call to get into position to write the data. This eliminates the
extra "column loader" layer of the prior version of the result set loader.
### Writer and Loader Integration
The column writers are low-level classes that interface between a consumer
and a value vector. To create the tuple loader we need a way to bind the column
writers to the result set loader. For this, we use a pair of listeners.
* A tuple writer listener handles requests to add a new column at run time.
* A column listener handles vector overflow.
In both cases, the app works with the low-level accessors, but the listener
allows events to be fed back to the loader which handles the complex details of
dynamic schema evolution and of vector overflow (in fact, it can handle both at
the same time.)
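For orientation, the listener contracts are roughly of this shape. These are
hypothetical signatures based on the description above, not the exact
interfaces in the code.
```
// Hypothetical sketch of the two listener roles; names and signatures
// are illustrative, not the actual interfaces.
interface TupleWriterListener {
  // The app asked the tuple writer for a column that does not yet exist.
  ObjectWriter addColumn(TupleWriter tuple, MaterializedField colSchema);
}

interface ColumnWriterListener {
  // The writer wants to grow its vector; the loader may veto growth past 16 MB.
  boolean canExpand(ScalarWriter writer, int deltaBytes);
  // The vector hit its size limit; the loader swaps in a new vector.
  void overflowed(ScalarWriter writer);
}
```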
To handle vector overflow, each "set" method:
* Does a single bounds check to determine if the value is in the range of
the current buffer.
* If not, relocates the buffer, if possible.
* If the vector is full (reached the size limit), calls the listener to
handle overflow.
* The listener, via the vector model described earlier, dispatches the
request to the loader.
* The loader handles overflow, creating a new vector.
* The writer then writes the value into the new or existing buffer.
### Vector Overflow Logic
The heart of this abstraction is that last point: the ability to detect
when a vector overflows, switch in a new vector, and continue writing. Several
tricks are involved.
Suppose we have a row of five columns: a through e. The code writes a and
b. Then, c overflows. The code can't rewrite a and b. To handle this, the
tuple loader:
* Creates a new, small set of vectors called the "overflow batch".
* Copies columns a and b from the current batch to the overflow batch.
* Writes column c to the overflow batch.
* Allows the client code to finish writing columns d and e (to the overflow
batch).
* Reports to the client that the batch is full.
Note that the client is completely unaware that any of the above occurred:
it just writes a row and asks if it can write another.
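In outline, the loader-side handling looks something like this. This is purely
illustrative pseudo-Java; the helper names are invented for the sketch.
```
// Pseudo-Java outline of overflow handling; helper names are invented.
void handleOverflow(int overflowRow) {
  createOverflowBatch();                        // small "look-ahead" batch
  for (ColumnState col : columnsWrittenSoFar) { // e.g. columns a and b
    copyValue(col, overflowRow, overflowBatch); // move the partial row over
  }
  bindWritersTo(overflowBatch);                 // c, d and e now write here
  batchFull = true;                             // reported when the row is saved
}
```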
### Skipping Columns
The loader must also handle a reader, such as Parquet, that skips columns
when they are null. There were bugs in Drill's vectors for this case, and
temporary patches were made in a number of places to make this work. The trick
should also work for arrays (a null array is allowed; Drill represents it as an
empty array), but this code was also broken. For good measure, the code now
also allows skipping non-null columns if a good "empty" value is available:
0 for numbers, blank for strings. This behavior is needed for the CSV reader:
if a line is missing a field, the CSV reader treats it as an empty (not null)
field.
### Result Set Schema
The tuple loader is designed to handle two kinds of tables: "early
schema" tables (such as Parquet and CSV) define the schema up front; "late
schema" tables (such as JSON) discover the schema during reading. The tuple
loader allows either form and, in fact, uses the same mechanism for both.
Consumers of batches will, of course, want to know that the schema changed.
Providing a simple flag is muddy: when should it be reset? A better solution is
to provide a schema version which is incremented each time a column is added.
(Columns cannot be removed or changed, at least not yet.)
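Illustratively, a consumer can then do something like the following (a sketch
only; the accessor name is an assumption):
```
// Sketch: detect that new columns appeared since the last batch.
int version = rsLoader.schemaVersion();   // bumps each time a column is added
if (version != lastSeenVersion) {
  handleSchemaChange();                   // rebuild downstream state once
  lastSeenVersion = version;
}
```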
### Internal Plumbing
The loader assembles the many pieces described earlier:
* The vectors themselves
* The writers (tuples, objects, arrays, scalars)
* The vector model (tree)
* Tuple and column metadata
* The loader logic for handling overflow, etc.
The solution is based on a strict separation of concerns via layering and
loose coupling.
* The vector model is the "backbone" that coordinates the other pieces.
* Metadata provides instructions for building the vector model.
* The vector model builds the corresponding vector and writer.
* The writer listener sends events (see above) to the vector model.
* The vector model dispatches events to the loader via a "coordinator"
interface.
* The loader performs batch, row and column operations via the vector model.
Please see the extensive Javadoc description in the
`org.apache.drill.exec.physical.rowSet` package, especially in the
`package-info.java` file in the `impl` sub-package.
### Result Vector Cache
Above we mentioned that the tuple loader allows schema changes on the fly.
As the next PR will make more clear, downstream operators want a fixed set of
vectors. To assist with this, the tuple loader uses a "result vector
cache". Let's say a scanner reads two JSON files with the same schema. The
first creates the schema and vectors. The second is obligated to use the same
vectors. This is a royal pain. But, the vector cache does it automatically:
when the tuple loader adds a new column, it checks whether the vector already
exists in the cache and reuses it. If it is not there, the cache adds it and
returns it so that it is there the next time around.
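The cache contract amounts to a look-up-or-create keyed by the column, roughly
as below. This is an illustrative sketch, not the actual code; it leans on the
`isEquivalent()` check described earlier.
```
// Illustrative sketch of the result vector cache: reuse a vector if an
// equivalent one exists, otherwise create and remember it.
public ValueVector addOrGet(MaterializedField colSchema) {
  ValueVector vector = vectors.get(colSchema.getName());
  if (vector == null || !vector.getField().isEquivalent(colSchema)) {
    vector = TypeHelper.getNewVector(colSchema, allocator);
    vectors.put(colSchema.getName(), vector);
  }
  return vector;      // the same vector comes back for the next file/schema
}
```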
### Unit Tests
All of the above is pretty thoroughly tested via unit tests. In fact, the
unit tests are a good place to start (now I tell you!) in order to see how
client code uses the various abstractions.
The bit of unit test structure that handled system options turned out to be
wrong. Modified it to use the defaults defined in the system option manager,
which required changing the visibility of the defaults table.
## Commit 5: Collateral Damage
The last commit contains various other changes, mostly reflecting the
changes above.
Some unit tests were updated to use new features which became available in
this PR. See `TestFillEmpties` and `TestVectorLimits`.
The `equals()` method in `BatchSchema` is badly broken. Cleaned it up some.
But, didn't want to change it too much in case anything depends on the
current, broken, semantics. So, added a new `isEquivalent()` method to provide
the correct semantics. Added an `isEquivalent()` method to
`MaterializedField` as well that will ignore the "implementation" columns
that hang off of types such as nullables, repeated, etc. That is, two repeated
columns are identical if their type is identical, regardless of whether one has
the "$offsets" child or not.
## Open Issues
This PR provides most of the required functionality. However, a few bits
and pieces are still in progress and will appear in subsequent commits and/or
PRs:
* Full testing of repeated maps
* Support for lists and repeated lists (will reuse structure from repeated
maps)
* Column projection (was in an earlier version, new version needed for this
structure)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/paul-rogers/drill DRILL-5657-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/914.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #914
----
commit 68c0520492fc053eb74a9f02e9fa3f01be53ab83
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:41:30Z
DRILL-5657: Size-aware vector writer structure
Commit 1: The vector and accessor layer
commit d2f7b2050764b8d4630158cd34dde0291377d3d7
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:43:54Z
Commit 2: Row Set layer
Includes the test-oriented row set abstractions as well as unit tests
for the accessor layer
commit 7aff7a791f786db24853752c91031ee9121c8de3
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:47:10Z
Commit 3: The tuple and column models
Also includes the implementation of the tuple and column metadata.
commit 4ef94ffd15f0570398a4ab39c9bffc77bc777473
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:48:00Z
Commit 4: The "result set loader" layer
commit cf2973278bc7aa29a5cf28fd320f84b08bd49743
Author: Paul Rogers <[email protected]>
Date: 2017-08-18T05:49:13Z
Commit 5: Collateral damage
Includes tests that had to change when the row set interface changed
and a number of test cleanup items.
----