[ https://issues.apache.org/jira/browse/DRILL-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131762#comment-16131762 ]
ASF GitHub Bot commented on DRILL-5657: --------------------------------------- GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/914 DRILL-5657: Size-aware vector writer structure This large PR provides another two levels of foundation for size-aware vector writers in the Drill record readers. It combines code from two previous PRS: * PR 866 - DRILL-5657: Implement size-aware result set loader * PR 887 - DRILL-5688: Add repeated map support to column accessors The PR then goes on to integrate the two prior PRs and provide additional functionality. Like the two previous PRs, this one is divided into commits to group the work. 1. Accessor layer 2. Row Set layer 3. Tuple and Column Model layer 4. Row Set Loader layer 5. Secondary changes Much of the material below appears in Javadoc throughout the code. The material here is not meant to replace that documentation. Instead, it is meant to provide the “big picture”: placing the bits and pieces in context and pointing out interesting functionality to explore in each layer. ## Commit 1: The Accessor Layer The first commit provides the core of the mechanism: the writers that put data into vectors, and the readers that retrieve that data. The version here is an evolution of the version provided in an earlier PR a few months ago. ### Overview of the Drill Vector Data Model The code will make much more sense if we start with a review of Drill’s complex vector data model. Drill has 38+ data (“minor”) types as defined in the [proto buf](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/Types.proto) definition. Drill also has three cardinalities (“modes”) defined in the same file. The result is over 120+ different vector types. Then, when you add maps, repeated maps, lists and repeated lists, you rapidly get an explosion of types that the writer code must handle. Vectors can be categorized along multiple dimensions: * By data (minor) type * By cardinality (mode) * By fixed or variable width A repeated map, a list, a repeated list and any array (repeated) scalar all are array-like. Nullable and required modes are identical (single values), but a nullable has an additional is-set (“bit”) vector. A key contribution of this PR is the data model used to organize vectors. * Both the top-level row, and a Drill map are “tuples” and are treated similarly in the model. * All non-map, non-list (that is, scalar) data types are treated uniformly. * All arrays (whether a list, a repeated list, a repeated map, or a repeated scalar) are treated uniformly. ### Accessor Data Model The above leads to a very simple, JSON-like data model, introduced in this PR. * A tuple reader or writer models a row. (Usually via a subclass.) Column are accessible by name or position. * Every column is modeled as an object. * The object can have an object type: scalar, tuple or array. * An array has a single element type (but many run-time elements) * A scalar can be nullable or not, and provides a uniform get/set interface. This data model is similar to; but has important differences from, the prior, generated, readers and writers. The object layer is new: it is the simplest way to model the three “object types.” An app using this code would use just the leaf scalar readers and writers. Although there is quite a bit of code change here to provide the new structure the core functionality of reading and writing to vectors has not changed much. And, this code has extensive unit tests, which should avoid the need to "mentally execute" each line of code. See the classes in `org.apache.drill.exec.vector.accessor` for details. In particular, please see the `package-info.java` file in that package for more information. As before, the top package provides a set of interfaces; the inner packages provide the implementation. The `ColumnAccessors.java` template generates the per-vector code. Warning: this template has become quite cryptic: the best bet for review is to generate the Java code and review that. ### Writer Performance During previous review, we discussed ways to optimize writer performance. This PR has two improvements: * Completely rework the writers to minimize code steps * Rework the “column loaders” to eliminate them: instead of two additional method calls, the “loader” now uses the column writers directly. Much behind-the-scenes rework was needed to accomplish the above. Column readers, however, were left with their existing structure; we can apply the same optimizations to the readers in a later PR. While doing the performance optimization, it became clear we can adopt a major simplification. Writers evolved from having three versions (nullable, required, repeated) to having a single version to handle all three cases. Some items of note: * The writers bypass DrillBuf and the UDLE to needed writes to direct memory. * The writers buffer the buffer address and implement a number of methods to synchronize that address when the buffer changes (on a new batch or during vector resize). * Writing require a single bounds check. In most cases, the write is within bounds so the single check is all that is needed. * If the write is out of bounds, then the writer determines the new vector size and performs the needed reallocation. To avoid multiple doublings, the writer computes the needed new size and allocates that size directly. * Vector reallocation is improved to eliminate zeroing the new half of the buffer, data is left “garbage-filled.” * If the vector would grow beyond 16 MB, then overflow is triggered, via a listener, which causes the buffer to be replaced. The write then continues. * Offset vector updates are integrated into the writers using an `OffsetVectorWriter`. This writer caches the last write position so that each array write needs a single offset update, rather than the read and write as in previous code. * The writers keep track of the “last write position” and perform “fill-empties” work if the new write position is more than one position behind the last write. All types now correctly support “fill-empties” (before, only nullable types did so reliably.) * Null handling is done by an additional writer layer that wraps the underlying data writer. This avoids the need for a special nullable writer: the same nullable layer works for all data types. * Array handling is done similarly: an array writer manages the offset vector and works the same for repeated scalars, repeated maps and (eventually) lists and repeated lists. Since vector overflow is now incorporated directly into the writers, this PR backs out the `setScalar()` and related methods added to value vectors in a previous commit. ### Supporting Classes A new `TupleMetadata` class is a superset of the existing `BatchSchema`, but also provides "classic" tuple-like access by position or name. Eventually, this will also hold additional information such as actual size and so on (information now laboriously rediscovered by the "record batch sizer.") Since the accessors use a "tuple" abstraction to model both rows and maps, the tuple metadata provides this same view. The top-most tuple models the row. Columns within the row can be maps, which have their own tuple schema, and so on. `TupleNameSpace` moves locations (so it can be used in the vector package) but otherwise remains unchanged. `DrillBuf` provides an experimental `putInt()` method that does bounds checking and sets a value, to minimize calls. This will probably move into the writer in a later PR. This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along Decimal scale and precision. See `RepeatedValueVectors.java`. `MaterializedField` changes to add an `isEquivalent()` method to compare two fields, ignoring internal (`$offset$`, `$bits$`, etc.) vectors. ## Commit 2: Row Set Layer The “Row Set” family of classes allow tests to quickly build and analyze vector containers in the form of a “row set” - a set of rows. In prior commits, the row set abstraction was the primary client of the accessors (readers and writers.) In this commit, much functionality shifts to be shared with the new “loader” abstraction intended or production code. Key changes in response to the accessor changes: * The reader and writer are moved to separate files. * Row sets now use a common “model” to describe a vector tree (more below). * Static factory methods were added to hide constructor complexity. * The `RowSetBuilder` and `RowSetComparison` test tools added support for repeated maps. * Code to handle generic object writing moved from the `RowSetBuilder` into the accessors. * The old `RowSetSchema` evolved to become the `TupleMetadata` mentioned above. * Tests were greatly enhanced to test all modes of all supported scalar types, as well as the new JSON-like structure. As before the row set unit tests exercise the row set classes themselves, but also are the mechanism to exercise the accessor classes. See the `org.apache.drill.test.rowSet` and its sub-packages for details. ## Commit 3: Tuple and Column Models In the previous version, the row set classes had complex logic to figure out what kind of accessor to create for each vector. This became overly complex. In the previous PR, the row set "parses" a vector container to create "storage" objects that represent tuples and columns. A column can, itself, be a tuple. (Note: there is no need to model lists since lists are just vectors at this level of abstraction, so need no special handling.) In this PR, that concept evolves again to become a “tuple model” and “column model”: a tree-structured representation of the structure of a Drill row. The model cleanly represents things like maps, repeated maps, list of repeated maps that contain arrays, etc. Think of the model as the “backbone” to work with a batch vector tree. (“Tree” because maps and lists contain other vectors.) Ideally, the vectors themselves would provide this backbone. But, vectors are shared across operators, so that structure created by one operator would conflict with structure needed by another. With this change, accessor creation is a simple matter of walking a tree to assemble the JSON-structure. This structure is also used to create a batch's vectors from a schema using “visitor” classes, to allocate vectors based on metadata and other chores needed in the result set loader. See `org.apache.drill.exec.physical.rowSet.model`. The top package contains interfaces. The two sub packages provide implementations for single batches and hyper batches. (Hyper batches are used only in the test “RowSet” abstractions at present.) ### Tuple and Column Metadata Also in this commit is a new “extended” metadata model that replaces earlier versions in the row set and previous “loader” commits. For reasons that will become clear in the next PR, the scan batch ends up doing quite a bit of semantic analysis to map from the select list and the table schema to the result schema. Drill provides a BatchSchema class that is useful, but limited in this context. The metadata layer helps to solve this problem by allowing fast access by both name and position, and allows the schema to grow dynamically. The dynamic aspect appears in this PR as well: the “loader” API (see below) allows adding columns as the app proceeds; and that action internally adds new columns to the batch schema. The metadata classes evolved from an earlier version in the row set abstractions. That code was extracted and extended to act as the foundation to create the metadata layer. The metadata model is simple: it represents tuples (rows, maps) and columns. The metadata is intended to provide addition information beyond what exists in the `MaterializedField` class. In particular, if provides hints about VarChar width and array size. It also provides many useful methods to simplify working with columns. Eventually, it would be good to combine the new metadata classes with the existing `BatchSchema` and `MaterializedSchema`. But, that is a big change that must wait for another day. ## Commit 4: The Result Set Loader Layer We now finally get to the ultimate goal of this PR: to introduce a new “result set loader.” This abstraction is an evolution of the “Mutator” class in the scan batch. (The mutator manages a batch; some readers used the existing “writers” with the mutator, others used different techniques such as working directly with native memory.) A result set loader loads a set of tuples (AKA records, rows) from any source (such as a record reader) into a set of record batches. The loader: * Divides records into batches based on a maximum row count or a maximum vector size, whichever occurs first. (Later revisions may limit overall batch size.) * Tracks the start and end of each batch. * Tracks the start and end of each row. * Provides column loaders to write each column value. * Handles overflow when a vector becomes full, but the client still must finish writing the current row. The original Mutator class divided up responsibilities: * The Mutator handled the entire record batch * An optional VectorContainerWriter writes each record The result set loader follows this same general pattern. * The result set loader handles the entire record batch (really, a series of batches that make up the entire result set: hence the name.) * The TupleLoader class provides per-tuple services which mostly consists of access to the column loaders. * A tuple schema defines the schema for the result set (see below.) To hide this complexity from the client, a ResultSetLoader interface defines the public API. Then, a `ResultSetLoaderImpl` class implements the interface with all the gory details. Separate classes handle each column, the result set schema, and so on. Recall that the overall goal of this PR is to handle memory fragmentation by limiting vectors to 16 MB in size. The Loader realizes that goal by silently handling vector overflow by moving the “overflow” row to a new “look-ahead” batch. The complete batch is then sent downstream, and the next read starts with the overflow row as the first row of the next batch. This version is a bit simpler than the previous one because this one leverages the model, metadata and accessor layers for much of the work. ### Loader API The actual loader API turns out to be quite simple: another important goal to allow third parties to more easily create custom “record readers.” For example: ``` void writeABatch() { RowSetLoader writer = ... while (writer.start()) { writer.scalar(0).setInt(10); writer.scalar(1).setString("foo"); ... writer.save(); } } ``` Note that, in this version, the `writer` above is defined in the result set loader, but the writer is simply a tuple writer defined in the accessor layer. By the time we get to the `setInt()` or `setString()` methods, the code does a single method call to get into position to write the data. This eliminates the extra “column loader” layer of the prior version of the result set loader. ### Writer and Loader Integration The column writers are low-level classes that interface between a consumer and a value vector. To create the tuple loader we need a way to bind the column writers to the result set loader. For this, we use a pair of listeners. * A tuple writer listener handles requests to add a new column at run time. * A column listener handles vector overflow. In both cases, the app works with the low-level accessors, but the listener allows events to be fed back to the loader which handles the complex details of dynamic schema evolution and of vector overflow (in fact, it can handle both at the same time.) To handle vector overflow, each “set” method: * Does a single bounds check to determine if the value is in the range of the current buffer. * If not, relocates the buffer, if possible. * If the vector is full (reached the size limit), calls the listener to handle overflow. * The listener, via the vector model described earlier, dispatches the request to the loader. * The loader handles overflow, creating a new vector. * The writer then writes the value into the new or existing buffer. ### Vector Overflow Logic The heart of this abstraction is that last point: the ability to detect when a vector overflows, switch in a new vector, and continue writing. Several tricks are involved. Suppose we have a row of five columns: a through e. The code writes a and b. Then, c overflows. The code can’t rewrite a and b. To handle this, the tuple loader: * Creates a new, small set of vectors called the “overflow batch” * Copies columns a and b from the current batch to the overflow batch. * Writes column c to the overflow batch. * Allows the client code to finish writing columns d and e (to the overflow batch). * Reports to the client that the batch is full. Note that the client is completely unaware that any of the above occurred: it just writes a row and asks if it can write another. ### Skipping Columns The loader must also handle a reader, such as Parquet, that skips columns if they are null. There were bugs in Drill’s vectors for this case and temporary patches were made in a number of places to make this work. The trick also should work for arrays (a null array is allowed, Drill represents it as an empty array.) But, this code also was broken. For good measure, the code now also allows skipping non-null columns if a good “empty” value is available: 0 for numbers, blank for strings. This behavior is needed for the CSV reader; if a line is missing a field, the CSV reader treats it as an empty (not null) field. ### Result Set Schema The tuple loader is designed to handle two kinds of tables: “early schema” (such as Parquet and CSV) define the schema up front. “Late schema” (such as JSON) discover the schema during reading. The tuple loader allows either form, and, in fact, uses the same mechanism. Consumer of batches will, of course, want to know that the schema changed. Providing a simple flag is muddy: when should it be reset? A better solution is to provide a schema version which is incremented each time a column is added. (Columns cannot be removed or changed — at least not yet.) ### Internal Plumbing The loader assembles the many pieces described earlier: * The vectors themselves * The writers (tuples, objects, arrays, scalars) * The vector model (tree) * Tuple and column metadata * The loader logic for handling overflow, etc. The solution is based on a strict separation of concerns via layering and loose coupling. * The vector model is the “backbone” that coordinates the other pieces. * Metadata provides instructions for building the vector model. * The vector model builds the corresponding vector and writer. * The writer listener sends events (see above) to the vector model. * The vector model dispatches events to the loader via a “coordinator” interface. * The loader performs batch, row and column operations via the vector model. Please see the extensive Javadoc description in the `org.apache.drill.exec.physical.rowSet` package, especially in the `package-info.java` file in the `impl` sub-package. ### Result Vector Cache Above we mentioned that the tuple loader allows schema changes on the fly. As the next PR will make more clear, downstream operators want a fixed set of vectors. To assist with this, the tuple loader uses a “result vector cache”. Let’s say a scanner reads two JSON files with the same schema. The first crates the schema and vectors. The second is obligated to use the same vectors. This is a royal pain. But, the vector cache does it automatically: when the tuple loader adds a new column, it checks if the vector already exists in the cache and reuses it. If not there, the cache adds it and returns it so that it is there the next time around. ### Unit Tests All of the above is pretty thoroughly tested via unit tests. In fact, the unit tests are a good place to start (now I tell you!) in order to see how client code uses the various abstractions. The bit of unit test structure that handled system options turned out to be wrong. Modified it to use the defaults defined in the system option manager, which required changing the visibility of the defaults table. ## Commit 5: Collateral Damage The last commit contains various other changes, mostly reflecting the changes above. Some unit tests were updated to use new features which become available in this PR. See TestFillEmpties and TestVectorLimits. The `equals()` method in BatchSchema is badly broken. Cleaned it up some. But, didn’t want to change it too much in case anything depends on the current, broken, semantics. So, added a new `isEquivalent` method to provide the correct semantics. Added an `isEquivalent()` method to the MaterializedField as well that will ignore the “implementation” columns that hang off of types such as nullables, repeated, etc. That is, two repeated columns are identical if their type is identical, regardless of whether one has the “$offsets” child or not. ## Open Issues This PR provides most of the required functionality. However, a few bits and pieces are still in progress and will appear in subsequent commits and/or PRs: * Full testing of repeated maps * Support for lists and repeated lists (will reuse structure from repeated maps) * Column projection (was in an earlier version, new version needed for this structure) You can merge this pull request into a Git repository by running: $ git pull https://github.com/paul-rogers/drill DRILL-5657-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/914.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #914 ---- commit 68c0520492fc053eb74a9f02e9fa3f01be53ab83 Author: Paul Rogers <prog...@maprtech.com> Date: 2017-08-18T05:41:30Z DRILL-5657: Size-aware vector writer structure Commit 1: The vector and accessor layer commit d2f7b2050764b8d4630158cd34dde0291377d3d7 Author: Paul Rogers <prog...@maprtech.com> Date: 2017-08-18T05:43:54Z Commit 2: Row Set layer Includes the test-oriented row set abstractions as well as unit tests for the accessor layer commit 7aff7a791f786db24853752c91031ee9121c8de3 Author: Paul Rogers <prog...@maprtech.com> Date: 2017-08-18T05:47:10Z Commit 3: The tuple and column models Also includes the implementation of the tuple and column metadata. commit 4ef94ffd15f0570398a4ab39c9bffc77bc777473 Author: Paul Rogers <prog...@maprtech.com> Date: 2017-08-18T05:48:00Z Commit 4: The "result set loader" layer commit cf2973278bc7aa29a5cf28fd320f84b08bd49743 Author: Paul Rogers <prog...@maprtech.com> Date: 2017-08-18T05:49:13Z Commit 5: Collateral damage Includes tests that had to change when the row set interface changed and a number of test cleanup items. ---- > Implement size-aware result set loader > -------------------------------------- > > Key: DRILL-5657 > URL: https://issues.apache.org/jira/browse/DRILL-5657 > Project: Apache Drill > Issue Type: Improvement > Affects Versions: Future > Reporter: Paul Rogers > Assignee: Paul Rogers > Fix For: Future > > > A recent extension to Drill's set of test tools created a "row set" > abstraction to allow us to create, and verify, record batches with very few > lines of code. Part of this work involved creating a set of "column > accessors" in the vector subsystem. Column readers provide a uniform API to > obtain data from columns (vectors), while column writers provide a uniform > writing interface. > DRILL-5211 discusses a set of changes to limit value vectors to 16 MB in size > (to avoid memory fragmentation due to Drill's two memory allocators.) The > column accessors have proven to be so useful that they will be the basis for > the new, size-aware writers used by Drill's record readers. > A step in that direction is to retrofit the column writers to use the > size-aware {{setScalar()}} and {{setArray()}} methods introduced in > DRILL-5517. > Since the test framework row set classes are (at present) the only consumer > of the accessors, those classes must also be updated with the changes. > This then allows us to add a new "row mutator" class that handles size-aware > vector writing, including the case in which a vector fills in the middle of a > row. -- This message was sent by Atlassian JIRA (v6.4.14#64029)