Hi All,

Here is the next installment in the “batch size control” project update.

Drill has a great many operators. As we move forward, we must update them to 
use the new batch size control framework. Unit testing becomes a major concern. 
This note explains how we address that issue in this project.

The “classic” way to test Drill is to build the product, fire up the Drill 
server, and use Sqlline to fire off queries. The problem, of course, is that
the edit-compile-debug cycle is glacially slow (five minutes), and testing is
manual (copy/paste the query into Sqlline, visually inspect the results).

An alternative is to run the very same query as a JUnit test. Drill
has many such tests. The “BaseTestQuery” framework and “TestBuilder” help. The 
newish “Cluster Framework” makes it very easy to start an embedded Drillbit 
with the desired options and settings, run a query, and examine the results. 
The edit-compile-debug cycle is much faster, on the order of 10-20 seconds.

This is good, but we still run the entire Drill operator stack and throw 
queries at it. We use a file for input and capture query results as output. 
But we want much finer-grained testing. That is, we want true unit testing: 
isolate a component, feed it some input, and verify its output.

A fact of life in Drill is that operators are tightly coupled to the fragment 
context, which is coupled to the Drillbit context, which needs the entire 
server. What to do? One solution is to use mocks, and, indeed, Drill has three 
mock-based approaches: JMockit, Mockito, and Jinfeng’s handy new “Mini-Plan” 
framework.

Mocks are handy, but it is cleaner and simpler to have code that can be tested 
in isolation without them. The next step is the “sub-operator” test framework: 
the “RowSet” utilities and the “context” refactoring break the tight coupling 
with the rest of Drill, allowing us to separate out an operator (after some 
simple changes to the code) and test it in isolation. We can now easily pump 
in a wide variety of inputs (such as Drill’s 30+ data types in the three 
cardinalities) without setting up a lot of overhead for each.

Still, many operators are internally complex, and poking at them from the 
outside is limiting. We want to test not just, say, the sort operator as a 
whole, but also the bit of code that does the in-memory sort, or the one that 
writes batches to disk. To do this, we must “disaggregate” each operator into 
a series of separately-testable components, each with a clear API.
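To make the idea concrete, here is a minimal sketch of what such a 
disaggregated component might look like: the in-memory sort step pulled out 
behind its own narrow API, so a unit test can feed it batches and check the 
output directly. All names here are invented for illustration; they are not 
Drill’s actual classes, and real batches are vectors, not integer lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the in-memory sort step as a standalone component.
// Input batches are modeled as integer lists to keep the example
// self-contained.
public class InMemorySorter {

  // Accumulated input batches.
  private final List<List<Integer>> batches = new ArrayList<>();

  // The operator shell feeds each incoming batch to the component.
  public void addBatch(List<Integer> batch) {
    batches.add(new ArrayList<>(batch));
  }

  // The component produces the sorted output; this is the bit we can
  // now exercise directly in a unit test, no Drillbit required.
  public List<Integer> sortAll() {
    List<Integer> merged = new ArrayList<>();
    for (List<Integer> batch : batches) {
      merged.addAll(batch);
    }
    Collections.sort(merged);
    return merged;
  }

  public static void main(String[] args) {
    InMemorySorter sorter = new InMemorySorter();
    sorter.addBatch(List.of(5, 3, 9));
    sorter.addBatch(List.of(1, 7));
    System.out.println(sorter.sortAll()); // [1, 3, 5, 7, 9]
  }
}
```

A spill-to-disk component would get the same treatment: its own class, its 
own API, its own tests.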

This kind of refactoring is practical only for new operators, or when we need 
to make major changes to an existing operator. As part of the “batch size 
control” project, we have created a new version of the scan operator using 
this model.

Refactoring the scan operator pointed to an opportunity to refactor the core 
operator code itself. Each operator has three responsibilities:

* Implement the Drill iterator protocol.
* Hold a record batch.
* Implement the operator algorithm itself.

The next “batch size” PR will provide a new version of the base operator class 
that splits these responsibilities: one class for each of the first two items, 
and an interface for the third. This lets us unit test the two classes once 
and for all; per operator, the focus is then just the algorithm implementation.
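Roughly, the shape is the following, shown here as a toy sketch. The names 
(`OperatorExec`, `OperatorShell`, and the stand-in iterator protocol) are 
invented for this note and are not the classes the PR will actually introduce.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the responsibility split described above.
public class OperatorShellSketch {

  // Third responsibility: the operator algorithm, behind an interface.
  // This is the only part each operator must implement, and it can be
  // unit tested on its own, with no Drillbit and no mocks.
  interface OperatorExec<T> {
    boolean next();   // advance to the next output batch
    T batch();        // current output batch
    void close();
  }

  // First two responsibilities: the iterator protocol and batch holding,
  // implemented once in a shell that wraps any OperatorExec.
  static class OperatorShell<T> {
    private final OperatorExec<T> exec;
    OperatorShell(OperatorExec<T> exec) { this.exec = exec; }

    // A stand-in for Drill's IterOutcome-style protocol: return the
    // next batch, or null for "done".
    T nextBatch() {
      return exec.next() ? exec.batch() : null;
    }
    void close() { exec.close(); }
  }

  // A trivial algorithm implementation used to exercise the shell.
  static class EchoExec implements OperatorExec<String> {
    private final Iterator<String> input;
    private String current;
    EchoExec(List<String> input) { this.input = input.iterator(); }
    @Override public boolean next() {
      if (!input.hasNext()) { return false; }
      current = input.next();
      return true;
    }
    @Override public String batch() { return current; }
    @Override public void close() { }
  }

  public static void main(String[] args) {
    OperatorShell<String> shell =
        new OperatorShell<>(new EchoExec(List.of("a", "b")));
    System.out.println(shell.nextBatch()); // a
    System.out.println(shell.nextBatch()); // b
    System.out.println(shell.nextBatch()); // null
    shell.close();
  }
}
```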

The core operator algorithm implementation is designed to be loosely coupled to 
the rest of Drill, allowing complete unit testing without mocks. The scan 
operator revision, which we’ll describe in the next note, makes use of this 
structure.

Thanks,

- Paul
