Hello Drill devs,
I would like to propose a proactive effort to make the Drill codebase
easier to unit test.
Many JIRAs have been created for bugs that should have been prevented by
better unit testing, and we are still fixing these kinds of bugs today as
they crop up. I have a few ideas, and I plan on creating JIRAs for specific
refactoring and test infrastructure improvements. Before I do, I would like
to collect thoughts from everyone on what can get us the most benefit for
our work.
As a short overview of the situation today, most of the tests in Drill take
the form of running a SQL query on a local drillbit and verifying the
results. This has often been described as closer to integration testing
than unit testing, and it has caused several common testing pains and
gaps.
1. Batch boundaries - because we cannot control where batches are cut off
during a query, complete queries make it hard to test how an operator
processes an incoming stream of data with given properties.
   - examples of issues: inconsistent behavior between operators; some
     operators have failed to handle empty batches, or a batch full of
     nulls, until we wrote a test that happened to have the right input
     file and plan to produce these scenarios
2. Valid planning changes can make tests that were designed to test
execution fail in new ways, as the data now flows differently through the
operators.
3. SQL queries as test specifications make it hard to test "everything":
all types, all possible data properties/structures, and all possible
switches flipped in the planner or configuration for an operator.
I would like to start the discussion with a proposal to fix some of these
problems. We need a way to run an operator easily in isolation. One possible
step toward this is a new operator that produces data in explicitly provided
batches and can be configured from a test. This can serve as a universal
input for unit testing operators. We would also need some way to consume and
verify the output of the operators. This could share code with the current
query execution, or possibly sidestep it to avoid having to mock or
instantiate the whole query context.
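
To make the idea concrete, here is a minimal sketch in plain Java. None of
these names exist in Drill: BatchSource, ScriptedBatchSource,
DropNullsOperator and OperatorTestHarness are all hypothetical, and a
List<Integer> stands in for a real record batch. The point is only that the
test, not the planner or an input file, decides where the batch boundaries
fall, including an empty batch and a batch full of nulls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an operator's upstream (not a Drill interface).
// Returns the next batch, or null once the stream is exhausted.
interface BatchSource {
    List<Integer> nextBatch();
}

// Test-only source that replays exactly the batches it was given.
final class ScriptedBatchSource implements BatchSource {
    private final Iterator<List<Integer>> batches;

    ScriptedBatchSource(List<List<Integer>> batches) {
        this.batches = batches.iterator();
    }

    @Override
    public List<Integer> nextBatch() {
        return batches.hasNext() ? batches.next() : null;
    }
}

// Toy operator under test: drops nulls, one output batch per input batch.
final class DropNullsOperator implements BatchSource {
    private final BatchSource upstream;

    DropNullsOperator(BatchSource upstream) {
        this.upstream = upstream;
    }

    @Override
    public List<Integer> nextBatch() {
        List<Integer> in = upstream.nextBatch();
        if (in == null) {
            return null; // upstream exhausted
        }
        List<Integer> out = new ArrayList<>();
        for (Integer v : in) {
            if (v != null) {
                out.add(v);
            }
        }
        return out;
    }
}

final class OperatorTestHarness {
    // Drains an operator and flattens all of its output batches into one list.
    static List<Integer> drain(BatchSource op) {
        List<Integer> rows = new ArrayList<>();
        for (List<Integer> b = op.nextBatch(); b != null; b = op.nextBatch()) {
            rows.addAll(b);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<Integer>> input = Arrays.asList(
            Arrays.asList(1, 2),
            new ArrayList<Integer>(),                        // empty batch
            Arrays.asList((Integer) null, (Integer) null),   // all-null batch
            Arrays.asList(3, null, 4));
        List<Integer> result =
            drain(new DropNullsOperator(new ScriptedBatchSource(input)));
        if (!result.equals(Arrays.asList(1, 2, 3, 4))) {
            throw new AssertionError("unexpected output: " + result);
        }
        System.out.println(result); // prints [1, 2, 3, 4]
    }
}
```

Feeding the same rows through as a single batch should give the same
flattened output, which is a cheap way to check that an operator does not
depend on where the boundaries happen to land. A real version would of
course need to deal with schemas and value vectors rather than lists.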
This proposal itself still treats a relatively large part of the system as
a single "unit". I would be interested to hear opinions on the utility vs
extra effort of trying to refactor more classes so that they can be created
in tests and have their individual methods tested. This is already being
done for some classes like the value vectors, but it is far from
exhaustive. I don't expect us to start rigidly enforcing this level of
testing granularity everywhere, but there are components of the system that
really need to be resilient and be guaranteed to stay that way as the
project evolves.
Please chime in with your thoughts.