Hi All,
Drill 1.16 introduced the "provided schema" mechanism to help you query the
kind of messy files found in the real world. Arina and Bridget created nice
documentation [1] for the feature. Sorabh presented the feature at the recent
Drill Meetup. If you are a plugin developer, we need your help to expand the
feature to other plugins.
To understand the need for the feature, it helps to remember that there are two
popular ways to query data in a distributed file system (DFS) such as Hadoop:
direct query or ETL.
Most major query engines require an ETL step: use Hive or Spark to transform
your data into a standard format such as Parquet or ORC. Then, use a tool such
as Drill (or Impala, Presto, Hive LLAP, Snowflake, Big Query, etc.) to query
the data. The ETL approach works well, but it has a cost: you must maintain
multiple copies of the data, manage an ETL pipeline, and so on. This cost is
justified if your users query the data frequently, as in the classic "data
warehouse" use case.
There are other use cases (such as log analysis, data exploration, data
science) where the benefit of the two-step ETL process is less clear. These use
cases are better served by directly querying your "raw" data. Here your choices
are mostly Drill, Spark or the original Hive. Although Spark is very powerful,
Drill is far easier to use for tasks that can be expressed in SQL using your
favorite BI tool.
Drill's "schema on read" (AKA "schemaless") approach allows Drill to read a
data file directly: just point Drill at a file and immediately run queries on
that file. However, we've seen over the years that files can be messy or
ambiguous. Let's look at the two most common problems and how the provided
schema solves them: schema evolution and ambiguous data.
Schema evolution occurs when, say, a table started with two columns (a, b) and
newer versions added a third column, c. If you query SELECT a, b, c
FROM ..., Drill has to guess a type for column c in the old files (those
without the column). Drill generally guesses Nullable Int. But if the column is
actually
VarChar, then a schema conflict (AKA "hard schema change") will occur and your
query may fail. With a provided schema, you can tell Drill that column "c" is a
VarChar, and even provide a default value. Now, Drill knows what to do for
files without column "c".
Another kind of ambiguity occurs when Drill attempts to guess a data type by
looking at the first few rows of a file. The classic JSON example is a
two-record file: {a: 10} {a: 10.1} -- a column starts as an INT, but then we
want to store FLOAT data into it, causing an error. With a hint, the user can
just declare the column as FLOAT, avoiding the ambiguity.
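Once the JSON reader gains provided-schema support (exactly the kind of plugin
conversion this note asks for), the fix could look something like this sketch
(the table name is hypothetical):

    -- Pin down the type up front so {a: 10} is read as FLOAT rather
    -- than guessed as INT from the first record.
    CREATE OR REPLACE SCHEMA (a FLOAT)
    FOR TABLE dfs.data.`events_json`;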
The "provided schema" feature solves these problems by supplying hints about
how to interpret a file. The feature avoids the heavyweight cost of the Hive
metastore (HMS) that is used by Hive, Impala and Presto. Instead, the schema is
a simple file stored directly in the DFS alongside your data.
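Concretely, CREATE SCHEMA writes a small hidden file (.drill.schema) into the
root of the table directory by default, and per [1] you can inspect or remove
the stored schema with plain SQL:

    -- Show the schema stored alongside the data
    DESCRIBE SCHEMA FOR TABLE dfs.example.`myTable`;

    -- Remove the schema when it is no longer needed
    DROP SCHEMA FOR TABLE dfs.example.`myTable`;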
You can enable schema support in a plugin by using the new "enhanced vector
framework" (EVF), AKA the "row set framework" or the "new scan framework".
This framework was originally developed to control reader memory use by
limiting batch and vector size and to minimize vector memory fragmentation.
Solving those problems turned out to also provide the machinery needed to
support a provided schema.
We are actively working to prepare the EVF for your use. We are converting the
Log (regex) format plugin and preparing a tutorial based on that conversion.
(The log reader was the basis of the format plugin chapter of the Learning
Apache Drill book, so it is a good choice for the EVF tutorial.)
If you are a user, please try out the feature on text files and let us know how
it works for you. That way, we can address any issues before we convert the
other plugins.
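To make that concrete, here is a minimal way to try the feature on a directory
of CSV files (a sketch: the option shown enables the schema-aware "V3" text
reader in 1.16, the workspace, directory and columns are placeholders, and [1]
explains how schema columns map onto text files):

    ALTER SESSION SET `exec.storage.enable_v3_text_reader` = true;

    CREATE OR REPLACE SCHEMA (name VARCHAR, age INT)
    FOR TABLE dfs.example.`people`;

    SELECT name, age FROM dfs.example.`people`;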
Thanks,
- Paul
[1] https://drill.apache.org/docs/create-or-replace-schema/