Hi All,

Drill 1.16 introduced the "provided schema" mechanism to help you query the 
kind of messy files found in the real world. Arina and Bridget created nice 
documentation [1] for the feature. Sorabh presented the feature at the recent 
Drill Meetup. If you are a plugin developer, we need your help to expand the 
feature to other plugins.


To understand the need for the feature, it helps to remember that there are two 
popular ways to query data in a distributed file system (DFS) such as Hadoop: 
direct query or ETL.

Most major query engines require an ETL step: use Hive or Spark to transform 
your data into a standard format such as Parquet or ORC. Then, use a tool such 
as Drill (or Impala, Presto, Hive LLAP, Snowflake, Big Query, etc.) to query 
the data. The ETL approach works well, but it has a cost: you must maintain 
multiple copies of the data, manage an ETL pipeline, and so on. This cost is 
justified if your users query the data frequently, as in the classic "data 
warehouse" use case.


There are other use cases (such as log analysis, data exploration, data 
science) where the benefit of the two-step ETL process is less clear. These use 
cases are better served by directly querying your "raw" data. Here your choices 
are mostly Drill, Spark or the original Hive. Although Spark is very powerful, 
Drill is far easier to use for tasks that can be expressed in SQL using your 
favorite BI tool.


Drill's "schema on read" (AKA "schemaless") approach allows Drill to read a 
data file directly: just point Drill at a file and immediately run queries on 
that file. However, we've seen over the years that files can be messy or 
ambiguous. Let's look at the two most common problems and how the provided 
schema solves them: schema evolution and ambiguous data.

Schema evolution occurs when, say, a table started with two columns (a, b), 
and newer files added a third column, c. If you query SELECT a, b, c 
FROM ..., Drill has to guess a type for column c in the old files (which 
lack the column). Drill generally guesses Nullable Int. But if the column is 
actually VarChar, then a schema conflict (AKA a "hard schema change") will 
occur and your query may fail. With a provided schema, you can tell Drill 
that column "c" is a VarChar, and even provide a default value. Now Drill 
knows what to do for files without column "c".
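
For example, here is a sketch of the kind of schema you might provide (the 
table name, column types and default value are invented for illustration; 
see [1] for the full syntax):

    CREATE OR REPLACE SCHEMA
    (a INT, b INT, c VARCHAR NOT NULL DEFAULT 'unknown')
    FOR TABLE dfs.example.`my_table`;

With this in place, files that lack column "c" report the default value 
instead of forcing Drill to guess a type.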

Another kind of ambiguity occurs when Drill attempts to guess a data type by 
looking at the first few rows of a file. The classic JSON example is a 
two-record file: {a: 10} {a: 10.1} -- the column starts as an INT, but then 
we must store FLOAT data in it, causing an error. With a hint, the user can 
simply declare the column as FLOAT, avoiding the ambiguity.
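
In Drill 1.16 the feature targets text files, so the following is purely 
illustrative, but a hint for the JSON case above might someday look like:

    CREATE OR REPLACE SCHEMA (a FLOAT)
    FOR TABLE dfs.example.`two_record_file`;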


The "provided schema" feature solves these problems by supplying hints about 
how interpret a file. The feature avoids heavy-weight cost of the Hive 
metastore (HMS) that is used by Hive, Impala and Presto. Instead, the schema is 
a simple file stored directly in the DFS alongside your data.
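
The documentation [1] describes the details. Once you create a schema, you 
can inspect it with a query such as (table name invented for illustration):

    DESCRIBE SCHEMA FOR TABLE dfs.example.`my_table`;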


You can enable schema support in a plugin by using the new "enhanced vector 
framework" (EVF), AKA the "row set framework" or the "new scan framework". 
This framework was originally developed to control reader memory use by 
limiting batch and vector size and to minimize vector memory fragmentation. 
Solving those problems turned out to also provide the machinery needed to 
support a provided schema.

We are actively working to prepare the EVF for your use. We are converting the 
Log (regex) format plugin and preparing a tutorial based on that conversion. 
(The log reader was the basis of the format plugin chapter of the Learning 
Apache Drill book, so it is a good choice for the EVF tutorial.)

If you are a user, please try out the feature on text files and let us know how 
it works for you. That way, we can address any issues before we convert the 
other plugins.
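
Here is a rough quick-start sketch for text files (double-check the option 
and table names; [1] has the authoritative syntax and the options needed to 
enable the preview feature):

    ALTER SESSION SET `store.table.use_schema_file` = true;
    ALTER SESSION SET `exec.storage.enable_v3_text_reader` = true;
    CREATE OR REPLACE SCHEMA (a INT, b VARCHAR)
    FOR TABLE dfs.example.`csv_data`;
    SELECT a, b FROM dfs.example.`csv_data`;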


Thanks,
- Paul

[1] https://drill.apache.org/docs/create-or-replace-schema/
