Hi All,

Drill 1.16 introduced the "provided schema" mechanism to help you query the kind of messy files found in the real world. Arina and Bridget created nice documentation [1] for the feature. Sorabh presented the feature at the recent Drill Meetup. If you are a plugin developer, we need your help to expand the feature to other plugins.
To understand the need for the feature, it helps to remember that there are two popular ways to query data in a distributed file system (DFS) such as Hadoop: direct query or ETL. Most major query engines require an ETL step: use Hive or Spark to transform your data into a standard format such as Parquet or ORC, then use a tool such as Drill (or Impala, Presto, Hive LLAP, Snowflake, BigQuery, etc.) to query the data. The ETL approach works well, but it has a cost: you must maintain multiple copies of the data, manage an ETL pipeline, and so on. This cost is justified if your users query the data frequently, as in the classic "data warehouse" use case.

There are other use cases (such as log analysis, data exploration, and data science) where the benefit of the two-step ETL process is less clear. These use cases are better served by directly querying your "raw" data. Here your choices are mostly Drill, Spark, or the original Hive. Although Spark is very powerful, Drill is far easier to use for tasks that can be expressed in SQL from your favorite BI tool. Drill's "schema on read" (AKA "schemaless") approach lets Drill read a data file directly: just point Drill at a file and immediately run queries against it.

However, we've seen over the years that files can be messy or ambiguous. Let's look at the two most common problems and how the provided schema solves them: schema evolution and ambiguous data.

Schema evolution occurs when, say, a table started with two columns (a, b), then newer files added a third column (say, c). If you query SELECT a, b, c FROM ..., Drill has to guess a type for column c in the old files (which lack the column). Drill generally guesses Nullable Int. But if the column is actually VarChar, a schema conflict (AKA "hard schema change") occurs and your query may fail. With a provided schema, you can tell Drill that column "c" is a VarChar, and even provide a default value. Now Drill knows what to do for files without column "c". (A sketch of such a schema appears at the end of this note.)

Another kind of ambiguity occurs when Drill tries to guess a data type by looking at the first few rows of a file. The classic JSON example is a two-record file: {a: 10} {a: 10.1}. The column starts as an INT, but then we want to store FLOAT data in it, causing an error. With a hint, the user can simply declare the column as FLOAT, avoiding the ambiguity.

The "provided schema" feature solves these problems by supplying hints about how to interpret a file. The feature avoids the heavyweight cost of the Hive metastore (HMS) used by Hive, Impala, and Presto. Instead, the schema is a simple file stored directly in the DFS alongside your data.

You can enable schema support in a plugin by using the new "enhanced vector framework" (EVF), AKA the "row set framework" or the "new scan framework". This framework was originally developed to control reader memory use by limiting batch and vector sizes and to minimize vector memory fragmentation. Solving those problems turned out to provide exactly the machinery needed to support a provided schema.

We are actively working to prepare the EVF for your use. We are converting the Log (regex) format plugin and preparing a tutorial based on that conversion. (The log reader was the basis of the format plugin chapter of the Learning Apache Drill book, so it is a good choice for the EVF tutorial.)

If you are a user, please try out the feature on text files and let us know how it works for you. That way, we can address any issues before we convert the other plugins.
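To make the schema-evolution fix concrete, here is a rough sketch of what a provided schema might look like for that example (the table name, path, and default value are made up for illustration; see [1] for the exact syntax and any session options you may need to enable first):

    -- Declare the full, current schema, including the newer column "c".
    -- For old files that lack "c", Drill fills in the default value
    -- instead of guessing Nullable Int.
    CREATE OR REPLACE SCHEMA
        (a INT NOT NULL, b VARCHAR, c VARCHAR NOT NULL DEFAULT 'unknown')
    FOR TABLE dfs.tmp.`my_table`;

With the schema in place, SELECT a, b, c FROM dfs.tmp.`my_table` should work across old and new files alike, and DESCRIBE SCHEMA FOR TABLE dfs.tmp.`my_table` shows the schema Drill stored alongside the data.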
Thanks,
- Paul

[1] https://drill.apache.org/docs/create-or-replace-schema/