Paul Rogers created DRILL-7553:
----------------------------------

Summary: Modernize type management
Key: DRILL-7553
URL: https://issues.apache.org/jira/browse/DRILL-7553
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers

This is a roll-up issue for our ongoing discussion around improving and modernizing Drill's runtime type system. At present, Drill approaches types very differently from most other DB and query tools:

* Drill does little (or no) plan-time type checking and propagation. Instead, all type management is done at execution time: in each reader, in each operator, and ultimately in the client.
* Drill allows structured types (Map, Dict, Arrays), but does not have the extended SQL statements to fully utilize these types.
* Drill supports varying types: two readers can both read column {{c}}, but can do so with different types. We've always hoped to discover some way to reconcile the types. But, at present, the functionality is buggy and incomplete. It is not clear that a viable solution exists. Drill also provides "formal" varying types: Union and List. These types are also not fully supported.

These three topics are closely related. "Schema-free" means we must infer types at read time, and so Drill cannot do plan-time type analysis of the kind done in other engines. Because of schema-on-read (which is what "schema-free" really means), two readers can read different types for the same fields, and so we end up with varying or inconsistent types, and are forced to figure out some way to manage the conflicts.

The gist of the proposal explored in this ticket is to exploit the learning from other engines: to embrace types when available, and to impose tractable rules when types are discovered at run time.

h4. Proposal Summary

This is very much a discussion draft. Here are some suggestions to get started.

# Set as our goal to manage types at plan time. Runtime type discovery becomes a (limited) special case.
# Pull type resolution, propagation and checking into the planner where it can be done once per query. Move it out of execution where it must be done multiple times: once per operator per minor fragment. Implement the standard DB type checking and propagation rules.
(These rules are currently implemented implicitly, deep in the code generation code.)
# Generate operator code in the planner; send it to workers as part of the physical plan (to avoid the need to generate the code on each worker).
# Provide schema-aware extensions for storage and format plugins so that they can advertise a schema when known. (Examples: Hive sources get schemas from HMS; JDBC sources get schemas from the underlying database; Avro, Parquet and others obtain schemas from the target files; etc.) This mechanism works with, but is in addition to, the Drill metastore.
# Separate the concepts of "schema-free" (no plan-time schema) from "schema-on-read" (schema is known in the planner, and data is read into that schema by readers; e.g. the Hive model). Drill remains schema-on-read (for sources that need it), but does not attempt the impossible with schema-free (that is, we no longer read inconsistent data into a relational model and hope we can make it work).
# For convenience, allow "schema-free" (no plan-time schema). The restriction is that all readers *must* produce the same schema. It is a fatal (to the query) error for an operator to receive batches with different schemas. (The reasons can be discussed separately.)
# Preserve the Map, Dict and Array types, but with tighter semantics: all elements must be of the same type.
# Replace the Union and List types with a new type: Java objects. Java objects can be anything and can vary from row to row. Java types are processed using UDFs (or Drill functions).
# All "extended" types (complex: Map, Dict and Array, or Java objects) must be reduced to primitive types in a top-level tuple if the client is ODBC (which cannot handle non-relational types). The same is true if the destination is a simple sink such as CSV or JDBC.
# Provide a light-weight way to resolve schema ambiguities that are identified by the new, stricter type rules.
The light-weight solution is either a file or some kind of simple Drill-managed registry akin to the plugin registry. Users can run a query, see if there are conflicting types, and, if so, add a resolution rule to the registry. The user then reruns the query with a clean result.

In the past couple of years we have made progress in some of these areas. This ticket suggests we bring those threads together in a coherent strategy.

h4. Arrow/Java/Fixed Block/Something Else Storage

The ideas here are independent of choices we might make for our internal data representation format. The above design works equally well with either Drill or Arrow vectors, or with something else entirely.
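
To make the "all readers must produce the same schema" rule and the resolution-rule registry from the proposal summary a bit more concrete, here is a rough sketch in Python. It is illustrative only and is not Drill code: the names {{Schema}}, {{SchemaConflictError}}, {{apply_rules}} and {{check_consistent}} are hypothetical, schemas are modeled as plain column-name-to-type-name dictionaries, and the "registry" is assumed to be nothing more than a mapping from column name to the type to impose.

```python
# Hypothetical sketch only; none of these names exist in Drill.
from typing import Dict, List, Optional

# A schema is modeled as column name -> type name, e.g. {"c": "INT"}.
Schema = Dict[str, str]


class SchemaConflictError(Exception):
    """Fatal (to the query) error: readers produced different schemas."""

    def __init__(self, conflicts: List[str]):
        super().__init__("conflicting column types: " + ", ".join(conflicts))
        self.conflicts = conflicts


def apply_rules(schema: Schema, rules: Dict[str, str]) -> Schema:
    """Impose registry rules: a rule overrides the type a reader discovered."""
    return {column: rules.get(column, col_type)
            for column, col_type in schema.items()}


def check_consistent(reader_schemas: List[Schema],
                     rules: Optional[Dict[str, str]] = None) -> Schema:
    """Return the single agreed schema, or raise if the readers disagree."""
    rules = rules or {}
    resolved = [apply_rules(schema, rules) for schema in reader_schemas]
    if not resolved:
        return {}
    expected = resolved[0]
    for schema in resolved[1:]:
        if schema != expected:
            columns = sorted(set(expected) | set(schema))
            conflicts = [c for c in columns
                         if expected.get(c) != schema.get(c)]
            raise SchemaConflictError(conflicts)
    return expected


if __name__ == "__main__":
    # Two readers see column "c" with different types.
    readers = [{"c": "INT"}, {"c": "VARCHAR"}]
    try:
        check_consistent(readers)
    except SchemaConflictError as e:
        print("query failed:", e)
    # The user adds a resolution rule and reruns the query.
    print(check_consistent(readers, rules={"c": "VARCHAR"}))
```

This mirrors the workflow described above: the first run fails and names the conflicting column, the user adds a rule for it, and the rerun yields one clean schema. How a real rule would be expressed, and where the check would sit (planner versus operator), are open questions for this ticket.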