Hi All,
We've discussed quite a few times whether Drill should or should not support or
require schemas, and if so, how the user might express the schema.
I came across a paper [1] that suggests a simple, elegant SQL extension:
EXTRACT <column>[:<type>] {,<column>[:<type>]}
FROM <stream_name>
Paraphrasing into Drill's SQL:
SELECT <column>[:<type>][AS <alias>] {,<column>[:<type>][AS <alias>]}
FROM <table_name>
Have a collection of JSON files in which string column `foo` appears in only
half the files? Don't want to get schema conflicts with VARCHAR and nullable
INT? Just do:
SELECT name:VARCHAR, age:INT, foo:VARCHAR
FROM `my-dir` ...
Not only can the syntax be used to specify the "natural" type for a column, it
might also specify a preferred type. For example, "age:INT" says that "age" is
an INT, even though JSON would normally parse it as a BIGINT. Similarly, this
syntax is an easy way to tell Drill how to convert CSV columns from strings
to DATE, INT, FLOAT, etc. without the need for CAST functions. (CAST functions
read the data in one format, then convert it to another in a Project operator.
Using a column type might let the reader do the conversion -- something that is
easy to implement if using the "result set loader" mechanism.)
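To make the contrast concrete, here is a sketch (file and column names are
hypothetical; the second query uses the proposed syntax, which Drill does not
support today):

```sql
-- Today: the CSV reader produces text; CAST converts in a Project operator
SELECT CAST(columns[0] AS VARCHAR) AS name,
       CAST(columns[1] AS INT)     AS age,
       CAST(columns[2] AS DATE)    AS hired
FROM `employees.csv`;

-- Proposed: the reader itself could produce the typed columns
SELECT name:VARCHAR, age:INT, hired:DATE
FROM `employees.csv`;
```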
Plus, the syntax fits nicely into the existing view file structure. If the
types appear in views, then client tools can continue to use standard SQL
without the type information.
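For example (names hypothetical, and again assuming the proposed syntax): the
type annotations could live once in the view definition, while clients query
the view with plain SQL:

```sql
-- View file captures the schema once (proposed syntax)
CREATE VIEW dfs.tmp.`customers_v` AS
SELECT name:VARCHAR, age:INT, balance:FLOAT
FROM dfs.`/data/customers`;

-- Client tools keep issuing standard SQL, no annotations needed
SELECT name, age FROM dfs.tmp.`customers_v` WHERE balance > 0;
```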
When this idea came up in the past, someone mentioned the issue of nullable vs.
non-nullable. (Let's also include arrays, since Drill supports those.) Maybe
add a suffix after the type:
SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...
Not pretty, but it works within the existing SQL syntax rules.
Obviously, Drill has much on its plate, so I'm not suggesting that Drill do
this soon. Just passing it along as yet another option to consider.
Thanks,
- Paul
[1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf