Hi All,

We've discussed quite a few times whether Drill should support, or even require, 
schemas, and if so, how the user might express them.

I came across a paper [1] that suggests a simple, elegant SQL extension:

EXTRACT <column>[:<type>] {,<column>[:<type>]}
FROM <stream_name>

Paraphrasing into Drill's SQL:

SELECT <column>[:<type>] [AS <alias>] {, <column>[:<type>] [AS <alias>]}
FROM <table_name>

Have a collection of JSON files in which string column `foo` appears in only 
half the files? Don't want schema conflicts between VARCHAR and the nullable 
INT that Drill infers for the files where the column is missing? Just do:

SELECT name:VARCHAR, age:INT, foo:VARCHAR
FROM `my-dir` ...

Not only can the syntax specify the "natural" type for a column, it might also 
specify a preferred type. For example, "age:INT" says that "age" is an INT, 
even though JSON would normally parse it as a BIGINT. Similarly, this syntax is 
an easy way to tell Drill how to convert CSV columns from strings to DATE, INT, 
FLOAT, etc. without the need for CAST functions. (CAST functions read the data 
in one format, then convert it to another in a Project operator. Carrying a 
column type would let the reader do the conversion itself -- something that is 
easy to implement if using the "result set loader" mechanism.)
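
To make the contrast concrete, here is a rough sketch of how today's CAST 
approach compares with the proposed syntax for a CSV file with extracted 
headers. (The file name is made up, and the second query is, of course, 
hypothetical since the syntax doesn't exist yet.)

```sql
-- Today: the reader emits every column as VARCHAR; the CASTs
-- run afterward in a Project operator.
SELECT CAST(name AS VARCHAR) AS name,
       CAST(age AS INT)      AS age
FROM `people.csv`;

-- Proposed: the reader itself could produce typed vectors,
-- with no Project step needed for the conversion.
SELECT name:VARCHAR, age:INT
FROM `people.csv`;
```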

Plus, the syntax fits nicely into the existing view file structure. If the 
types appear in views, then client tools can continue to use standard SQL 
without the type information.
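
For instance, if the proposed syntax were allowed inside a view definition, 
the type annotations would live in the view file while clients see only 
standard SQL. (A sketch; the view and table names here are invented.)

```sql
-- The view author pins down the types once.
CREATE VIEW dfs.tmp.`typed_people` AS
SELECT name:VARCHAR, age:INT FROM `my-dir`;

-- Client tools then issue plain SQL, unaware of the extension.
SELECT name, age FROM dfs.tmp.`typed_people`;
```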

When this idea came up in the past, someone mentioned the issue of nullable vs. 
non-nullable. (Let's also include arrays, since Drill supports those.) Maybe 
add a suffix to the name:

SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...

Not pretty, but works with the existing SQL syntax rules.

Obviously, Drill has much on its plate, so I'm not suggesting that Drill should 
do this soon. Just passing it along as yet another option to consider.

Thanks,
- Paul

[1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf
