Hi Weijie,
Thanks for raising this topic. I think you've got a great suggestion.
My two cents: there is no harm in reading all manner of ugly data. But, rather
than try to process the mess throughout Drill (as we do today with schema
changes, just-in-time code generation, union vectors and the rest), simply
require that the user provide disambiguation rules prior to running the query.
Require that the reader convert input data into a clean relational format.
That is, rather than try to be schema-less, just be schema-on-read, which, as
you point out, is what most users probably want anyway.
JSON is my favorite example of the current issues. JSON is a general-purpose
data format. SQL is a relational format. IMHO, if JDBC and ODBC are Drill's
primary APIs, data must fit those APIs, and that means data must be relational.
So, the goal when querying JSON is to map the JSON structure into a relational
structure.
Since Drill now supports arrays of nested tuples via implicit joins,
"relational format" here means the extended relational format. Perhaps a
requirement is that arrays MUST be flattened to simple tuples prior to
returning results to JDBC or ODBC.
If we provide schema hints ("field x, when it appears, will be a Double"), then
Drill need not fight with the problem that the data will be Double in some
files, Nullable Int in others. The schema hint removes the ambiguity.
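To make this concrete, here is a rough sketch of what such a hint could look
like in code, using the SchemaBuilder from the result set loader work (the
exact package and method names are still in flux, so treat this as
illustrative only):

    import org.apache.drill.common.types.TypeProtos.MinorType;
    import org.apache.drill.exec.record.metadata.SchemaBuilder;
    import org.apache.drill.exec.record.metadata.TupleMetadata;

    public class SchemaHintSketch {
      public static void main(String[] args) {
        // "Field x, when it appears, will be a nullable Double."
        TupleMetadata hint = new SchemaBuilder()
            .addNullable("x", MinorType.FLOAT8)
            .buildSchema();
        System.out.println(hint);
      }
    }

A reader handed this hint can create the Double vector up front, no matter
what the first file happens to contain.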
The result is that Drill can still read any kind of data, yet, as you say, the
Drill internals become simpler, cleaner and faster. There is no ambiguity about types
that Drill kinda-sorta supports but that ODBC/JDBC don't support. Effort can go
into new features rather than fighting an unwinnable battle to use
non-relational data in a relational engine.
In short, as part of the metadata API work, perhaps define how metadata can
simplify Drill internals. Ensure that users can create simple metadata hint
files for ad-hoc use, maybe as an extension to view files. Then, push the
problem of messy schemas into a conversion layer in the reader and out of the
rest of the Drill execution engine.
The reason I keep jumping into these metadata discussions is that I encountered
the ambiguity problems first-hand in the "result set loader" work. There are
ambiguities in JSON that simply cannot be resolved until Drill can predict the
future (by having access to metadata).
Consider the query "SELECT a, x FROM foo.json". Column x is missing from the
first 1000 records. In the 1001st record, it shows up and is a Double.
Previously, Drill would guess Nullable Int for the missing column, then blow up
when x appeared as a Double. A revision was to postpone picking a type as late
as possible so that, if column x does show up in the first batch, the ambiguity
is avoided. But this trick does not work if the column shows up in the second or
a later batch: we must ship some vector for x in the first batch, so we guess
Nullable Int. Depending on the query, this results in a schema change elsewhere
in the DAG or in the client.
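To spell out the mechanics, here is a toy simulation of the trap (plain Java,
no Drill dependencies; the types are just strings for illustration):

    public class LateColumnDemo {
      public static void main(String[] args) {
        // Batch 1: records 1-1000, column x never appears. The scanner must
        // still ship a vector for x, so it guesses with no evidence:
        String guessedType = "NULLABLE INT";

        // Batch 2: record 1001 arrives as {x: 10.1}.
        String observedType = "DOUBLE";

        if (!observedType.equals(guessedType)) {
          // Everything downstream (sort, join, the JDBC client) now sees a
          // mid-query schema change.
          System.out.println("Schema change: " + guessedType
              + " -> " + observedType);
        }
      }
    }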
Similarly, in the new version, JSON can handle data of the form {x: 10.1} {x:
10} because it can convert an Int to a Double. Sadly, {x: 10} {x: 10.1} still
fails because Drill can't convert a Double to an Int. Sigh...
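The asymmetry is the same one Java itself imposes: widening is lossless,
narrowing is not. A minimal illustration (not Drill code):

    public class WideningDemo {
      public static void main(String[] args) {
        long intValue = 10;                 // {x: 10}
        double widened = intValue;          // implicit widening: exact, safe

        double doubleValue = 10.1;          // {x: 10.1}
        long narrowed = (long) doubleValue; // narrowing: silently drops the .1

        System.out.println(widened + " vs " + narrowed); // 10.0 vs 10
      }
    }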
To implement these tricks, the revised reader framework accepts a schema and
automagically does the work of converting data from the input type to the
defined type. Once this code is in master (we are still about three PRs
away), it can be combined with the metadata system to achieve the "apply schema
on read" idea discussed above. We would then have no need for horrible hacks
like the ones just described.
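As a sketch of the conversion idea (all names here are hypothetical; the real
framework is still working its way into master):

    // The JSON parser produces an integer token, but the declared schema says
    // the column is a Double. A small shim converts at write time.
    interface ScalarWriter {
      void setDouble(double value);
    }

    class IntToDoubleConverter {
      private final ScalarWriter out;

      IntToDoubleConverter(ScalarWriter out) { this.out = out; }

      // Called for each integer token. Widening to double is lossless, so
      // {x: 10} can safely fill a Double column.
      void setInt(long value) { out.setDouble((double) value); }
    }

With the metadata system supplying the declared type, the same shim handles
the {x: 10} {x: 10.1} case without any guessing.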
Thanks,
- Paul
On Wednesday, August 15, 2018, 5:02:08 PM PDT, weijie tong
<[email protected]> wrote:
Hi all:
I hope this statement does not seem too rash.
Drill claims to be a schema-free distributed SQL engine. It takes a lot of
work to make the execution engine support schema-free storage formats such as
JSON files, and that work makes bugs easier to introduce and leaves the code
logic ugly. I wonder whether we should still insist on this, since we are
designing the metadata system in DRILL-6552.
Traditionally, people design their table schemas before firing a SQL query. I
don't think going schema-free saves people much time; other systems such as
Spark are popular, and not because they lack schema declarations. I think we
should be brave enough to make the right decision about whether to keep a
feature which seems not so important, yet is a burden.
Thanks.