Hi JC,

One of Drill's challenges is that it cannot predict the future: it can't know what type your column will be in later records or in another file. All it knows is the here and now: the types of the columns it actually reads. This is the flip side of "schema-less".

The only real solution to this problem is for the user to tell Drill the column type. The team is looking at several options for doing this. [1]

If you are creating your own reader, you can perhaps come up with an alternative solution. A quick check of the MessagePack spec [2] reveals that each field is prefixed by a type code. This resolves the INT/DOUBLE problems seen in JSON. But it does not help if you never actually see an instance of the field: you still have to get that type information from somewhere.
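Here is a rough, untested sketch of the kind of type mapping I mean, assuming the msgpack-java library (org.msgpack:msgpack-core); if you use a different decoder, the idea is the same: peek at the format byte before unpacking the value.

import java.io.IOException;
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.msgpack.core.MessageFormat;
import org.msgpack.core.MessageUnpacker;

public class MsgpackTypeMapper {
  // Peek at the next value's format byte (without consuming the value)
  // and map it to a Drill type before calling the matching unpackXxx().
  public static MinorType drillTypeOf(MessageUnpacker unpacker) throws IOException {
    MessageFormat format = unpacker.getNextFormat();
    switch (format.getValueType()) {
      case INTEGER: return MinorType.BIGINT;
      case FLOAT:   return MinorType.FLOAT8;
      case STRING:  return MinorType.VARCHAR;
      case BINARY:  return MinorType.VARBINARY;
      case BOOLEAN: return MinorType.BIT;
      case MAP:     return MinorType.MAP;
      default:
        throw new UnsupportedOperationException("Unsupported msgpack type: " + format);
    }
  }
}

Because getNextFormat() does not advance past the value, you can also use it to check a value's actual type against a declared schema before writing to a vector.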
There is a JSON Schema spec [3] which could be useful. You could, for example, load the schema at plan time and pass it to your readers as part of the physical plan. (If you poke around the physical plan, in JSON form, returned by EXPLAIN PLAN, you'll see what I mean.) You might require, say, that the user put a ".schema" file in the root directory of your table.
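For concreteness, such a schema file might look like the following. (The ".schema" file name and the use of JSON Schema here are just an illustration of the idea, not an existing Drill convention.)

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age":  { "type": "integer" }
  },
  "required": ["name"]
}

Your reader would consult this at open time, so "age" could be created as an INT column even in files where the field never appears.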
While the JSON reader is one way to create a reader, it is probably the most complex example around. I'd suggest you look at the recently-added log reader [4] for a simpler way to get started. We also dedicate a chapter to format plugins in the soon-to-be-published book Learning Apache Drill (available in preview from Safari).

There is also a not-yet-committed revised version [5] of the JSON reader that uses state objects instead of complex if-statements to keep track of parse states. That implementation is made more complex, however, by the need to deal with all the ambiguities that occur in JSON without a schema. Feel free to borrow ideas if helpful.
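To give a flavor of the state-object idea: each parse state becomes a small class that consumes one token and returns the next state, so the token loop stays trivial and each tricky JSON case gets its own class instead of another nested if. This is only a sketch of the pattern, using Jackson's JsonToken as the JSON reader does; it is not the actual code in [5].

import com.fasterxml.jackson.core.JsonToken;

// Each parse state consumes one token and returns the next state.
interface JsonState {
  JsonState accept(JsonToken token);
}

// Between rows: wait for the start of the next top-level object.
class BetweenRowsState implements JsonState {
  public JsonState accept(JsonToken token) {
    return token == JsonToken.START_OBJECT ? new RowState() : this;
  }
}

// Inside a row object: a field name hands off to a field state.
class RowState implements JsonState {
  public JsonState accept(JsonToken token) {
    switch (token) {
      case FIELD_NAME: return new FieldState(this);
      case END_OBJECT: return new BetweenRowsState();
      default: throw new IllegalStateException("Unexpected token: " + token);
    }
  }
}

// Inside a field: write the value, then return to the enclosing row.
class FieldState implements JsonState {
  private final RowState row;
  FieldState(RowState row) { this.row = row; }
  public JsonState accept(JsonToken token) {
    // A real reader would write the scalar/array/map value here.
    return row;
  }
}

The driver is then just state = state.accept(parser.nextToken()) in a loop over a Jackson JsonParser, until the parser is exhausted.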
Thanks,
- Paul

[1] https://issues.apache.org/jira/browse/DRILL-6552
[2] https://github.com/msgpack/msgpack/blob/master/spec.md
[3] JSON Schema, https://json-schema.org/
[4] https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log
[5] https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json

On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote <[email protected]> wrote:

I'm implementing a msgpack reader and am using the JSON reader as inspiration. I've noticed that when the JSON reader detects that rows were written but no columns were, it actually adds one column of type INT. The comment in the code is "if we had no columns, create one empty one so we can return some data for count purposes". Is this what the proposal above would solve? The JSON reader would not have to guess the data type of the selected fields.

Right now I get failures in my msgpack reader because the schema is guessed to be an INT but later it encounters a value and it is a VARCHAR. I'm wondering why the JSON reader even writes a row if no fields were found in the batch. Is there something I can do better than the JSON reader, or is this a limitation of the Drill engine?

Thanks
jc

On Mon, Oct 1, 2018 at 1:54 PM Arina Yelchiyeva <[email protected]> wrote:

> Currently Calcite supports the following syntax, apparently used in
> Phoenix:
> *select empno + x from EMP_MODIFIABLEVIEW extend (x int not null)*
>
> Another option to consider is hint syntax (many DBs use this one);
> basically, it's a multiline comment followed by a plus:
> *select /*+.....*/ col_name from t*
> This would allow us to pass not only schema but join / index hints, etc.
>
> Example:
> *select /*+ SCHEMA(a int not null, b int) */ a from t*
>
> One minus: we would need to implement this first in Calcite, if the
> Calcite community is in favor of such changes.
>
> Kind regards,
> Arina
>
> On Mon, Sep 10, 2018 at 7:42 AM Paul Rogers <[email protected]> wrote:
>
> > Hi Weijie,
> >
> > Thanks for the paper pointer. F1 uses the same syntax as Scope (the
> > system cited in my earlier note): data type after the name.
> >
> > Another description is [1]. Neither paper describes how F1 handles
> > arrays. However, this second paper points out that Protobuf is F1's
> > native format, and so F1 has support for nested types. Drill does
> > also, but in Drill, a reference to "customer.phone.cell" causes the
> > nested "cell" column to be projected as a top-level column. And
> > neither paper says whether F1 is used with O/JDBC and, if so, how
> > they handle the mapping from nested types to the flat tuple structure
> > required by xDBC.
> >
> > Have you come across these details?
> >
> > Thanks,
> > - Paul
> >
> > On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <
> > [email protected]> wrote:
> >
> > Google's latest paper about F1 [1] claims to support any data source
> > by using an extension API called TVF; see section 6.3. It also
> > requires column data types to be declared before the query.
> >
> > [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf
> >
> > On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi All,
> > >
> > > We've discussed quite a few times whether Drill should or should not
> > > support or require schemas, and if so, how the user might express
> > > the schema.
> > >
> > > I came across a paper [1] that suggests a simple, elegant SQL
> > > extension:
> > >
> > > EXTRACT <column>[:<type>] {,<column>[:<type>]}
> > > FROM <stream_name>
> > >
> > > Paraphrasing into Drill's SQL:
> > >
> > > SELECT <column>[:<type>][AS <alias>] {,<column>[:<type>][AS <alias>]}
> > > FROM <table_name>
> > >
> > > Have a collection of JSON files in which string column `foo` appears
> > > in only half the files? Don't want to get schema conflicts with
> > > VARCHAR and nullable INT? Just do:
> > >
> > > SELECT name:VARCHAR, age:INT, foo:VARCHAR
> > > FROM `my-dir` ...
> > >
> > > Not only can the syntax be used to specify the "natural" type for a
> > > column, it might also specify a preferred type. For example,
> > > "age:INT" says that "age" is an INT, even though JSON would normally
> > > parse it as a BIGINT. Similarly, using this syntax is an easy way to
> > > tell Drill how to convert CSV columns from strings to DATE, INT,
> > > FLOAT, etc. without the need for CAST functions. (CAST functions
> > > read the data in one format, then convert it to another in a Project
> > > operator. Using a column type might let the reader do the conversion
> > > -- something that is easy to implement if using the "result set
> > > loader" mechanism.)
> > >
> > > Plus, the syntax fits nicely into the existing view file structure.
> > > If the types appear in views, then client tools can continue to use
> > > standard SQL without the type information.
> > >
> > > When this idea came up in the past, someone mentioned the issue of
> > > nullable vs. non-nullable. (Let's also include arrays, since Drill
> > > supports that.) Maybe add a suffix to the name:
> > >
> > > SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...
> > >
> > > Not pretty, but works with the existing SQL syntax rules.
> > >
> > > Obviously, Drill has much on its plate, so I'm not suggesting that
> > > Drill should do this soon. Just passing it along as yet another
> > > option to consider.
> > >
> > > Thanks,
> > > - Paul
> > >
> > > [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf
