I've thought of declaring the schema the way it was done with the pcap reader. It seems to use a DrillTable to define its schema. However, I like the idea of a .schema file at the root of the table. I tried what you suggested (looking at the explain plan), but I failed to see what you were pointing me at.
Yes, I realize the JSON reader is complex. I actually used both the JSON reader and the BSON (Mongo) reader: all the dfs-related code I took from JSON, and the actual parsing from BSON. I think I have something fairly complete:
https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack

I'm running it with real data now and I'm uncovering edge cases, like the issue where, as you say, Drill can't see into the future. But as I said, I think having a .schema file would be a very good idea. How would I best leverage such a file?

Thank you very much
jc

On Mon, Oct 1, 2018 at 9:51 PM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> One of Drill's challenges is that it cannot predict the future: it can't
> know what type your column will be in later records or in another file.
> All it knows is the here and now: the type of the columns it actually
> reads. This is the flip side of "schema-less".
>
> The only real solution to this problem is for the user to tell Drill the
> column type. The team is looking at several options for doing this. [1]
>
> If you are creating your own reader, you can perhaps come up with an
> alternative solution. A quick check of the MessagePack spec [2] reveals
> that each field is prefixed by a type code. This resolves the INT/DOUBLE
> problems in JSON. But it does not help if you don't actually see an
> instance of the field: you still have to get that type information from
> somewhere.
>
> There is a JSON Schema spec [3] which could be useful. You could, for
> example, load it at plan time and pass it to your readers as part of the
> physical plan. (If you poke around the physical plan, in JSON form,
> returned from EXPLAIN PLAN, you'll see what I mean.) You might require,
> say, that the user put a ".schema" file in the root directory of your
> table.
>
> While the JSON reader is one way to create a reader, it is probably the
> most complex example around. I'd suggest you look at the recently added
> log reader [4] for a simpler way to get started. We also dedicate a
> chapter to format plugins in the soon-to-be-published book Learning
> Apache Drill (available in preview from Safari).
>
> There is also a not-yet-committed revised version [5] of the JSON reader
> that uses state objects instead of complex if-statements to keep track
> of parse states. That implementation is made more complex, however, by
> the need to deal with all the ambiguities that occur in JSON without a
> schema. Feel free to borrow ideas if helpful.
>
> Thanks,
> - Paul
>
> [1] https://issues.apache.org/jira/browse/DRILL-6552
> [2] https://github.com/msgpack/msgpack/blob/master/spec.md
> [3] JSON Schema
> [4] https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log
> [5] https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
>
> On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote <[email protected]> wrote:
>
> I'm implementing a msgpack reader and used the JSON reader as
> inspiration. I've noticed that the JSON reader has code to detect when
> rows were written but no columns were created; in that case it actually
> adds one empty column of type INT. The comment in the code is "if we had
> no columns, create one empty one so we can return some data for count
> purposes".
>
> Is this what the proposal above would solve? The JSON reader would not
> have to guess the data type of the selected fields.
> Right now I get failures in my msgpack reader because the schema is
> guessed to be an INT, but later the reader encounters a value that is a
> VARCHAR.
>
> I'm wondering why the JSON reader even writes a row if no fields were
> found in the batch. Is there something I can do better than the JSON
> reader, or is this a limitation of the Drill engine?
>
> Thanks
> jc
>
> On Mon, Oct 1, 2018 at 1:54 PM Arina Yelchiyeva <[email protected]> wrote:
>
> > Currently Calcite supports the following syntax, apparently used in
> > Phoenix:
> > *select empno + x from EMP_MODIFIABLEVIEW extend (x int not null)*
> >
> > Another option to consider is hint syntax (many DBs use this one);
> > basically it's a multiline comment followed by a plus:
> > *select /*+.....*/ col_name from t*
> > This would allow us to pass not only the schema but join / index
> > hints, etc.
> >
> > Example:
> > *select /*+ SCHEMA(a int not null, b int) */ a from t*
> >
> > One minus: we would need to implement this first in Calcite, if the
> > Calcite community is in favor of such changes.
> >
> > Kind regards,
> > Arina
> >
> > On Mon, Sep 10, 2018 at 7:42 AM Paul Rogers <[email protected]> wrote:
> >
> > > Hi Weijie,
> > >
> > > Thanks for the paper pointer. F1 uses the same syntax as Scope (the
> > > system cited in my earlier note): the data type after the name.
> > >
> > > Another description is [1]. Neither paper describes how F1 handles
> > > arrays. However, this second paper points out that Protobuf is F1's
> > > native format, and so F1 has support for nested types. Drill does
> > > also, but in Drill a reference to "customer.phone.cell" causes the
> > > nested "cell" column to be projected as a top-level column. And
> > > neither paper says whether F1 is used with O/JDBC, and if so, how
> > > they handle the mapping from nested types to the flat tuple
> > > structure required by xDBC.
> > >
> > > Have you come across these details?
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <[email protected]> wrote:
> > >
> > > Google's latest paper about F1 [1] claims to support any data source
> > > by using an extension API called TVF (see section 6.3). It also
> > > requires declaring column data types before the query.
> > >
> > > [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf
> > >
> > > On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > We've discussed quite a few times whether Drill should or should
> > > > not support or require schemas, and if so, how the user might
> > > > express the schema.
> > > >
> > > > I came across a paper [1] that suggests a simple, elegant SQL
> > > > extension:
> > > >
> > > > EXTRACT <column>[:<type>] {,<column>[:<type>]}
> > > > FROM <stream_name>
> > > >
> > > > Paraphrasing into Drill's SQL:
> > > >
> > > > SELECT <column>[:<type>] [AS <alias>] {,<column>[:<type>] [AS <alias>]}
> > > > FROM <table_name>
> > > >
> > > > Have a collection of JSON files in which string column `foo`
> > > > appears in only half the files? Don't want schema conflicts
> > > > between VARCHAR and nullable INT? Just do:
> > > >
> > > > SELECT name:VARCHAR, age:INT, foo:VARCHAR
> > > > FROM `my-dir` ...
> > > >
> > > > Not only can the syntax be used to specify the "natural" type for
> > > > a column, it might also specify a preferred type.
> "age:INT" > > > says > > > > that "age" is an INT, even though JSON would normally parse it as a > > > BIGINT. > > > > Similarly, using this syntax is a easy way to tell Drill how to > convert > > > CSV > > > > columns from strings to DATE, INT, FLOAT, etc. without the need for > > CAST > > > > functions. (CAST functions read the data in one format, then convert > it > > > to > > > > another in a Project operator. Using a column type might let the > reader > > > do > > > > the conversion -- something that is easy to implement if using the > > > "result > > > > set loader" mechanism.) > > > > > > > > Plus, the syntax fits nicely into the existing view file structure. > If > > > the > > > > types appear in views, then client tools can continue to use standard > > SQL > > > > without the type information. > > > > > > > > When this idea came up in the past, someone mentioned the issue of > > > > nullable vs. non-nullable. (Let's also include arrays, since Drill > > > supports > > > > that. Maybe add a suffix to the the name: > > > > > > > > SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ... > > > > > > > > Not pretty, but works with the existing SQL syntax rules. > > > > > > > > Obviously, Drill has much on its plate, so not suggestion that Drill > > > > should do this soon. Just passing it along as yet another option to > > > > consider. > > > > > > > > Thanks, > > > > - Paul > > > > > > > > [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf > > > > > >

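As for how a ".schema" file might be leveraged: below is one possible shape, sketched under the assumption that the file sits in the table's root directory and that its text is then attached to the format config or group scan, both of which are Jackson-serialized into the physical plan (which is why they are visible in EXPLAIN PLAN output). The class and the file-name convention are hypothetical, not an existing Drill API:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemaFileLoader {
      // Hypothetical convention: a ".schema" file at the root of the table.
      private static final String SCHEMA_FILE_NAME = ".schema";

      // At plan time, return the schema text if the user provided one,
      // else null. The caller could stash the text on the Jackson-serialized
      // format config or group scan so that it rides along in the physical
      // plan down to each reader.
      public static String loadSchema(FileSystem fs, Path tableRoot) throws IOException {
        Path schemaPath = new Path(tableRoot, SCHEMA_FILE_NAME);
        if (!fs.exists(schemaPath)) {
          return null; // no schema file: fall back to type inference
        }
        try (InputStream in = fs.open(schemaPath)) {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          byte[] buf = new byte[4096];
          int n;
          while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
          }
          return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
      }
    }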