I've been looking at the pcap reader https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap. It has a simple PcapDrillTable with a hard-coded schema. I thought of using it as inspiration, but I wonder: would DESCRIBE work with the way it is implemented? Or, if I want DESCRIBE to work, do I need to implement it at the planner level?
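For reference, the pattern I mean looks roughly like this (a minimal Calcite-style sketch of a hard-coded schema; the class and column names are illustrative, not the actual pcap code):

    import org.apache.calcite.rel.type.RelDataType;
    import org.apache.calcite.rel.type.RelDataTypeFactory;
    import org.apache.calcite.schema.impl.AbstractTable;
    import org.apache.calcite.sql.type.SqlTypeName;

    // A table whose schema is fixed at plan time. Because getRowType()
    // is available to the planner, metadata queries such as DESCRIBE
    // could be answered without ever opening a reader.
    public class FixedSchemaTable extends AbstractTable {
      @Override
      public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        return typeFactory.builder()
            .add("src_ip", SqlTypeName.VARCHAR)
            .add("dst_ip", SqlTypeName.VARCHAR)
            .add("packet_length", SqlTypeName.INTEGER)
            .build();
      }
    }

If I read this right, DESCRIBE would be answered from a planner-side row type like this one, which is what makes me suspect a schema known only inside the reader would not be enough.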
Thanks Paul
jc

On Tue, Oct 2, 2018 at 12:54 PM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Now that you have a working reader, sounds like your next task is to pass column schema to the reader. There are two ways to do that. There are sketchy examples of both. First, the reader itself can obtain the schema (as is done in Parquet.) Second, the planner can pass the schema to the reader as part of the physical plan, as is sort of done for the new log reader.
>
> Let's start with the reader side first, since that is easiest to test. When scanning a directory, the physical plan includes the path to the base directory. (I'm sure you've seen it; if not, let me know and I'll track it down.) Each reader can use that path to infer the location of the (optional) .schema file, then read the schema and proceed from there. In this model, the reader has to handle columns requested by the query which are not in the schema. Go ahead and create them of type nullable INT.
>
> (By the way, you can name the file anything you like. If the file starts with a dot or underscore, Drill will ignore it when looking for data files, which can be a handy way to prevent your schema file from being treated as data.)
>
> Having each of possibly hundreds of readers read the same file with a remote read is less than optimal, however. You probably want to read the file once. You can do that by creating custom plan nodes using Calcite. I'm not an expert in this area, but you can look at the existing code for, say, HBase or other readers for the classes you need to create. Others on this list might be able to provide more detailed hints.
>
> A bonus of planner-side metadata is that you can fail queries that request columns which do not exist. (Or, you can keep the usual technique of making up a nullable INT column in the reader.)
>
> Next you need to pass the schema information from the planner to each reader. You do that via the physical plan. When you looked at the EXPLAIN PLAN output, in JSON, you should have seen the JSON form of the "easy" file scan. You'd create your own Jackson-serializable class to use for your reader in place of the "easy" version. You'd then pass the schema in that custom class. (Again, you seem familiar with the code, so I've not spelled out the details. Let me know if you get stuck.)
>
> Finally, I'd draw your attention to several features that the log reader illustrates. One is handling of SELECT all/some/none projection. Another is using "column state" classes to perform type-specific handling instead of a big (slow) switch statement. That new JSON reader I mentioned takes the idea a step further and has the column state classes handle data translation, vector writing and parse state handling.
>
> I hope this helps.
>
> Thanks,
> - Paul
>
>
> On Monday, October 1, 2018, 7:49:50 PM PDT, Jean-Claude Cote <[email protected]> wrote:
>
> I've thought of declaring a schema the way it was done with the pcap reader. It seems to use a DrillTable to define its schema. However, I like the idea of a .schema file at the root of the table. I tried what you suggested (looked at the explain plan); however, I failed to see what you were pointing me at.
>
> Yes, I realize the JSON reader is complex. I actually used the JSON reader and the BSON (mongo) reader. So all the dfs-related stuff I took from JSON and the actual parsing from BSON. I think I have something fairly complete: https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
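To make the reader-side option above concrete, here is a rough sketch of what I have in mind (the ".schema" file name, its JSON layout, and the helper class are all assumptions on my part, not an existing Drill API):

    import java.io.InputStream;
    import java.util.Collections;
    import java.util.Map;
    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemaFileLoader {
      private static final ObjectMapper MAPPER = new ObjectMapper();

      // Look for an optional ".schema" file in the scan's base directory.
      // The leading dot keeps Drill from treating it as a data file.
      // Assumed content, mapping column names to type names:
      //   { "name": "VARCHAR", "age": "INT" }
      public static Map<String, String> load(FileSystem fs, Path baseDir)
          throws Exception {
        Path schemaFile = new Path(baseDir, ".schema");
        if (!fs.exists(schemaFile)) {
          return Collections.emptyMap(); // no schema: fall back to inference
        }
        try (InputStream in = fs.open(schemaFile)) {
          return MAPPER.readValue(in,
              new TypeReference<Map<String, String>>() {});
        }
      }
    }

A requested column that is missing from this map would then get the nullable INT treatment described above.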
>
> I'm running it with real data now and I'm uncovering edge cases, like the issue where, as you say, Drill can't see into the future. But like I said, I think having a .schema file would be a very good idea.
>
> How would I best leverage such a file?
> Thank you very much
> jc
>
>
> On Mon, Oct 1, 2018 at 9:51 PM Paul Rogers <[email protected]> wrote:
>
> > Hi JC,
> >
> > One of Drill's challenges is that it cannot predict the future: it can't know what type your column will be in later records or in another file. All it knows is the here and now: the type of columns it actually reads. This is the flip side of "schema-less".
> >
> > The only real solution to this problem is for the user to tell Drill the column type. The team is looking at several options for doing this. [1]
> >
> > If you are creating your own reader, you can perhaps come up with an alternative solution. A quick check of the MessagePack spec [2] reveals that each field is prefixed by a type code. This resolves the INT/DOUBLE problems in JSON. But it does not help if you don't actually see an instance of the field: you still have to get that type information from somewhere.
> >
> > There is a JSON Schema spec [3] which could be useful. You could, for example, load it at plan time and pass it to your readers as part of the physical plan. (If you poke around the physical plan, in JSON form, returned from EXPLAIN PLAN, you'll see what I mean.) You might require, say, that the user put a ".schema" file in the root directory of your table.
> >
> > While the JSON reader is one way to create a reader, it is probably the most complex example around. I'd suggest you look at the recently-added log reader [4] for a simpler way to get started. We also dedicate a chapter to format plugins in the soon-to-be-published book Learning Apache Drill (available in preview from Safari.)
> >
> > There is also a not-yet-committed revised version [5] of the JSON reader that uses state objects instead of complex if-statements to keep track of parse states. This implementation is made more complex, however, because of the need to deal with all the ambiguities that occur in JSON without a schema. Feel free to borrow ideas if helpful.
> >
> > Thanks,
> > - Paul
> >
> > [1] https://issues.apache.org/jira/browse/DRILL-6552
> > [2] https://github.com/msgpack/msgpack/blob/master/spec.md
> > [3] JSON Schema
> > [4] https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log
> > [5] https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> >
> >
> > On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote <[email protected]> wrote:
> >
> > I'm implementing a msgpack reader, using the JSON reader as inspiration. I've noticed that in the JSON reader there's code to detect if rows were written but with no columns; in that case it will actually add one empty column of type INT. The comment in the code is "if we had no columns, create one empty one so we can return some data for count purposes".
> >
> > Is this what the proposal above would solve? The JSON reader would not have to guess the data type of the selected fields. Right now I get failures in my msgpack reader because the schema is guessed to be an INT but later it encounters a value and it is a VARCHAR.
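On the type-code point above: with the msgpack-java library a reader can peek at the format byte before materializing a value, so INT vs. VARCHAR is never a guess for a field that actually appears. A small self-contained sketch (my own illustration, not code from the plugin):

    import org.msgpack.core.MessageBufferPacker;
    import org.msgpack.core.MessagePack;
    import org.msgpack.core.MessageUnpacker;
    import org.msgpack.value.ValueType;

    public class TypePeek {
      public static void main(String[] args) throws Exception {
        // Build a tiny msgpack payload: {"age": 42}
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(1).packString("age").packInt(42);
        packer.close();

        MessageUnpacker unpacker =
            MessagePack.newDefaultUnpacker(packer.toByteArray());
        unpacker.unpackMapHeader(); // one entry
        unpacker.unpackString();    // key: "age"

        // The format byte tells us the type before we read the value,
        // so the reader can pick the right vector writer up front.
        ValueType type = unpacker.getNextFormat().getValueType();
        switch (type) {
          case INTEGER: System.out.println("INT: " + unpacker.unpackLong()); break;
          case FLOAT:   System.out.println("DOUBLE: " + unpacker.unpackDouble()); break;
          case STRING:  System.out.println("VARCHAR: " + unpacker.unpackString()); break;
          default:      unpacker.skipValue();
        }
        unpacker.close();
      }
    }

As noted above, though, this only settles the type of fields the reader actually sees; an absent field still needs its type from somewhere else, such as the .schema file.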
> >
> > I'm wondering why the JSON reader actually even writes a row if no fields were found in the batch. Is there something I can do better than the JSON reader, or is this a limitation of the Drill engine?
> >
> > Thanks
> > jc
> >
> > On Mon, Oct 1, 2018 at 1:54 PM Arina Yelchiyeva <[email protected]> wrote:
> >
> > > Currently Calcite supports the following syntax, apparently used in Phoenix:
> > > *select empno + x from EMP_MODIFIABLEVIEW extend (x int not null)*
> > >
> > > Another option to consider is hint syntax (many DBs use this one); basically it's a multiline comment followed by a plus:
> > > *select /*+.....*/ col_name from t*
> > > This would allow us to pass not only schema but join / index hints etc.
> > >
> > > Example:
> > > *select /*+ SCHEMA(a int not null, b int) */ a from t*
> > >
> > > One minus: we would need to implement this first in Calcite, if the Calcite community would be in favor of such changes.
> > >
> > > Kind regards,
> > > Arina
> > >
> > > On Mon, Sep 10, 2018 at 7:42 AM Paul Rogers <[email protected]> wrote:
> > >
> > > > Hi Weijie,
> > > >
> > > > Thanks for the paper pointer. F1 uses the same syntax as Scope (the system cited in my earlier note): data type after the name.
> > > >
> > > > Another description is [1]. Neither paper describes how F1 handles arrays. However, this second paper points out that Protobuf is F1's native format, and so F1 has support for nested types. Drill does also, but in Drill, a reference to "customer.phone.cell" causes the nested "cell" column to be projected as a top-level column. And neither paper says whether F1 is used with O/JDBC, and if so, how they handle the mapping from nested types to the flat tuple structure required by xDBC.
> > > >
> > > > Have you come across these details?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > > On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <[email protected]> wrote:
> > > >
> > > > Google's latest paper about F1 [1] claims to support any data source by using an extension API called TVF; see section 6.3. It also requires declaring column data types before the query.
> > > >
> > > > [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf
> > > >
> > > > On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We've discussed quite a few times whether Drill should or should not support or require schemas, and if so, how the user might express the schema.
> > > > >
> > > > > I came across a paper [1] that suggests a simple, elegant SQL extension:
> > > > >
> > > > > EXTRACT <column>[:<type>] {,<column>[:<type>]}
> > > > > FROM <stream_name>
> > > > >
> > > > > Paraphrasing into Drill's SQL:
> > > > >
> > > > > SELECT <column>[:<type>][AS <alias>] {,<column>[:<type>][AS <alias>]}
> > > > > FROM <table_name>
> > > > >
> > > > > Have a collection of JSON files in which string column `foo` appears in only half the files? Don't want to get schema conflicts with VARCHAR and nullable INT? Just do:
> > > > >
> > > > > SELECT name:VARCHAR, age:INT, foo:VARCHAR
> > > > > FROM `my-dir` ...
> > > > >
> > > > > Not only can the syntax be used to specify the "natural" type for a column, it might also specify a preferred type.
> > "age:INT" > > > > says > > > > > that "age" is an INT, even though JSON would normally parse it as a > > > > BIGINT. > > > > > Similarly, using this syntax is a easy way to tell Drill how to > > convert > > > > CSV > > > > > columns from strings to DATE, INT, FLOAT, etc. without the need for > > > CAST > > > > > functions. (CAST functions read the data in one format, then > convert > > it > > > > to > > > > > another in a Project operator. Using a column type might let the > > reader > > > > do > > > > > the conversion -- something that is easy to implement if using the > > > > "result > > > > > set loader" mechanism.) > > > > > > > > > > Plus, the syntax fits nicely into the existing view file structure. > > If > > > > the > > > > > types appear in views, then client tools can continue to use > standard > > > SQL > > > > > without the type information. > > > > > > > > > > When this idea came up in the past, someone mentioned the issue of > > > > > nullable vs. non-nullable. (Let's also include arrays, since Drill > > > > supports > > > > > that. Maybe add a suffix to the the name: > > > > > > > > > > SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ... > > > > > > > > > > Not pretty, but works with the existing SQL syntax rules. > > > > > > > > > > Obviously, Drill has much on its plate, so not suggestion that > Drill > > > > > should do this soon. Just passing it along as yet another option to > > > > > consider. > > > > > > > > > > Thanks, > > > > > - Paul > > > > > > > > > > [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf > > > > > > > > > >
