I've been looking at the pcap reader https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap. It has a simple PcapDrillTable with a hard-coded schema. I thought of using it as inspiration, but I wonder: would DESCRIBE work with the way it is implemented? Or, if I want DESCRIBE to work, do I need to implement it at the planner level?
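For reference, the pattern I mean looks roughly like this (a minimal Calcite-style sketch of a hard-coded schema; the class and column names are illustrative, not the actual pcap code):

    import org.apache.calcite.rel.type.RelDataType;
    import org.apache.calcite.rel.type.RelDataTypeFactory;
    import org.apache.calcite.schema.impl.AbstractTable;
    import org.apache.calcite.sql.type.SqlTypeName;

    // A table whose schema is fixed at plan time. Because getRowType()
    // is available to the planner, metadata queries such as DESCRIBE
    // could be answered without ever opening a reader.
    public class FixedSchemaTable extends AbstractTable {
      @Override
      public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        return typeFactory.builder()
            .add("src_ip", SqlTypeName.VARCHAR)
            .add("dst_ip", SqlTypeName.VARCHAR)
            .add("packet_length", SqlTypeName.INTEGER)
            .build();
      }
    }

If I read this right, DESCRIBE would be answered from a planner-side row type like this one, which is what makes me suspect a schema known only inside the reader would not be enough.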
Thanks Paul
jc

On Tue, Oct 2, 2018 at 12:54 PM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Now that you have a working reader, sounds like your next task is to pass column schema to the reader. There are two ways to do that. There are sketchy examples of both. First, the reader itself can obtain the schema (as is done in Parquet.) Second, the planner can pass the schema to the reader as part of the physical plan, as is sort of done for the new log reader.
>
> Let's start with the reader side first, since that is easiest to test. When scanning a directory, the physical plan includes the path to the base directory. (I'm sure you've seen it; if not, let me know and I'll track it down.) Each reader can use that path to infer the location of the (optional) .schema file, then read the schema and proceed from there. In this model, the reader has to handle columns requested by the query which are not in the schema. Go ahead and create them of type nullable INT.
>
> (By the way, you can name the file anything you like. If the file starts with a dot or underscore, Drill will ignore it when looking for data files, which can be a handy way to prevent your schema file from being treated as data.)
>
> Having each of possibly hundreds of readers read the same file with a remote read is less than optimal, however. You probably want to read the file once. You can do that by creating custom plan nodes using Calcite. I'm not an expert in this area, but you can look at the existing code for, say, HBase or other readers for the classes you need to create. Others on this list might be able to provide more detailed hints.
>
> A bonus of planner-side metadata is that you can fail queries that request columns which do not exist. (Or, you can keep the usual technique of making up a nullable INT column in the reader.)
>
> Next you need to pass the schema information from the planner to each reader. You do that via the physical plan. When you looked at the EXPLAIN PLAN output, in JSON, you should have seen the JSON form of the "easy" file scan. You'd create your own Jackson-serializable class to use for your reader in place of the "easy" version. You'd then pass the schema in that custom class. (Again, you seem familiar with the code, so I've not spelled out the details. Let me know if you get stuck.)
>
> Finally, I'd draw your attention to several features that the log reader illustrates. One is handling of SELECT all/some/none projection. Another is using "column state" classes to perform type-specific handling instead of a big (slow) switch statement. That new JSON reader I mentioned takes the idea a step further and has the column state classes handle data translation, vector writing and parse state handling.
>
> I hope this helps.
>
> Thanks,
> - Paul
>
>
> On Monday, October 1, 2018, 7:49:50 PM PDT, Jean-Claude Cote <[email protected]> wrote:
>
> I've thought of declaring a schema the way it was done with the pcap reader. It seems to use a DrillTable to define its schema. However, I like the idea of a .schema file at the root of the table. I tried what you suggested (looked at the explain plan); however, I failed to see what you were pointing me at.
>
> Yes, I realize the JSON reader is complex. I actually used the JSON reader and the BSON (mongo) reader. So all the dfs-related stuff I took from JSON and the actual parsing from BSON. I think I have something fairly complete: https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack.
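To make the reader-side option above concrete, here is a rough sketch of what I have in mind (the ".schema" file name, its JSON layout, and the helper class are all assumptions on my part, not an existing Drill API):

    import java.io.InputStream;
    import java.util.Collections;
    import java.util.Map;
    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemaFileLoader {
      private static final ObjectMapper MAPPER = new ObjectMapper();

      // Look for an optional ".schema" file in the scan's base directory.
      // The leading dot keeps Drill from treating it as a data file.
      // Assumed content, mapping column names to type names:
      //   { "name": "VARCHAR", "age": "INT" }
      public static Map<String, String> load(FileSystem fs, Path baseDir)
          throws Exception {
        Path schemaFile = new Path(baseDir, ".schema");
        if (!fs.exists(schemaFile)) {
          return Collections.emptyMap(); // no schema: fall back to inference
        }
        try (InputStream in = fs.open(schemaFile)) {
          return MAPPER.readValue(in,
              new TypeReference<Map<String, String>>() {});
        }
      }
    }

A requested column that is missing from this map would then get the nullable INT treatment described above.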
>
> I'm running it with real data now and I'm uncovering edge cases, like the issue where, as you say, Drill can't see into the future. But like I said, I think having a .schema file would be a very good idea.
>
> How would I best leverage such a file?
> Thank you very much
> jc
>
>
> On Mon, Oct 1, 2018 at 9:51 PM Paul Rogers <[email protected]> wrote:
>
> > Hi JC,
> >
> > One of Drill's challenges is that it cannot predict the future: it can't know what type your column will be in later records or in another file. All it knows is the here and now: the type of columns it actually reads. This is the flip side of "schema-less".
> >
> > The only real solution to this problem is for the user to tell Drill the column type. The team is looking at several options for doing this. [1]
> >
> > If you are creating your own reader, you can perhaps come up with an alternative solution. A quick check of the MessagePack spec [2] reveals that each field is prefixed by a type code. This resolves the INT/DOUBLE problems in JSON. But it does not help if you don't actually see an instance of the field: you still have to get that type information from somewhere.
> >
> > There is a JSON Schema spec [3] which could be useful. You could, for example, load it at plan time and pass it to your readers as part of the physical plan. (If you poke around the physical plan, in JSON form, returned from EXPLAIN PLAN, you'll see what I mean.) You might require, say, that the user put a ".schema" file in the root directory of your table.
> >
> > While the JSON reader is one way to create a reader, it is probably the most complex example around. I'd suggest you look at the recently-added log reader [4] for a simpler way to get started. We also dedicate a chapter to format plugins in the soon-to-be-published book Learning Apache Drill (available in preview from Safari.)
> >
> > There is also a not-yet-committed revised version [5] of the JSON reader that uses state objects instead of complex if-statements to keep track of parse states. This implementation is made more complex, however, because of the need to deal with all the ambiguities that occur in JSON without a schema. Feel free to borrow ideas if helpful.
> >
> > Thanks,
> > - Paul
> >
> > [1] https://issues.apache.org/jira/browse/DRILL-6552
> > [2] https://github.com/msgpack/msgpack/blob/master/spec.md
> > [3] JSON Schema
> > [4] https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log
> > [5] https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> >
> >
> > On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote <[email protected]> wrote:
> >
> > I'm implementing a msgpack reader, using the JSON reader as inspiration. I've noticed that in the JSON reader there's code to detect if rows were written but with no columns; in that case it will actually add one empty column of type INT. The comment in the code is "if we had no columns, create one empty one so we can return some data for count purposes".
> >
> > Is this what the proposal above would solve? The JSON reader would not have to guess the data type of the selected fields. Right now I get failures in my msgpack reader because the schema is guessed to be an INT but later it encounters a value and it is a VARCHAR.
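On the type-code point above: with the msgpack-java library a reader can peek at the format byte before materializing a value, so INT vs. VARCHAR is never a guess for a field that actually appears. A small self-contained sketch (my own illustration, not code from the plugin):

    import org.msgpack.core.MessageBufferPacker;
    import org.msgpack.core.MessagePack;
    import org.msgpack.core.MessageUnpacker;
    import org.msgpack.value.ValueType;

    public class TypePeek {
      public static void main(String[] args) throws Exception {
        // Build a tiny msgpack payload: {"age": 42}
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(1).packString("age").packInt(42);
        packer.close();

        MessageUnpacker unpacker =
            MessagePack.newDefaultUnpacker(packer.toByteArray());
        unpacker.unpackMapHeader(); // one entry
        unpacker.unpackString();    // key: "age"

        // The format byte tells us the type before we read the value,
        // so the reader can pick the right vector writer up front.
        ValueType type = unpacker.getNextFormat().getValueType();
        switch (type) {
          case INTEGER: System.out.println("INT: " + unpacker.unpackLong()); break;
          case FLOAT:   System.out.println("DOUBLE: " + unpacker.unpackDouble()); break;
          case STRING:  System.out.println("VARCHAR: " + unpacker.unpackString()); break;
          default:      unpacker.skipValue();
        }
        unpacker.close();
      }
    }

As noted above, though, this only settles the type of fields the reader actually sees; an absent field still needs its type from somewhere else, such as the .schema file.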
> >
> > I'm wondering why the JSON reader actually even writes a row if no fields were found in the batch. Is there something I can do better than the JSON reader, or is this a limitation of the Drill engine?
> >
> > Thanks
> > jc
> >
> > On Mon, Oct 1, 2018 at 1:54 PM Arina Yelchiyeva <[email protected]> wrote:
> >
> > > Currently Calcite supports the following syntax, apparently used in Phoenix:
> > > *select empno + x from EMP_MODIFIABLEVIEW extend (x int not null)*
> > >
> > > Another option to consider is hint syntax (many DBs use this one); basically it's a multiline comment followed by a plus:
> > > *select /*+.....*/ col_name from t*
> > > This would allow us to pass not only schema but join / index hints etc.
> > >
> > > Example:
> > > *select /*+ SCHEMA(a int not null, b int) */ a from t*
> > >
> > > One minus: we would need to implement this first in Calcite, if the Calcite community would be in favor of such changes.
> > >
> > > Kind regards,
> > > Arina
> > >
> > > On Mon, Sep 10, 2018 at 7:42 AM Paul Rogers <[email protected]> wrote:
> > >
> > > > Hi Weijie,
> > > >
> > > > Thanks for the paper pointer. F1 uses the same syntax as Scope (the system cited in my earlier note): data type after the name.
> > > >
> > > > Another description is [1]. Neither paper describes how F1 handles arrays. However, this second paper points out that Protobuf is F1's native format, and so F1 has support for nested types. Drill does also, but in Drill, a reference to "customer.phone.cell" causes the nested "cell" column to be projected as a top-level column. And neither paper says whether F1 is used with O/JDBC, and if so, how they handle the mapping from nested types to the flat tuple structure required by xDBC.
> > > >
> > > > Have you come across these details?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > > On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <[email protected]> wrote:
> > > >
> > > > Google's latest paper about F1 [1] claims to support any data source by using an extension API called TVF; see section 6.3. It also requires declaring column data types before the query.
> > > >
> > > > [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf
> > > >
> > > > On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We've discussed quite a few times whether Drill should or should not support or require schemas, and if so, how the user might express the schema.
> > > > >
> > > > > I came across a paper [1] that suggests a simple, elegant SQL extension:
> > > > >
> > > > > EXTRACT <column>[:<type>] {,<column>[:<type>]}
> > > > > FROM <stream_name>
> > > > >
> > > > > Paraphrasing into Drill's SQL:
> > > > >
> > > > > SELECT <column>[:<type>][AS <alias>] {,<column>[:<type>][AS <alias>]}
> > > > > FROM <table_name>
> > > > >
> > > > > Have a collection of JSON files in which string column `foo` appears in only half the files? Don't want to get schema conflicts with VARCHAR and nullable INT? Just do:
> > > > >
> > > > > SELECT name:VARCHAR, age:INT, foo:VARCHAR
> > > > > FROM `my-dir` ...
> > > > >
> > > > > Not only can the syntax be used to specify the "natural" type for a column, it might also specify a preferred type.
> > "age:INT" > > > > says > > > > > that "age" is an INT, even though JSON would normally parse it as a > > > > BIGINT. > > > > > Similarly, using this syntax is a easy way to tell Drill how to > > convert > > > > CSV > > > > > columns from strings to DATE, INT, FLOAT, etc. without the need for > > > CAST > > > > > functions. (CAST functions read the data in one format, then > convert > > it > > > > to > > > > > another in a Project operator. Using a column type might let the > > reader > > > > do > > > > > the conversion -- something that is easy to implement if using the > > > > "result > > > > > set loader" mechanism.) > > > > > > > > > > Plus, the syntax fits nicely into the existing view file structure. > > If > > > > the > > > > > types appear in views, then client tools can continue to use > standard > > > SQL > > > > > without the type information. > > > > > > > > > > When this idea came up in the past, someone mentioned the issue of > > > > > nullable vs. non-nullable. (Let's also include arrays, since Drill > > > > supports > > > > > that. Maybe add a suffix to the the name: > > > > > > > > > > SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ... > > > > > > > > > > Not pretty, but works with the existing SQL syntax rules. > > > > > > > > > > Obviously, Drill has much on its plate, so not suggestion that > Drill > > > > > should do this soon. Just passing it along as yet another option to > > > > > consider. > > > > > > > > > > Thanks, > > > > > - Paul > > > > > > > > > > [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf > > > > > > > > > >
