I've thought of declaring the schema the way it was done with the pcap reader. It seems to use a DrillTable to define its schema. However, I like the idea of a .schema file at the root of the table. I tried what you suggested (looking at the explain plan), but I failed to see what you were pointing me at.
Yes, I realize the JSON reader is complex. I actually used both the JSON reader and the BSON (Mongo) reader: all the dfs-related code I took from JSON, and the actual parsing from BSON. I think I have something fairly complete:
https://github.com/jcmcote/drill-1/tree/master/contrib/format-msgpack

I'm running it with real data now and I'm uncovering edge cases, like the issue where, as you say, Drill can't see into the future. But as I said, I think having a .schema file would be a very good idea. How would I best leverage such a file?

Thank you very much
jc

On Mon, Oct 1, 2018 at 9:51 PM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> One of Drill's challenges is that it cannot predict the future: it can't
> know what type your column will be in later records or in another file.
> All it knows is the here and now: the type of the columns it actually
> reads. This is the flip side of "schema-less".
>
> The only real solution to this problem is for the user to tell Drill the
> column type. The team is looking at several options for doing this. [1]
>
> If you are creating your own reader, you can perhaps come up with an
> alternative solution. A quick check of the MessagePack spec [2] reveals
> that each field is prefixed by a type code. This resolves the INT/DOUBLE
> problems in JSON. But it does not help if you don't actually see an
> instance of the field: you still have to get that type information from
> somewhere.
>
> There is a JSON Schema spec [3] which could be useful. You could, for
> example, load it at plan time and pass it to your readers as part of the
> physical plan. (If you poke around the physical plan, in JSON form,
> returned from EXPLAIN PLAN, you'll see what I mean.) You might require,
> say, that the user put a ".schema" file in the root directory of your
> table.
>
> While the JSON reader is one way to create a reader, it is probably the
> most complex example around. I'd suggest you look at the recently added
> log reader [4] for a simpler way to get started. We also dedicate a
> chapter to format plugins in the soon-to-be-published book Learning
> Apache Drill (available in preview from Safari).
>
> There is also a not-yet-committed revised version [5] of the JSON reader
> that uses state objects instead of complex if-statements to keep track
> of parse states. That implementation is made more complex, however, by
> the need to deal with all the ambiguities that occur in JSON without a
> schema. Feel free to borrow ideas if helpful.
>
> Thanks,
> - Paul
>
> [1] https://issues.apache.org/jira/browse/DRILL-6552
> [2] https://github.com/msgpack/msgpack/blob/master/spec.md
> [3] JSON Schema
> [4] https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log
> [5] https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
>
> On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote <[email protected]> wrote:
>
> I'm implementing a msgpack reader and used the JSON reader as
> inspiration. I've noticed that the JSON reader has code to detect when
> rows were written but no columns were created; in that case it actually
> adds one empty column of type INT. The comment in the code is "if we had
> no columns, create one empty one so we can return some data for count
> purposes".
>
> Is this what the proposal above would solve? The JSON reader would not
> have to guess the data type of the selected fields.
> Right now I get failures in my msgpack reader because the schema is
> guessed to be an INT, but later the reader encounters a value that is a
> VARCHAR.
>
> I'm wondering why the JSON reader even writes a row if no fields were
> found in the batch. Is there something I can do better than the JSON
> reader, or is this a limitation of the Drill engine?
>
> Thanks
> jc
>
> On Mon, Oct 1, 2018 at 1:54 PM Arina Yelchiyeva <[email protected]> wrote:
>
> > Currently Calcite supports the following syntax, apparently used in
> > Phoenix:
> > *select empno + x from EMP_MODIFIABLEVIEW extend (x int not null)*
> >
> > Another option to consider is hint syntax (many DBs use this one);
> > basically it's a multiline comment followed by a plus:
> > *select /*+.....*/ col_name from t*
> > This would allow us to pass not only the schema but join / index
> > hints, etc.
> >
> > Example:
> > *select /*+ SCHEMA(a int not null, b int) */ a from t*
> >
> > One minus: we would need to implement this first in Calcite, if the
> > Calcite community is in favor of such changes.
> >
> > Kind regards,
> > Arina
> >
> > On Mon, Sep 10, 2018 at 7:42 AM Paul Rogers <[email protected]> wrote:
> >
> > > Hi Weijie,
> > >
> > > Thanks for the paper pointer. F1 uses the same syntax as Scope (the
> > > system cited in my earlier note): the data type after the name.
> > >
> > > Another description is [1]. Neither paper describes how F1 handles
> > > arrays. However, this second paper points out that Protobuf is F1's
> > > native format, and so F1 has support for nested types. Drill does
> > > also, but in Drill a reference to "customer.phone.cell" causes the
> > > nested "cell" column to be projected as a top-level column. And
> > > neither paper says whether F1 is used with O/JDBC, and if so, how
> > > they handle the mapping from nested types to the flat tuple
> > > structure required by xDBC.
> > >
> > > Have you come across these details?
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <[email protected]> wrote:
> > >
> > > Google's latest paper about F1 [1] claims to support any data source
> > > by using an extension API called TVF (see section 6.3). It also
> > > requires declaring column data types before the query.
> > >
> > > [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf
> > >
> > > On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > We've discussed quite a few times whether Drill should or should
> > > > not support or require schemas, and if so, how the user might
> > > > express the schema.
> > > >
> > > > I came across a paper [1] that suggests a simple, elegant SQL
> > > > extension:
> > > >
> > > > EXTRACT <column>[:<type>] {,<column>[:<type>]}
> > > > FROM <stream_name>
> > > >
> > > > Paraphrasing into Drill's SQL:
> > > >
> > > > SELECT <column>[:<type>] [AS <alias>] {,<column>[:<type>] [AS <alias>]}
> > > > FROM <table_name>
> > > >
> > > > Have a collection of JSON files in which string column `foo`
> > > > appears in only half the files? Don't want schema conflicts
> > > > between VARCHAR and nullable INT? Just do:
> > > >
> > > > SELECT name:VARCHAR, age:INT, foo:VARCHAR
> > > > FROM `my-dir` ...
> > > >
> > > > Not only can the syntax be used to specify the "natural" type for
> > > > a column, it might also specify a preferred type.
> "age:INT" > > > says > > > > that "age" is an INT, even though JSON would normally parse it as a > > > BIGINT. > > > > Similarly, using this syntax is a easy way to tell Drill how to > convert > > > CSV > > > > columns from strings to DATE, INT, FLOAT, etc. without the need for > > CAST > > > > functions. (CAST functions read the data in one format, then convert > it > > > to > > > > another in a Project operator. Using a column type might let the > reader > > > do > > > > the conversion -- something that is easy to implement if using the > > > "result > > > > set loader" mechanism.) > > > > > > > > Plus, the syntax fits nicely into the existing view file structure. > If > > > the > > > > types appear in views, then client tools can continue to use standard > > SQL > > > > without the type information. > > > > > > > > When this idea came up in the past, someone mentioned the issue of > > > > nullable vs. non-nullable. (Let's also include arrays, since Drill > > > supports > > > > that. Maybe add a suffix to the the name: > > > > > > > > SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ... > > > > > > > > Not pretty, but works with the existing SQL syntax rules. > > > > > > > > Obviously, Drill has much on its plate, so not suggestion that Drill > > > > should do this soon. Just passing it along as yet another option to > > > > consider. > > > > > > > > Thanks, > > > > - Paul > > > > > > > > [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf > > > > > >

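As for how a ".schema" file might be leveraged: below is one possible shape, sketched under the assumption that the file sits in the table's root directory and that its text is then attached to the format config or group scan, both of which are Jackson-serialized into the physical plan (which is why they are visible in EXPLAIN PLAN output). The class and the file-name convention are hypothetical, not an existing Drill API:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemaFileLoader {
      // Hypothetical convention: a ".schema" file at the root of the table.
      private static final String SCHEMA_FILE_NAME = ".schema";

      // At plan time, return the schema text if the user provided one,
      // else null. The caller could stash the text on the Jackson-serialized
      // format config or group scan so that it rides along in the physical
      // plan down to each reader.
      public static String loadSchema(FileSystem fs, Path tableRoot) throws IOException {
        Path schemaPath = new Path(tableRoot, SCHEMA_FILE_NAME);
        if (!fs.exists(schemaPath)) {
          return null; // no schema file: fall back to type inference
        }
        try (InputStream in = fs.open(schemaPath)) {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          byte[] buf = new byte[4096];
          int n;
          while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
          }
          return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
      }
    }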