Can you elaborate on what you mean by .drill being a different use case? In my mind, .drill has two use cases: a way to specify hints to Drill on how to read certain datasets (and potentially optimize the queries on those datasets), and a way to save the definitions of objects created via Drill for reuse/access from BI tools. Both of these (i.e., existing or external tables vs. Drill-created or internal tables) are currently not differentiated in Drill, hence I believe they can use the same model in terms of metadata handling.
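For illustration (the folder, file name, and hint keys below are all
made up, just to sketch the idea): if a .drill file next to the data
carried format hints, a BI tool could issue a plain query with no
custom syntax:

    -- hypothetical /data/logs/.drill contents:
    --   { "type": "text", "fieldDelimiter": "|", "extractHeader": true }
    -- the query itself then needs no format options:
    SELECT * FROM dfs.`default`.`/data/logs/2015-10.psv`;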
I would be interested in knowing your thoughts.

-Neeraja

On Wed, Oct 21, 2015 at 4:55 PM, Julien Le Dem <jul...@dremio.com> wrote:

> I think of .drill files as a different use case, but there is
> potentially some overlap.
> Some things to keep in mind:
> - The person analyzing the data often has read-only access.
> - Having to write a config file on one end of the system and then query
> on the other end is not analyst friendly.
>
> We should definitely keep .drill in mind while designing this.
> Although I'm thinking we should probably discuss .drill on a separate
> thread.
>
> On Wed, Oct 21, 2015 at 3:55 PM, Neeraja Rentachintala <
> nrentachint...@maprtech.com> wrote:
>
> > Another alternative to do this is to specify a metadata file (.drill
> > files), which came up in some of the earlier discussions to solve
> > similar use cases.
> > Rather than centrally defining configurations in the storage plugin
> > (which is what Drill does today), .drill files allow more granularity,
> > potentially at the folder or individual file level, overriding the
> > central configuration.
> >
> > I think the benefit of the metadata file is that it can be used for
> > other purposes (such as stats, etc.). Another benefit is that if you
> > are using a BI/query tool to trigger Drill SQL queries, this will work
> > seamlessly rather than having to rewrite the query with custom syntax.
> >
> > I would like to know what others think of this approach.
> >
> > -Neeraja
> >
> > On Wed, Oct 21, 2015 at 3:43 PM, Julien Le Dem <jul...@dremio.com>
> > wrote:
> >
> > > I like the approach of using views to add missing metadata to an
> > > existing raw dataset. The raw datasets stay the same and the view
> > > becomes the interface to the data. Nothing is mutated and we know
> > > how things are derived from one another.
> > >
> > > TL;DR: I'm trying to summarize the options below and add a few
> > > thoughts (please comment on whether you think the upside/downside
> > > elements are valid).
> > > Let's refer to those options with the name in *bold*.
> > >
> > > - *urls style params* (my initial strawman): *select * from
> > > dfs.`default`.`/path/to/file/something.psv?type=text&delimiter=|`;*
> > > Extensibility is ensured through the storage plugin's interpretation
> > > of the path.
> > > I agree with decomposing the syntax of the format vs. the data
> > > path/url, so this would conflict with HTTP query parameters.
> > > I will not pursue this one, but I think it was a great conversation
> > > starter!
> > >
> > > - *table functions*: *select * from
> > > delimitedFile(dfs.`default`.`/path/to/file/something.psv`, '|')*
> > > It sounds like these would have to be defined in the corresponding
> > > StoragePlugin as they need intimate knowledge of the underlying
> > > storage. Extensibility would come through the storage plugin
> > > defining those?
> > > +1 on named parameters vs. just positional.
> > > The possible downside is a syntax that could be a little foreign to
> > > the data analyst.
> > >
> > > - *fake-o parameters*: *select * from
> > > dfs.`default`.`/path/to/file/something.psv` where
> > > magicFieldDelimiter = '|';*
> > > I would be inclined to avoid using filters for something that
> > > changes what the data looks like.
> > > This could be unintuitive to users (at least it feels that way to
> > > me).
> > > Extensibility would come through the storage plugin defining
> > > predicate push down rules?
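> > > To sketch why this feels ambiguous (a hypothetical query mixing the
> > > pseudo-option with a real predicate on the text reader's columns
> > > array):
> > >
> > > select * from dfs.`default`.`/path/to/file/something.psv`
> > > where magicFieldDelimiter = '|' and columns[0] = 'foo';
> > >
> > > Is columns[0] the first '|'-separated field, or the first field
> > > under the default delimiter?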
> > >
> > > - *WITH USING OPTIONS*:
> > > In general it would feel more natural to me to put those options as
> > > part of a select statement. The select statement can always be used
> > > in a WITH clause.
> > > Extensibility would come through the storage plugin receiving those
> > > options? As the with statement applies to a full select statement,
> > > potentially with joins, how would we know where to apply those
> > > options?
> > > We have two sub-options:
> > > - *(type = 'text' AND lineDelimiter = '\n')*: this seems similar to
> > > fake-o parameters (above); same comment as above. *AND* seems out of
> > > place; for example, OR would be forbidden.
> > > - *{ type: "text", lineDelimiter: "\n" }*: The advantage of this is
> > > that you can re-use the same syntax in the configuration file. This
> > > is a plus for consistency. Users would figure out what works and
> > > would just have to put it in the central configuration once done.
> > >
> > > - *EXTEND WITH OPTIONS*: *SELECT * FROM emp EXTEND [optional column
> > > list] OPTIONS (type 'text', lineDelimiter '\n');*
> > > What would be the column list in that case? Would it be awkward to
> > > use EXTEND without a column list?
> > > Extensibility would come through the storage plugin receiving those
> > > extend options.
> > > It sounds like they could be simply SELECT OPTIONS?
> > >
> > > - *Specific syntax*: *select * FROM mydb.mytable TREAT AS TEXT USING
> > > LINE DELIMITER '\n' AND FIELD DELIMITER ',' AND TREAT FIRST ROW AS
> > > HEADERS*
> > > Extensibility would come through the storage plugin defining a
> > > sub-grammar for those options.
> > > Possibly this is harder to implement for the storage plugin
> > > implementor.
> > > The upside is that the user has a SQL-like syntax to specify this
> > > (although I've never been fond of parts where SQL is trying to be
> > > plain English).
> > >
> > > *Appendix*: :P
> > > For what it's worth, here is how Pig does it: *LOAD 'data' [USING
> > > function] [AS schema];*
> > > - Just *LOAD '/path/to/my/file'* will use the default loader
> > > (tab-separated values).
> > > - Adding a custom loader, *LOAD 'data' USING MyCustomLoader('param1',
> > > 'param2');*, is how you implement custom formats or remote locations
> > > (HBase, web service, ...).
> > > So you can use the default loader with a different separator passed
> > > as a parameter (e.g., *USING PigStorage('|')*) or write your own.
> > > - The AS clause lets you set the schema. You could have a CSV file
> > > without a header and define the names and types of each column that
> > > way.
> > > Doc: https://pig.apache.org/docs/r0.15.0/basic.html#load
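> > >
> > > For comparison, a rough sketch of the same schema ascription in
> > > Drill today via a view over the text reader's columns array (the
> > > view name and column types are made up):
> > >
> > > CREATE VIEW dfs.tmp.`something_typed` AS
> > > SELECT CAST(columns[0] AS VARCHAR(100)) AS name,
> > >        CAST(columns[1] AS INT) AS age
> > > FROM dfs.`default`.`/path/to/file/something.psv`;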
> > >
> > > On Wed, Oct 21, 2015 at 8:57 AM, Julian Hyde <jh...@apache.org> wrote:
> > >
> > > > Whatever API is used to scan files from SQL, there will need to be
> > > > a corresponding way to accomplish the same thing in a user
> > > > interface. Probably a form with various fields, some of them with
> > > > drop-boxes, etc.
> > > >
> > > > And ideally a facility that samples a few hundred rows to deduce
> > > > the probable field names and types and which fields are unique.
> > > >
> > > > I think that the UI is the true "user friendly" interface. A usage
> > > > pattern might be for people to define a data source in the UI,
> > > > save it as a view, then switch to the command line to write
> > > > queries on that view.
> > > >
> > > > There are other use cases similar to reading files. For example,
> > > > you would like to read data from an HTTP URL. You might want to
> > > > specify similar parameters for formats, compression, and parsing,
> > > > and parameters in the file URI that describe a family of
> > > > partitioned files. A URL might allow push-down of filters,
> > > > projects, and sorts. But you would still want to specify formats,
> > > > compression, and parsing the same way as when reading files.
> > > >
> > > > To me, this argues for decomposing the file scan syntax into
> > > > pieces that can be re-used if you get data from places other than
> > > > files.
> > > >
> > > > Julian
> > > >
> > > > On Wed, Oct 21, 2015 at 6:15 AM, Jacques Nadeau
> > > > <jacq...@dremio.com> wrote:
> > > > >> This fourth is also least extensible and thus most
> > > > >> disenfranchising for those outside the inner group.
> > > > >>
> > > > >> Table functions (hopefully) would be something that others
> > > > >> could implement.
> > > > >
> > > > > This is a brainstorm about user APIs...
> > > > >
> > > > > It isn't appropriate to shoot down ideas immediately in a
> > > > > brainstorm. It has a chilling effect on other people presenting
> > > > > new ideas.
> > > > >
> > > > > User APIs should be defined first with an end-user in mind.
> > > > > Consistency in different contexts is also very important (note
> > > > > my expansion of the discussion to ASCRIBE METADATA).
> > > > >
> > > > > Your statement about extensibility has no technical backing. If
> > > > > you have a concern like this, ask the proposer if they think
> > > > > that this could be done in an extensible way. In this case I see
> > > > > a number of ways that this could be done. Note the mention of
> > > > > the grammar switch in my initial proposal, or go review my
> > > > > outstanding patch for integrating JSON literals. In many ways, I
> > > > > think this approach could be considered the most extensible and
> > > > > expressive for a Drill extension developer.
> > > > >
> > > > > I hope others will continue to brainstorm on this thread.
> > >
> > > --
> > > Julien
>
> --
> Julien