Matt, any contribution of your time is welcome! Thank you. The problems we want to look into are not easy; I would not expect quick solutions, but any good idea, contribution of time, or code will help us advance the state of Drill's capabilities.

I might create a branch or a separate GitHub repo, so that we can use its wiki for documentation and collaboration, and later for scratch-pad development.

Regarding existing tools you might have: *do you think you could bring this code under the Apache 2 license?* Knowing what you told me before, I think that contributing this code would help advance Drill's format support tremendously.

I see two major challenges related to what I am proposing:

1. (greater challenge) How to bring heterogeneously structured data logically and semantically into the tabular orientation of a typical SQL query processing engine. I think that some problems will not be completely implementable, so we'll need to either approximate or make some limiting/bounding design choices.
2. How to support these new formats through the Drill API. This is mostly an API study, design, and programming effort. Nothing contradictory.

Edmon

On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <[email protected]> wrote:

> Challenge accepted! :) are we talking about things like XML, Jsonnet,
> Yaml, etc.? And/or binary file formats that are (semi-)structured in nature
> like XLSX?
>
> If we want to go more unstructured we could look at Apache Tika to at
> least pull out metadata on things like image and video files, and I'm
> tinkering with the idea of a UDF called topics() for human-generated text
> using Apache OpenNLP, the problem being a well-trained model for the target
> data.
>
> Edmon, I admire your ambition and would like to help out where/when I can.
> Having said that, so far my amount of available time for Drill has been
> embarrassingly lower than my amount of interest.
>
> For well-known file formats, I may be able to help with some of our
> open-source tools for parsing such files.
>
> Regards,
> Matt
>
> Sent from my iPhone
>
> > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <[email protected]> wrote:
> >
> > Anyone else from the Drill team wholeheartedly invited.
> >
> > Edmon
> >
> >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <[email protected]> wrote:
> >>
> >> Let's do it, Ted. I think it would add tremendous value to Drill as a
> >> solution.
> >>
> >> I will start a Google doc and share with you so we can share ideas,
> >> have Hangouts, design, etc. until we have something solid to put into
> >> Drill proper.
> >>
> >> If you have any other suggestion for the mode of collaboration please
> >> let me know.
> >>
> >>> On Saturday, September 5, 2015, Ted Dunning <[email protected]> wrote:
> >>>
> >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <[email protected]> wrote:
> >>>>
> >>>> *My question - has this been handled already in Drill and storage
> >>>> formats?*
> >>>>
> >>>> If so, where?
> >>>>
> >>>> If not, what is your recommendation for handling this?
> >>>>
> >>>> Should it be in an independent library outside of Drill that presents
> >>>> a flattened version (not sure if this is possible), or maybe break the
> >>>> message into tables corresponding to header data, items, footer?
> >>>
> >>> Drill does handle these kinds of data well, but currently the only file
> >>> formats that it can consume for this kind of data are JSON and Parquet.
> >>>
> >>> It would be great to have more. I would love to work on this with you.
> >>
>
