Ted, Matt, et al.,

I have created a temporary repository for the design and development of
EDI format support in Drill. At this point, it is not a fork of Drill,
but rather a collaboration space and code repository for exploratory code.
Wiki: https://github.com/ebegoli/edi-drill-store/wiki
Repo: https://github.com/ebegoli/edi-drill-store

Once the difficult parts specific to EDI (logical nesting, record
representation) are figured out, and generic code is written for I/O and
translation, I will look to merge this with Drill and adapt it to
Drill-specific patterns.

*If you wish, I will add you to the repo so you can edit the wiki.*
Please let me know.

Edmon

On Sun, Sep 6, 2015 at 7:16 AM, Edmon Begoli <ebeg...@gmail.com> wrote:

> Matt - that is fantastic. Having good, liberally licensed format
> converters probably takes care of 50% of the problem. The other 50%
> will be in figuring out the logical mapping.
>
> Let me think a little bit and propose how we can best set up a
> collaboration platform. Any suggestions for this are welcome.
>
> I personally like Google stuff, Hangouts, docs, and Github, of course.
>
> On Saturday, September 5, 2015, Matthew Burgess <mattyb...@gmail.com>
> wrote:
>
>> Edmon,
>>
>> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
>> licensed; we have parsers/processors
>> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
>> for EDI / XML (StAX) / HL7 / YAML, etc. I have a plugin
>> <https://github.com/mattyb149/load-text-from-file-plugin> (also
>> Apache-2.0) using Tika to extract metadata; this could be refactored as
>> a Drill plugin.
>>
>> The (semi-)structured-to-tabular conversion will be an issue that most
>> Drill extenders will have to deal with, although with powerful
>> functions like KVGEN() and FLATTEN() it should be less daunting. For
>> graphs (highly-structured but non-tabular data sources), I'm also
>> looking into a Gremlin <http://tinkerpop.incubator.apache.org/> plugin,
>> which could connect Graph Databases with Drill.
>> Again, the problem is representing non-tabular data in a SQL
>> environment, as you mentioned.
>>
>> Regards,
>> Matt
>>
>> From: Edmon Begoli <ebeg...@gmail.com>
>> Reply-To: <dev@drill.apache.org>
>> Date: Saturday, September 5, 2015 at 8:46 PM
>> To: <dev@drill.apache.org>
>> Subject: Re: Data representation and conversation - translating nested
>> hierarchies into a tabular/queriable format
>>
>> Matt - any contribution of your time is welcome! Thank you.
>>
>> The problems that we want to look into are not easy problems; I would
>> not expect quick solutions, but any good idea, contribution of time, or
>> code will help us advance the state of the capabilities.
>>
>> I might create a branch or a separate Github repo, so that we first use
>> its wiki for documentation and collaboration, and then later use it for
>> scratch-pad development.
>>
>> Regarding existing tools you might have - *do you think you could bring
>> this code under the Apache 2 license?*
>> Knowing what you told me before, I think that contributing this code
>> would help advance the state of Drill's format support tremendously.
>>
>> I see two major challenges related to what I am proposing:
>>
>> 1. (greater challenge) How to bring heterogeneously structured data
>> logically and semantically into the tabular orientation of a typical
>> SQL query processing engine.
>> I think that some problems will not be completely implementable, so
>> we'll need to either approximate or make some limiting/bounding design
>> choices.
>>
>> 2. How to support these new formats through the Drill API. This is more
>> of just an API study, design, and programming effort. Nothing
>> contradictory.
>>
>> Edmon
>>
>> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <mattyb...@gmail.com> wrote:
>>
>> > Challenge accepted! :) Are we talking about things like XML, Jsonnet,
>> > Yaml, etc.? And/or binary file formats that are (semi-)structured in
>> > nature, like XLSX?
>> >
>> > If we want to go more unstructured we could look at Apache Tika to at
>> > least pull out metadata on things like image and video files, and I'm
>> > tinkering with the idea of a UDF called topics() for human-generated
>> > text using Apache OpenNLP, the problem being a well-trained model for
>> > the target data.
>> >
>> > Edmon, I admire your ambition and would like to help out where/when I
>> > can. Having said that, so far my amount of available time for Drill
>> > has been embarrassingly lower than my amount of interest.
>> >
>> > For well-known file formats, I may be able to help with some of our
>> > open-source tools for parsing such files.
>> >
>> > Regards,
>> > Matt
>> >
>> > Sent from my iPhone
>> >
>> >> > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <ebeg...@gmail.com> wrote:
>> >> >
>> >> > Anyone else from the Drill team wholeheartedly invited.
>> >> >
>> >> > Edmon
>> >> >
>> >>> >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <ebeg...@gmail.com> wrote:
>> >>> >>
>> >>> >> Let's do it, Ted. I think it would add tremendous value to Drill
>> >>> >> as a solution.
>> >>> >>
>> >>> >> I will start a Google doc and share it with you so we can share
>> >>> >> ideas, have Hangouts, design, etc. until we have something solid
>> >>> >> to put into Drill proper.
>> >>> >>
>> >>> >> If you have any other suggestion for the mode of collaboration,
>> >>> >> please let me know.
>> >>> >>
>> >>>> >>> On Saturday, September 5, 2015, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> >>>> >>>
>> >>>>> >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <ebeg...@gmail.com> wrote:
>> >>>>> >>>>
>> >>>>> >>>> *My question - has this been handled already in Drill and
>> >>>>> >>>> storage formats?*
>> >>>>> >>>>
>> >>>>> >>>> If so, where?
>> >>>>> >>>>
>> >>>>> >>>> If not, what is your recommendation for handling this?
>> >>>>> >>>>
>> >>>>> >>>> Should it be in an independent library outside of Drill that
>> >>>>> >>>> presents a flattened version (not sure if this is possible),
>> >>>>> >>>> or maybe break the message into tables corresponding to
>> >>>>> >>>> header data, items, footer?
>> >>>> >>>
>> >>>> >>> Drill does handle these kinds of data well, but currently the
>> >>>> >>> only file formats that it can consume for this kind of data
>> >>>> >>> are JSON and Parquet.
>> >>>> >>>
>> >>>> >>> It would be great to have more. I would love to work on this
>> >>>> >>> with you.