I understand. I hope you and the rest will help me with design guidance as I start translating EDI format into a Drill-amenable one.
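
One possible starting point for that translation, offered only as a rough sketch: the Python below follows the header / items / footer split floated earlier in this thread (quoted below) and emits JSON, since JSON is one of the formats Drill can already consume. Everything in it is an assumption for illustration. The delimiters are hard-coded (`*` between elements, `~` after each segment), whereas real interchanges declare their own. The names `parse_segments` and `to_record` are hypothetical, as is the choice of `PO1` as the item segment, and the sample message is made up and is not a valid interchange.

```python
import json

# Assumption: fixed delimiters. Real interchanges declare their own.
ELEMENT_SEP = "*"
SEGMENT_END = "~"


def parse_segments(raw):
    """Split a raw EDI string into one list of elements per segment."""
    segments = []
    for chunk in raw.split(SEGMENT_END):
        chunk = chunk.strip()
        if chunk:
            segments.append(chunk.split(ELEMENT_SEP))
    return segments


def to_record(raw, item_tag="PO1"):
    """Group segments into header / items / footer around the item segments.

    Hypothetical helper: segments before the first item go to the header,
    item segments go to items, and every other segment after the first
    item goes to the footer.
    """
    header, items, footer = [], [], []
    seen_item = False
    for seg in parse_segments(raw):
        entry = {"tag": seg[0], "elements": seg[1:]}
        if seg[0] == item_tag:
            seen_item = True
            items.append(entry)
        elif not seen_item:
            header.append(entry)
        else:
            footer.append(entry)
    return {"header": header, "items": items, "footer": footer}


# Made-up sample, loosely shaped like a purchase order.
SAMPLE = (
    "ST*850*0001~BEG*00*SA*PO123~"
    "PO1*1*10*EA*9.50~PO1*2*4*EA*2.25~"
    "CTT*2~SE*6*0001~"
)

if __name__ == "__main__":
    # One JSON object per message; "items" is an array that a query
    # engine could later expand into rows.
    print(json.dumps(to_record(SAMPLE)))
```

The sketch deliberately ignores segments nested under an item (they would land in the footer here), which is exactly the logical-nesting problem that still needs a real design. The idea is only that once a message is a JSON record with a repeated `items` field, functions such as FLATTEN() mentioned in the thread might do part of the tabular work.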
On Sunday, September 13, 2015, Ted Dunning <ted.dunn...@gmail.com> wrote:

> I doubt that I will be able to produce significant amounts of code. If I do
> produce much of anything, I would be happy to contribute via pull requests.
>
> So I don't need to be on the repo as a contributor.
>
> On Sun, Sep 13, 2015 at 1:42 PM, Edmon Begoli <ebeg...@gmail.com> wrote:
>
>> Ted, Matt, et al.,
>>
>> I have created a temporary repository for design and development of the
>> support for EDI format in Drill.
>> At this point, it is not a fork of Drill, but rather a collaboration space
>> and code repository for exploratory code.
>>
>> Wiki:
>> https://github.com/ebegoli/edi-drill-store/wiki
>>
>> Repo:
>> https://github.com/ebegoli/edi-drill-store
>>
>> Once the difficult parts specific to EDI (logical nesting, record
>> representation) are figured out, and generic code written for I/O and
>> translation,
>> I will look to merge this with Drill and blend it into Drill-specific
>> patterns.
>>
>> *If you wish, I will add you to the repo, so you can edit Wiki.*
>>
>> Let me know please.
>>
>> Edmon
>>
>> On Sun, Sep 6, 2015 at 7:16 AM, Edmon Begoli <ebeg...@gmail.com> wrote:
>>
>>> Matt - that is fantastic. Having good, liberally licensed format
>>> converters probably takes care of 50% of the problem. The other 50%
>>> will be in figuring out the logical mapping.
>>>
>>> Let me think a little bit and propose how we can best set up a
>>> collaboration platform. Any suggestion for this is welcome.
>>>
>>> I personally like Google stuff, Hangouts, docs, and Github, of course.
>>>
>>> On Saturday, September 5, 2015, Matthew Burgess <mattyb...@gmail.com>
>>> wrote:
>>>
>>>> Edmon,
>>>>
>>>> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
>>>> licensed, we have parsers/processors
>>>> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
>>>> for EDI / XML(StaX) / HL7 / YAML, etc. I have a plugin
>>>> <https://github.com/mattyb149/load-text-from-file-plugin> (also
>>>> Apache-2.0) using Tika to extract metadata, this could be refactored
>>>> as a Drill plugin.
>>>>
>>>> The (semi-)structured-to-tabular conversion will be an issue that most
>>>> Drill extenders will have to deal with, although with powerful
>>>> functions like KVGEN() and FLATTEN() it should be less daunting. For
>>>> graphs (highly-structured but non-tabular data sources), I'm also
>>>> looking into a Gremlin <http://tinkerpop.incubator.apache.org/> plugin,
>>>> which could connect Graph Databases with Drill. Again, the problem is
>>>> representing non-tabular data in a SQL environment as you mentioned.
>>>>
>>>> Regards,
>>>> Matt
>>>>
>>>> From: Edmon Begoli <ebeg...@gmail.com>
>>>> Reply-To: <dev@drill.apache.org>
>>>> Date: Saturday, September 5, 2015 at 8:46 PM
>>>> To: <dev@drill.apache.org>
>>>> Subject: Re: Data representation and conversation - translating nested
>>>> hierarchies into a tabular/queriable format
>>>>
>>>> Matt - any contribution of your time is welcome! Thank you.
>>>>
>>>> These problems that we are wanting to look into are not easy problems;
>>>> I would not expect quick solutions, but any good idea, contribution of
>>>> time, or code will help us advance the state of the capabilities.
>>>>
>>>> I might create a branch or separate Github repo, so that we just use
>>>> its wiki for documentation and collaboration, and then later for
>>>> scratch pad development.
>>>>
>>>> Regarding existing tools you might have - *do you think you could bring
>>>> this code under the Apache 2 license?*
>>>> Knowing what you told me before, I think that contributing this code
>>>> would help advance the state of Drill's format support tremendously.
>>>>
>>>> I see two major challenges related to what I am proposing:
>>>>
>>>> 1. (greater challenge) How to bring heterogeneously structured data
>>>> logically and semantically into the tabular orientation of a typical
>>>> SQL query processing engine.
>>>> I think that some problems will not be completely implementable, so
>>>> we'll need to either approximate or make some limiting/bounding design
>>>> choices.
>>>>
>>>> 2. How to support these new formats through the Drill API. This is more
>>>> of just an API study, design and programming effort. Nothing
>>>> contradictory.
>>>>
>>>> Edmon
>>>>
>>>> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <mattyb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Challenge accepted! :) are we talking about things like XML, Jsonnet,
>>>>> Yaml, etc.? And/or binary file formats that are (semi-)structured in
>>>>> nature like XLSX?
>>>>>
>>>>> If we want to go more unstructured we could look at Apache Tika to at
>>>>> least pull out metadata on things like image and video files, and I'm
>>>>> tinkering with the idea of a UDF called topics() for human-generated
>>>>> text using Apache OpenNLP, the problem being a well-trained model for
>>>>> the target data.
>>>>>
>>>>> Edmon, I admire your ambition and would like to help out where/when I
>>>>> can. Having said that, so far my amount of available time for Drill
>>>>> has been embarrassingly lower than my amount of interest.
>>>>>
>>>>> For well-known file formats, I may be able to help with some of our
>>>>> open-source tools for parsing such files.
>>>>>
>>>>> Regards,
>>>>> Matt
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Sep 5, 2015, at 7:44 PM, Edmon Begoli <ebeg...@gmail.com> wrote:
>>>>>>
>>>>>> Anyone else from the Drill team wholeheartedly invited.
>>>>>>
>>>>>> Edmon
>>>>>>
>>>>>>> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <ebeg...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Let's do it, Ted. I think it would add tremendous value to Drill as
>>>>>>> a solution.
>>>>>>>
>>>>>>> I will start a Google doc and share with you so we can share ideas,
>>>>>>> have Hangouts, design, etc. until we have something solid to put
>>>>>>> into Drill proper.
>>>>>>>
>>>>>>> If you have any other suggestion for the mode of collaboration
>>>>>>> please let me know.
>>>>>>>
>>>>>>>> On Saturday, September 5, 2015, Ted Dunning
>>>>>>>> <ted.dunn...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <ebeg...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> *My question - has this been handled already in Drill and storage
>>>>>>>>> formats?*
>>>>>>>>>
>>>>>>>>> If so, where?
>>>>>>>>>
>>>>>>>>> If not, what is your recommendation for handling this?
>>>>>>>>>
>>>>>>>>> Should it be in an independent library outside of Drill that
>>>>>>>>> presents a flattened version (not sure if this is possible), or
>>>>>>>>> maybe break the message into tables corresponding to header data,
>>>>>>>>> items, footer.
>>>>>>>>
>>>>>>>> Drill does handle these kinds of data well, but currently the only
>>>>>>>> file formats that it can consume for this kind of data are JSON and
>>>>>>>> Parquet.
>>>>>>>>
>>>>>>>> It would be great to have more. I would love to work on this with
>>>>>>>> you.