Re: Update on EDI support for Drill - repo and design collaboratory

Ted Dunning Sun, 13 Sep 2015 18:42:07 -0700

I doubt that I will be able to produce significant amounts of code. If I do
produce much of anything, I would be happy to contribute via pull requests.


So I don't need to be on the repo as a contributor.

On Sun, Sep 13, 2015 at 1:42 PM, Edmon Begoli <[email protected]> wrote:

> Ted, Matt, et al.,
>
> I have created a temporary repository for design and development of the
> support for EDI format in Drill.
> At this point, it is not a fork of Drill, but rather a collaboration space
> and code repository for exploratory code.
>
> Wiki:
> https://github.com/ebegoli/edi-drill-store/wiki
>
> Repo:
> https://github.com/ebegoli/edi-drill-store
>
> Once the difficult parts specific to EDI (logical nesting, record
> representation) are figured out, and generic code written for I/O and
> translation, I will look to merge this with Drill and blend it into
> Drill-specific patterns.
>
> *If you wish, I will add you to the repo so you can edit the wiki.*
>
> Let me know please.
>
> Edmon
>
>
> On Sun, Sep 6, 2015 at 7:16 AM, Edmon Begoli <[email protected]> wrote:
>
> > Matt - that is fantastic. Having good, liberally licensed format
> > converters probably takes care of 50% of the problem. The other 50%
> > will be in figuring out the logical mapping.
> >
> > Let me think a little bit and propose how we can best set up a
> > collaboration platform. Any suggestions are welcome.
> >
> > I personally like Google stuff, Hangouts, docs, and Github, of course.
> >
> >
> > On Saturday, September 5, 2015, Matthew Burgess <[email protected]>
> > wrote:
> >
> >> Edmon,
> >>
> >> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
> >> licensed, we have parsers/processors
> >> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
> >> for EDI / XML (StAX) / HL7 / YAML, etc. I have a plugin
> >> <https://github.com/mattyb149/load-text-from-file-plugin> (also Apache-2.0)
> >> using Tika to extract metadata; this could be refactored as a Drill plugin.
> >>
> >> The (semi-)structured-to-tabular conversion will be an issue that most Drill
> >> extenders will have to deal with, although with powerful functions like
> >> KVGEN() and FLATTEN() it should be less daunting. For graphs
> >> (highly-structured but non-tabular data sources), I'm also looking into a
> >> Gremlin <http://tinkerpop.incubator.apache.org/>  plugin, which could
> >> connect Graph Databases with Drill. Again, the problem is representing
> >> non-tabular data in a SQL environment as you mentioned.
> >>
> >> Regards,
> >> Matt
> >>
> >> From:  Edmon Begoli <[email protected]>
> >> Reply-To:  <[email protected]>
> >> Date:  Saturday, September 5, 2015 at 8:46 PM
> >> To:  <[email protected]>
> >> Subject:  Re: Data representation and conversation - translating nested
> >> hierarchies into a tabular/queriable format
> >>
> >> Matt - any contribution of your time is welcome! Thank you.
> >>
> >> These problems that we are wanting to look into are not easy problems; I
> >> would not expect quick solutions, but any good idea, contribution of time,
> >> or code will help us advance the state of the capabilities.
> >>
> >> I might create a branch or separate Github repo, so that we just use its
> >> wiki for documentation and collaboration, and then later for scratch pad
> >> development.
> >>
> >> Regarding existing tools you might have - *do you think you could bring
> >> this code under the Apache 2 license?*
> >> Knowing what you told me before, I think that contributing this code would
> >> help advance the state of Drill's format support tremendously.
> >>
> >> I see two major challenges related to what I am proposing:
> >>
> >> 1. (greater challenge) How to bring heterogeneously structured data
> >> logically and semantically into the tabular orientation of a typical SQL
> >> query processing engine.
> >> I think that some problems will not be completely implementable, so we'll
> >> need to either approximate or make some limiting/bounding design choices.
> >>
> >> 2. How to support these new formats through the Drill API. This is more of
> >> just an API study, design and programming effort. Nothing contradictory.
> >>
> >> Edmon
> >>
> >>
> >>
> >>
> >> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <[email protected]> wrote:
> >>
> >> >  Challenge accepted! :) are we talking about things like XML, Jsonnet,
> >> >  Yaml, etc.? And/or binary file formats that are (semi-)structured in nature
> >> >  like XLSX?
> >> >
> >> >  If we want to go more unstructured we could look at Apache Tika to at
> >> >  least pull out metadata on things like image and video files, and I'm
> >> >  tinkering with the idea of a UDF called topics() for human-generated text
> >> >  using Apache OpenNLP, the problem being a well-trained model for the target
> >> >  data.
> >> >
> >> >  Edmon, I admire your ambition and would like to help out where/when I can.
> >> >  Having said that, so far my amount of available time for Drill has been
> >> >  embarrassingly lower than my amount of interest.
> >> >
> >> >  For well-known file formats, I may be able to help with some of our
> >> >  open-source tools for parsing such files.
> >> >
> >> >  Regards,
> >> >  Matt
> >> >
> >> >  Sent from my iPhone
> >> >
> >> >>  > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <[email protected]> wrote:
> >> >>  >
> >> >>  > Anyone else from the Drill team is wholeheartedly invited.
> >> >>  >
> >> >>  > Edmon
> >> >>  >
> >> >>>  >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <[email protected]> wrote:
> >> >>>  >>
> >> >>>  >>
> >> >>>  >> Let's do it, Ted. I think it would add tremendous value to Drill as a
> >> >>>  >> solution.
> >> >>>  >>
> >> >>>  >> I will start a Google doc and share with you so we can share ideas,
> >> >>>  >> have Hangouts, design, etc. until we have something solid to put into
> >> >>>  >> Drill proper.
> >> >>>  >>
> >> >>>  >> If you have any other suggestion for the mode of collaboration please let
> >> >>>  >> me know.
> >> >>>  >>
> >> >>>>  >>> On Saturday, September 5, 2015, Ted Dunning <[email protected]> wrote:
> >> >>>>  >>>
> >> >>>>>  >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <[email protected]> wrote:
> >> >>>>>  >>>>
> >> >>>>>  >>>> *My question - has this been handled already in Drill and storage
> >> >>>>>  >>>> formats?*
> >> >>>>>  >>>>
> >> >>>>>  >>>> If so, where?
> >> >>>>>  >>>>
> >> >>>>>  >>>> If not, what is your recommendation for handling this?
> >> >>>>>  >>>>
> >> >>>>>  >>>> Should it be in an independent library outside of Drill that presents a
> >> >>>>>  >>>> flattened version (not sure if this is possible), or maybe break the
> >> >>>>>  >>>> message into tables corresponding to header data, items, footer?
> >> >>>>  >>>
> >> >>>>  >>> Drill does handle these kinds of data well, but currently the only file
> >> >>>>  >>> formats that it can consume for this kind of data are JSON and Parquet.
> >> >>>>  >>>
> >> >>>>  >>> It would be great to have more.  I would love to work on this with you.
> >> >>>  >>
> >> >
> >>
> >>
> >>
> >>
>
