Update on EDI support for Drill - repo and design collaboratory

Edmon Begoli Sun, 13 Sep 2015 13:43:40 -0700

Ted, Matt, et al.,

I have created a temporary repository for design and development of
EDI format support in Drill.
At this point, it is not a fork of Drill, but rather a collaboration space
and code repository for exploratory code.


Wiki:
https://github.com/ebegoli/edi-drill-store/wiki

Repo:
https://github.com/ebegoli/edi-drill-store

Once the difficult parts specific to EDI (logical nesting, record
representation) are figured out, and generic code is written for I/O and
translation, I will look to merge this with Drill and adapt it to
Drill-specific patterns.
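
To make the record-representation question concrete, here is a minimal
Python sketch. It is not code from the repository. It assumes a simplified
X12-style message in which "~" ends a segment and "*" separates elements,
and it assumes that PO1 segments are line items and CTT/SE segments are the
footer. The sample message, the tag sets, and the function names are all
hypothetical and exist only for illustration.

```python
# Hypothetical, simplified X12-style purchase order. "~" ends a segment and
# "*" separates elements. Real interchanges also carry ISA/GS envelopes,
# which this sketch ignores.
SAMPLE = (
    "ST*850*0001~BEG*00*NE*PO123~"
    "PO1*1*10*EA*2.50~PO1*2*4*EA*9.00~"
    "CTT*2~SE*6*0001~"
)

ITEM_TAGS = {"PO1"}          # assumption: segments treated as line items
FOOTER_TAGS = {"CTT", "SE"}  # assumption: segments treated as footer


def parse_segments(message, seg_sep="~", elem_sep="*"):
    """Split a message into lists of elements, one list per segment."""
    return [
        seg.strip().split(elem_sep)
        for seg in message.split(seg_sep)
        if seg.strip()
    ]


def to_tables(message):
    """Group segments into header, items, and footer row lists."""
    tables = {"header": [], "items": [], "footer": []}
    for seg in parse_segments(message):
        tag = seg[0]
        if tag in ITEM_TAGS:
            section = "items"
        elif tag in FOOTER_TAGS:
            section = "footer"
        else:
            section = "header"
        tables[section].append({"tag": tag, "elements": seg[1:]})
    return tables


tables = to_tables(SAMPLE)
for name, rows in tables.items():
    print(name, rows)
```

Splitting by tag like this sidesteps the nesting problem rather than
solving it: loops that nest inside other loops would need a parent key on
each row, which is one of the design choices still open.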

*If you wish, I will add you to the repo so you can edit the wiki.*

Please let me know.

Edmon


On Sun, Sep 6, 2015 at 7:16 AM, Edmon Begoli <ebeg...@gmail.com> wrote:

> Matt - that is fantastic. Having good, liberally licensed format
> converters probably takes care of 50% of the problem. The other 50%
> will be in figuring out the logical mapping.
>
> Let me think a little bit and propose how we can best set up a
> collaboration platform. Any suggestions are welcome.
>
> I personally like Google stuff, Hangouts, docs, and Github, of course.
>
>
> On Saturday, September 5, 2015, Matthew Burgess <mattyb...@gmail.com>
> wrote:
>
>> Edmon,
>>
>> All our Data Integration code (e.g., file-format parsing) is Apache-2.0
>> licensed. We have parsers/processors
>> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
>> for EDI / XML (StAX) / HL7 / YAML, etc. I have a plugin
>> <https://github.com/mattyb149/load-text-from-file-plugin> (also
>> Apache-2.0) that uses Tika to extract metadata; this could be refactored
>> as a Drill plugin.
>>
>> The (semi-)structured-to-tabular conversion will be an issue that most
>> Drill extenders will have to deal with, although with powerful functions
>> like KVGEN() and FLATTEN() it should be less daunting. For graphs
>> (highly structured but non-tabular data sources), I'm also looking into a
>> Gremlin <http://tinkerpop.incubator.apache.org/> plugin, which could
>> connect graph databases with Drill. Again, the problem is representing
>> non-tabular data in a SQL environment, as you mentioned.
>>
>> Regards,
>> Matt
>>
>> From:  Edmon Begoli <ebeg...@gmail.com>
>> Reply-To:  <dev@drill.apache.org>
>> Date:  Saturday, September 5, 2015 at 8:46 PM
>> To:  <dev@drill.apache.org>
>> Subject:  Re: Data representation and conversation - translating nested
>> hierarchies into a tabular/queriable format
>>
>> Matt - any contribution of your time is welcome! Thank you.
>>
>> The problems we want to look into are not easy; I would not expect
>> quick solutions, but any good idea, contribution of time, or code will
>> help us advance the state of the capabilities.
>>
>> I might create a branch or separate Github repo, so that we just use its
>> wiki for documentation and collaboration, and then later for scratch pad
>> development.
>>
>> Regarding existing tools you might have - *do you think you could bring
>> this code under the Apache 2 license?*
>> Knowing what you told me before, I think that contributing this code would
>> help advance the state of Drill's format support tremendously.
>>
>> I see two major challenges related to what I am proposing:
>>
>> 1. (greater challenge) How to bring heterogeneously structured data
>> logically and semantically into the tabular orientation of a typical SQL
>> query processing engine.
>> I think that some problems will not be completely implementable, so we'll
>> need to either approximate or make some limiting/bounding design choices.
>>
>> 2. How to support these new formats through the Drill API. This is more of
>> an API study, design, and programming effort. Nothing contradictory.
>>
>> Edmon
>>
>>
>>
>>
>> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <mattyb...@gmail.com> wrote:
>>
>> >  Challenge accepted! :) Are we talking about things like XML, Jsonnet,
>> >  YAML, etc.? And/or binary file formats that are (semi-)structured in
>> >  nature, like XLSX?
>> >
>> >  If we want to go more unstructured we could look at Apache Tika to at
>> >  least pull out metadata on things like image and video files, and I'm
>> >  tinkering with the idea of a UDF called topics() for human-generated
>> >  text using Apache OpenNLP, the problem being a well-trained model for
>> >  the target data.
>> >
>> >  Edmon, I admire your ambition and would like to help out where/when I
>> >  can. Having said that, so far my amount of available time for Drill has
>> >  been embarrassingly lower than my amount of interest.
>> >
>> >  For well-known file formats, I may be able to help with some of our
>> >  open-source tools for parsing such files.
>> >
>> >  Regards,
>> >  Matt
>> >
>> >  Sent from my iPhone
>> >
>> >>  > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <ebeg...@gmail.com> wrote:
>> >>  >
>> >>  > Anyone else from the Drill team wholeheartedly invited.
>> >>  >
>> >>  > Edmon
>> >>  >
>> >>>  >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <ebeg...@gmail.com>
>> >>>  >> wrote:
>> >>>  >>
>> >>>  >> Let's do it, Ted. I think it would add tremendous value to Drill
>> >>>  >> as a solution.
>> >>>  >>
>> >>>  >> I will start a Google doc and share with you so we can share
>> >>>  >> ideas, have Hangouts, design, etc. until we have something solid
>> >>>  >> to put into Drill proper.
>> >>>  >>
>> >>>  >> If you have any other suggestion for the mode of collaboration
>> >>>  >> please let me know.
>> >>>  >>
>> >>>>  >>> On Saturday, September 5, 2015, Ted Dunning
>> >>>>  >>> <ted.dunn...@gmail.com> wrote:
>> >>>>  >>>
>> >>>>>  >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli
>> >>>>>  >>>> <ebeg...@gmail.com> wrote:
>> >>>>>  >>>>
>> >>>>>  >>>> *My question - has this been handled already in Drill and
>> >>>>>  >>>> storage formats?*
>> >>>>>  >>>>
>> >>>>>  >>>> If so, where?
>> >>>>>  >>>>
>> >>>>>  >>>> If not, what is your recommendation for handling this?
>> >>>>>  >>>>
>> >>>>>  >>>> Should it be in an independent library outside of Drill that
>> >>>>>  >>>> presents a flattened version (not sure if this is possible),
>> >>>>>  >>>> or maybe break the message into tables corresponding to
>> >>>>>  >>>> header data, items, footer.
>> >>>>  >>>
>> >>>>  >>> Drill does handle these kinds of data well, but currently the
>> >>>>  >>> only file formats that it can consume for this kind of data are
>> >>>>  >>> JSON and Parquet.
>> >>>>  >>>
>> >>>>  >>> It would be great to have more.  I would love to work on this
>> >>>>  >>> with you.
>> >>>  >>
>> >
>>
>>
>>
>>
