Re: Data representation and conversation - translating nested hierarchies into a tabular/queriable format

Matthew Burgess Sat, 05 Sep 2015 18:03:37 -0700

Edmon,

All our Data Integration code (file-format parsing, for example) is
Apache-2.0 licensed. We have parsers/processors
<https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
for EDI / XML (StAX) / HL7 / YAML, etc. I also have a plugin
<https://github.com/mattyb149/load-text-from-file-plugin> (also Apache-2.0)
that uses Tika to extract metadata; it could be refactored as a Drill plugin.

The (semi-)structured-to-tabular conversion will be an issue that most Drill
extenders will have to deal with, although with powerful functions like
KVGEN() and FLATTEN() it should be less daunting. For graphs
(highly-structured but non-tabular data sources), I'm also looking into a
Gremlin <http://tinkerpop.incubator.apache.org/>  plugin, which could
connect Graph Databases with Drill. Again, the problem is representing
non-tabular data in a SQL environment as you mentioned.
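
As a rough illustration of the conversion problem, the Python sketch below
mimics what KVGEN() and FLATTEN() do conceptually: the first turns a map into
a list of key/value pairs, the second emits one row per element of a repeated
field. The function names, the sample record, and its field names are
assumptions made for illustration only; this is not Drill's implementation
or API.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def kvgen(mapping: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a map into a list of {"key": ..., "value": ...} pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(rows: List[Row], column: str) -> List[Row]:
    """Emit one output row per element of the list stored in `column`."""
    out: List[Row] = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # copy the non-repeated columns
            new_row[column] = element  # replace the list with one element
            out.append(new_row)
    return out


# Hypothetical nested record whose "attrs" keys are not known in advance.
record = {"id": 1, "attrs": {"color": "red", "size": "L"}}

# Map -> list of pairs -> one row per pair.
rows = flatten([{"id": record["id"], "attrs": kvgen(record["attrs"])}], "attrs")
for r in rows:
    print(r["id"], r["attrs"]["key"], r["attrs"]["value"])
```

Each key of the nested map ends up as its own row, which is the shape a SQL
engine can then filter and join on.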

Regards,
Matt

From:  Edmon Begoli <[email protected]>
Reply-To:  <[email protected]>
Date:  Saturday, September 5, 2015 at 8:46 PM
To:  <[email protected]>
Subject:  Re: Data representation and conversation - translating nested
hierarchies into a tabular/queriable format

Matt - any contribution of your time is welcome! Thank you.

The problems we want to look into are not easy; I would not expect quick
solutions, but any good idea or contribution of time or code will help us
advance the state of the capabilities.

I might create a branch or separate GitHub repo, so that we just use its
wiki for documentation and collaboration, and then later for scratch pad
development.

Regarding existing tools you might have - *do you think you could bring
this code under the Apache 2 license?*
Knowing what you told me before, I think that contributing this code would
help advance the state of Drill's format support tremendously.

I see two major challenges related to what I am proposing:

1. (greater challenge) How to bring heterogeneously structured data
logically and semantically into the tabular orientation of a typical SQL
query processing engine.
I think that some problems will not be completely implementable, so we'll
need to either approximate or make some limiting/bounding design choices.

2. How to support these new formats through the Drill API. This is more of
an API study, design, and programming effort. Nothing contradictory.
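
One possible bounding choice for the first challenge, sketched in Python:
flatten nested maps into dotted column names up to a fixed depth and keep
anything deeper (and any list) as a JSON string. The function name, the
max_depth parameter, and the sample message are hypothetical; this only
illustrates the trade-off and is not tied to the Drill API.

```python
import json
from typing import Any, Dict


def flatten_bounded(record: Dict[str, Any], max_depth: int = 2,
                    prefix: str = "", depth: int = 1) -> Dict[str, Any]:
    """Flatten nested maps into dotted columns, down to max_depth levels."""
    columns: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict) and depth < max_depth:
            # Still within the bound: descend and extend the column name.
            columns.update(
                flatten_bounded(value, max_depth, name + ".", depth + 1))
        elif isinstance(value, (dict, list)):
            # Beyond the bound, or a repeated field: keep it as opaque JSON.
            columns[name] = json.dumps(value, sort_keys=True)
        else:
            columns[name] = value
    return columns


# Hypothetical message with header, items, and footer sections.
message = {
    "header": {"sender": "A", "meta": {"version": 1}},
    "items": [{"sku": "x"}, {"sku": "y"}],
    "footer": {"count": 2},
}
row = flatten_bounded(message, max_depth=2)
for column, value in row.items():
    print(column, "=", value)
```

The depth limit keeps the column set predictable at the cost of leaving some
structure unparsed, which is the kind of approximation mentioned in point 1.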

Edmon

On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <[email protected]> wrote:

>  Challenge accepted! :) Are we talking about things like XML, Jsonnet,
>  YAML, etc.? And/or binary file formats that are (semi-)structured in
>  nature, like XLSX?
> 
>  If we want to go more unstructured we could look at Apache Tika to at
>  least pull out metadata on things like image and video files, and I'm
>  tinkering with the idea of a UDF called topics() for human-generated text
>  using Apache OpenNLP, the problem being a well-trained model for the target
>  data.
> 
>  Edmon, I admire your ambition and would like to help out where/when I can.
>  Having said that, so far my amount of available time for Drill has been
>  embarrassingly lower than my amount of interest.
> 
>  For well-known file formats, I may be able to help with some of our
>  open-source tools for parsing such files.
> 
>  Regards,
>  Matt
> 
>  Sent from my iPhone
> 
>>  > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <[email protected]> wrote:
>>  >
>>  > Anyone else from the Drill team wholeheartedly invited.
>>  >
>>  > Edmon
>>  >
>>>  >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <[email protected]> wrote:
>>>  >>
>>>  >> Let's do it, Ted. I think it would add tremendous value to Drill as a
>>>  >> solution.
>>>  >>
>>>  >> I will start a Google doc and share with you so we can share ideas,
>>>  >> have Hangouts, design, etc. until we have something solid to put into
>>>  >> Drill proper.
>>>  >>
>>>  >> If you have any other suggestion for the mode of collaboration please
>>>  >> let me know.
>>>  >>
>>>>  >>> On Saturday, September 5, 2015, Ted Dunning <[email protected]> wrote:
>>>>  >>>
>>>>>  >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <[email protected]> wrote:
>>>>>  >>>>
>>>>>  >>>> *My question - has this been handled already in Drill and storage
>>>>>  >>>> formats?*
>>>>>  >>>>
>>>>>  >>>> If so, where?
>>>>>  >>>>
>>>>>  >>>> If not, what is your recommendation for handling this?
>>>>>  >>>>
>>>>>  >>>> Should it be in an independent library outside of Drill that
>>>>>  >>>> presents a
>>>>>  >>>> flattened version (not sure if this is possible), or maybe break the
>>>>>  >>>> message into tables corresponding to header data, items, footer.
>>>>  >>>
>>>>  >>> Drill does handle these kinds of data well, but currently the only
>>>>  >>> file formats that it can consume for this kind of data are JSON and
>>>>  >>> Parquet.
>>>>  >>>
>>>>  >>> It would be great to have more.  I would love to work on this with
>>>>  >>> you.
>>>  >>
>
