Challenge accepted! :) Are we talking about things like XML, Jsonnet, YAML, 
etc.? And/or binary file formats that are (semi-)structured in nature, like XLSX?

If we want to go more unstructured, we could look at Apache Tika to at least 
pull metadata out of things like image and video files. I'm also tinkering with 
the idea of a UDF called topics() for human-generated text using Apache 
OpenNLP; the problem there is getting a well-trained model for the target data.
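
To make the Tika idea concrete, here's a rough, untested sketch of the kind of 
metadata extraction I mean (plain Tika, not wired into Drill; the file name is 
only a placeholder):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaMetadataSketch {
  public static void main(String[] args) throws Exception {
    // "sample.jpg" is just a placeholder; any image/video/document file works.
    try (InputStream in = Files.newInputStream(Paths.get("sample.jpg"))) {
      AutoDetectParser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
      Metadata metadata = new Metadata();
      parser.parse(in, handler, metadata, new ParseContext());
      // Dump every metadata field Tika detected for the file.
      for (String name : metadata.names()) {
        System.out.println(name + " = " + metadata.get(name));
      }
    }
  }
}

For images that typically surfaces EXIF fields; for video containers, things 
like duration and codec, depending on the parser Tika picks.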

Edmon, I admire your ambition and would like to help out where/when I can. 
Having said that, so far the time I have available for Drill has been 
embarrassingly small compared to my level of interest.

For well-known file formats, I may be able to help with some of our open-source 
tools for parsing such files.

Regards,
Matt

Sent from my iPhone

> On Sep 5, 2015, at 7:44 PM, Edmon Begoli <[email protected]> wrote:
> 
> Anyone else from the Drill team is wholeheartedly invited.
> 
> Edmon
> 
>> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <[email protected]> wrote:
>> 
>> Let's do it, Ted. I think it would add tremendous value to Drill as a
>> solution.
>> 
>> I will start a Google doc and share it with you so we can exchange ideas,
>> have Hangouts, design, etc. until we have something solid to put into Drill
>> proper.
>> 
>> If you have any other suggestions for the mode of collaboration, please let
>> me know.
>> 
>>> On Saturday, September 5, 2015, Ted Dunning <[email protected]> wrote:
>>> 
>>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <[email protected]> wrote:
>>>> 
>>>> *My question - has this been handled already in Drill and storage
>>>> formats?*
>>>> 
>>>> If so, where?
>>>> 
>>>> If not, what is your recommendation for handling this?
>>>> 
>>>> Should it be in an independent library outside of Drill that presents a
>>>> flattened version (not sure if this is possible), or should it break the
>>>> message into tables corresponding to header data, items, and footer?
>>> 
>>> Drill does handle these kinds of data well, but currently the only file
>>> formats that it can consume for this kind of data are JSON and Parquet.
>>> 
>>> It would be great to have more. I would love to work on this with you.
>> 
