Re: Use cases for DFDL

Paul Rogers Thu, 07 Nov 2019 10:35:42 -0800

Hi All,

One thought to add is that if DFDL defines the file schema, then it would be 
ideal to use that schema at plan time as well as run time. Drill's Calcite 
integration provides a means to do this, though I am personally a bit hazy on 
the details.


Certainly getting the reader to work is the first step; thanks Charles for the 
excellent summary. Then, add the needed Calcite integration to make the schema 
available to the planner at plan time.

Thanks,
- Paul

 

    On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre 
<[email protected]> wrote:  
 
 Hi Steve, 
Thanks for responding... Here's how Drill reads a file:

Drill uses what are called "format plugins" which basically read the file in 
question and map fields to column vectors.  Note:  Drill supports nested data 
structures, so a column could contain a MAP or LIST. 

The basic steps are:
1.  Open the input stream and read the file.
2.  If the schema is known, it is advantageous to define the schema using a 
schemaBuilder object in advance and create schemaWriters for each column.  In 
this case, since we'd be using DFDL, we do know the schema so we could create 
the schema BEFORE the data actually gets read.  If the schema is not known in 
advance, JSON for instance, Drill can discover the schema as it is reading the 
data, by dynamically adding column vectors as data is ingested, but that's not 
the case here... 
3.  Once the schema is defined, Drill will then read the file row by row, parse 
the data, and assign values to each column vector. 

There are a few more details but that's the essence.  
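
A minimal Java sketch of the steps above, under stated assumptions: 
SketchColumn and SketchBatchReader are hypothetical stand-ins invented for 
illustration, not Drill's SchemaBuilder or value vector classes, and the 
comma-separated input and the "INT"/"VARCHAR" type labels are assumed as well. 
It only shows the order of operations: declare the columns first, then parse 
row by row into per-column storage.

```java
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a column vector: a named, typed list of values.
// This is NOT Drill's vector API; it only illustrates the idea.
class SketchColumn {
    final String name;
    final String type;                       // illustrative label, e.g. "INT"
    final List<Object> values = new ArrayList<>();

    SketchColumn(String name, String type) {
        this.name = name;
        this.type = type;
    }
}

// Hypothetical reader whose schema is fixed before any data is read.
class SketchBatchReader {
    private final Map<String, SketchColumn> columns = new LinkedHashMap<>();

    // Step 2: declare every column up front, as a known schema allows.
    SketchBatchReader addColumn(String name, String type) {
        columns.put(name, new SketchColumn(name, type));
        return this;
    }

    // Step 1: consume the already-opened input, one line per row.
    void readAll(BufferedReader in) {
        in.lines().filter(line -> !line.isEmpty()).forEach(this::readRow);
    }

    // Step 3: parse one row and append each field to its column.
    void readRow(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != columns.size()) {
            throw new IllegalArgumentException("Expected " + columns.size()
                + " fields but found " + fields.length);
        }
        int i = 0;
        for (SketchColumn col : columns.values()) {
            String raw = fields[i++].trim();
            Object value = col.type.equals("INT") ? (Object) Integer.valueOf(raw) : raw;
            col.values.add(value);
        }
    }

    // Returns the values collected so far for one column.
    List<Object> column(String name) {
        return columns.get(name).values;
    }
}
```

With columns "id" (INT) and "name" (VARCHAR) declared, reading the two rows 
"1,alpha" and "2,beta" leaves two values in each column. A real format plugin 
would write into Drill's own vectors through its column writers instead.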

What would be great is if we could create a function that maps a DFDL schema 
directly to a Drill SchemaBuilder. (Docs here [1])  Drill does natively 
support JSON; however, it would probably be more effective and efficient if 
there were a custom InfosetOutputter for Drill.  Ideally, we need some sort of 
Iterable object so that Drill can map the parsed fields to the schema.  
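
One possible shape for such an Iterable, as a rough sketch only: the 
SketchInfosetEvents interface below is a hypothetical, simplified stand-in, 
and its method names and signatures are assumptions rather than Daffodil's 
actual InfosetOutputter API. The idea is that parse events are collected into 
one flat row per record element, and the reader then iterates over the rows.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified event interface. Daffodil's real InfosetOutputter
// is richer; these method names and signatures are assumptions.
interface SketchInfosetEvents {
    void startComplex(String name);
    void simple(String name, String value);
    void endComplex(String name);
}

// Collects every occurrence of one chosen record element as a flat row
// and exposes the collected rows through Iterable.
class SketchRowOutputter implements SketchInfosetEvents, Iterable<Map<String, String>> {
    private final String recordName;
    private final List<Map<String, String>> rows = new ArrayList<>();
    private Map<String, String> current;     // non-null while inside a record

    SketchRowOutputter(String recordName) {
        this.recordName = recordName;
    }

    @Override
    public void startComplex(String name) {
        if (name.equals(recordName)) {
            current = new LinkedHashMap<>();  // a new row begins
        }
    }

    @Override
    public void simple(String name, String value) {
        if (current != null) {
            current.put(name, value);         // field inside the current record
        }
    }

    @Override
    public void endComplex(String name) {
        if (name.equals(recordName) && current != null) {
            rows.add(current);                // the row is complete
            current = null;
        }
    }

    @Override
    public Iterator<Map<String, String>> iterator() {
        return rows.iterator();
    }
}
```

This sketch buffers all rows in memory and flattens any nesting inside a 
record; an actual integration would more likely write each value straight 
into Drill's column writers and keep MAP and LIST structure intact.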

If you want to look at a relatively simple format plugin, see [2]. This file 
is the BatchReader, which is where most of the heavy lifting takes place.  
This plugin is for ESRI Shape files and has a mix of pre-defined fields, 
nested fields, and fields that are defined after reading starts.


[1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
[2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java


I can start a draft PR on the Drill side over the weekend and will share the 
link with this list.
Respectfully, 
-- C


> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <[email protected]> 
> wrote:
> 
> I definitely agree. Apache Drill seems like a logical place to add
> Daffodil support. And I'm sure many of us, including myself, would be
> happy to provide some time towards this effort.
> 
> The Daffodil API is actually fairly simple and is usually fairly
> straightforward to integrate--most of the complexity comes from the DFDL
> schemas. There's a good "hello world" available [1] that shows more API
> functionality/errors/etc., but the gist of it is:
> 
> 1) Compile a DFDL schema to a data processor:
> 
>  Compiler c = Daffodil.compiler();
>  ProcessorFactory pf = c.compileFile(file);
>  DataProcessor dp = pf.onPath("/");
> 
> 2) Create an input source for the data
> 
>  InputStream is = ...
>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> 
> 3) Create an infoset outputter (we have a handful of different kinds)
> 
>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> 
> 4) Use the DataProcessor to parse the input data to the infoset outputter
> 
>  ParseResult pr = dp.parse(in, out);
> 
> So I guess the part where we need a better understanding of Drill is what
> the InfosetOutputter (step 3) needs to look like to integrate well into
> Drill. Is there a standard data structure in which Drill expects data to
> be represented, and against which Drill runs its queries? And is there
> some sort of schema that Daffodil would need to create to describe what
> this structure looks like so Drill could query it? Perhaps we'd have a
> custom Drill InfosetOutputter that creates this data structure, unless
> Drill already supports XML or JSON.
> 
> Or is it completely up to the Storage Plugin (is that the right term?) to
> determine how to take a Drill query and find the appropriate data from
> the data store?
> 
> - Steve
> 
> [1]
> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> 
> 
> On 11/3/19 9:31 AM, Charles Givre wrote:
>> Hi Julian,
>> It seems like there is a beginning of convergence of the minds here.  I went 
>> to the Apache Roadshow in DC and that was where I learned about DFDL and 
>> immediately thought this was a really interesting possibility.
>> 
>> I'd love to see if we could foster some collaboration between the various 
>> projects on this.  From the Drill side of things, it would make it SO much 
>> easier to get Drill to read (and by extension query) various data types.  
>> I'd be willing to contribute time from the Drill side, but I definitely will 
>> need help understanding how DFDL works.
>> 
>> --C
>> 
>> 
>> 
>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Hi Charles,
>>> this is an interesting idea and in fact we also discussed the same matter 
>>> for Calcite at ApacheCon NA.
>>> But, I agree that it would be really powerful together with a complete 
>>> Runtime like Drill.
>>> Julian
>>> *From:*Charles Givre <[email protected] <mailto:[email protected]>>
>>> *Reply-To:*"[email protected] 
>>> <mailto:[email protected]>" 
>>> <[email protected] <mailto:[email protected]>>
>>> *Date:*Wednesday, October 30, 2019 at 19:38
>>> *To:*"Costello, Roger L." <[email protected] <mailto:[email protected]>>
>>> *Cc:*"[email protected] <mailto:[email protected]>" 
>>> <[email protected] <mailto:[email protected]>>
>>> *Subject:*Re: Use cases for DFDL
>>> +1
>>> 
>>> 
>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> Excellent! Okay, here’s the use case:
>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>> parse any kind of data with Daffodil using a DFDL schema, and then you 
>>>> could use ANSI SQL to query the data, join it with other data, do 
>>>> analysis, etc., just as if it came from a database. So, instead of parsing 
>>>> data to XML and then using XPath to pull out data, you could instead parse 
>>>> data to Apache Drill's data representation and then use ANSI SQL to pull 
>>>> out data, and even combine it with other non-Daffodil data types. The 
>>>> advantage of this would be that it would make it very easy to enable 
>>>> Drill to query new data types (i.e., simply by using a DFDL schema) and it 
>>>> would enable users to easily query this data without having to load it 
>>>> into another system.
>>>> How’s that Charles?
>>>> /Roger
>>>> *From:*Charles Givre <[email protected] <mailto:[email protected]>>
>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
>>>> *To:*Costello, Roger L. <[email protected] <mailto:[email protected]>>
>>>> *Cc:*[email protected] <mailto:[email protected]>
>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It 
>>>> is regular ANSI SQL.  IMHO, I think this would be a really great 
>>>> collaboration between the two communities.
>>>> --C
>>>> 
>>>> 
>>>> 
>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> Thanks again Charles. Is the following use case description correct?
>>>>> A Daffodil extension could be created for Apache Drill so that you could 
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you 
>>>>> could use Apache Drill's query-like syntax and rich capabilities to query 
>>>>> parts of that data, join it with other data, do analysis, etc., just as 
>>>>> if it came from a database. So, instead of parsing data to XML and then 
>>>>> using XPath to pull out data, you could instead parse data to Apache 
>>>>> Drill's data representation and then use Drill's rich data-query 
>>>>> capabilities to pull out data, and even combine it with other 
>>>>> non-Daffodil data types. The advantage of this would be that it would 
>>>>> make it very easy to enable Drill to query new data types (i.e., simply 
>>>>> by using a DFDL schema) and it would enable users to easily query this 
>>>>> data without having to load it into another system.
>>>>> Is that correct?
>>>>> /Roger
>>>>> *From:*Charles Givre <[email protected] <mailto:[email protected]>>
>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
>>>>> *To:*Costello, Roger L. <[email protected] <mailto:[email protected]>>
>>>>> *Cc:*[email protected] <mailto:[email protected]>
>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>> Not exactly...
>>>>> I was thinking of using DFDL to enable Drill to create a schema for data 
>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a 
>>>>> plugin could be written for Drill that mirrors this schema and ultimately 
>>>>> reads the data files.  Drill wouldn't be populating any database, but 
>>>>> rather directly querying the data.
>>>>> The advantage of this would be that it would make it very easy to enable 
>>>>> Drill to query new data types (i.e., simply by using a DFDL schema) and it 
>>>>> would enable users to easily query this data w/o having to load it into 
>>>>> another system.  Does that make sense?
>>>>> -- C
>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill 
>>>>>> to query the database.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> *From:*Charles Givre <[email protected] <mailto:[email protected]>>
>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
>>>>>> *To:*[email protected] <mailto:[email protected]>
>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think 
>>>>>> a compelling use case for DFDL would be enabling Drill to query data 
>>>>>> based on a DFDL schema.  This same concept could be applied to other SQL 
>>>>>> query engines such as Presto and/or Impala.
>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>> -- C
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> Thanks Mike! I updated the slide:
>>>>>>> <image002.png>
>>>>>>> *From:*Beckerle, Mike <[email protected] 
>>>>>>> <mailto:[email protected]>>
>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
>>>>>>> *To:*[email protected] <mailto:[email protected]>
>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>> Parsing data to populate a database (any variety) is the actual case. 
>>>>>>> The fact that we did do one project involving RDF is why I cited that 
>>>>>>> example in particular, but pulling data into any data store/database 
>>>>>>> begins with the ability to parse the data, and then process it into 
>>>>>>> suitable form.
>>>>>>> This is an incomplete list so perhaps this slide title should be 
>>>>>>> "Example Use Cases for DFDL" ?
>>>>>>> ...mikeb
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:*Costello, Roger L. <[email protected] 
>>>>>>> <mailto:[email protected]>>
>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
>>>>>>> *To:*[email protected] 
>>>>>>> <mailto:[email protected]><[email protected] 
>>>>>>> <mailto:[email protected]>>
>>>>>>> *Subject:*Use cases for DFDL
>>>>>>> Hi Folks,
>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? 
>>>>>>> Anything you would add, delete, or change?  /Roger
>>>>>>> <image003.png>
>> 
>
