Re: Use cases for DFDL

Charles Givre Thu, 07 Nov 2019 11:07:21 -0800

Ok... That makes sense.
Do you know if there's some documentation about that new feature?
-- C


> On Nov 7, 2019, at 1:57 PM, Paul Rogers <par0...@yahoo.com> wrote:
> 
> Hi Charles,
> 
> 
> Your suggestion to read the schema in each reader can work. In this case, the 
> planner knows nothing about the schema; it is discovered at scan time, by 
> each reader, as the file is read.
> 
> 
> Let's take a step back. Drill is designed for big data distributed 
> processing. We might imagine having 100+ files of some DFDL format on HDFS, 
> with, say, 10+ Drillbits reading those files in using, say, 50 scan 
> operators in separate threads (minor fragments).
> 
> 
> My hunch is that, since the schema is the same for all files, it would be 
> more efficient to read the schema at plan time, then pass the schema along as 
> part of the "physical plan" to each scan operator. That way, in the scenario 
> above, the schema would be read once (by the planner) rather than 100 times 
> (by each reader in each scan operator.)
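The arithmetic behind this hunch can be sketched in a few lines of Java. This is only a toy model under stated assumptions: `SchemaLoadSketch`, `LoadedSchema`, and `loadSchema` are hypothetical names, not Drill or Daffodil classes, and the sketch ignores that a real physical plan would have to carry the schema in some serialized form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only: every name here is hypothetical, not a Drill or Daffodil API. */
final class SchemaLoadSketch {

    /** Stand-in for a parsed/compiled DFDL schema. */
    static final class LoadedSchema {
        final String source;
        LoadedSchema(String source) { this.source = source; }
    }

    /** Counts calls to the (pretend) expensive load step. */
    static final AtomicInteger LOADS = new AtomicInteger();

    static LoadedSchema loadSchema(String path) {
        LOADS.incrementAndGet();
        return new LoadedSchema(path);
    }

    /** Scan-time discovery: each reader loads the schema itself. */
    static int loadsWhenEachReaderLoads(int readerCount, String schemaPath) {
        LOADS.set(0);
        for (int i = 0; i < readerCount; i++) {
            loadSchema(schemaPath);
        }
        return LOADS.get();
    }

    /** Plan-time loading: the planner loads once and shares the result. */
    static int loadsWhenPlannerLoads(int readerCount, String schemaPath) {
        LOADS.set(0);
        LoadedSchema shared = loadSchema(schemaPath);
        List<LoadedSchema> handedToReaders = new ArrayList<>();
        for (int i = 0; i < readerCount; i++) {
            handedToReaders.add(shared); // every reader receives the same schema
        }
        return LOADS.get();
    }

    public static void main(String[] args) {
        System.out.println(loadsWhenEachReaderLoads(100, "myschema.dfdl")); // prints 100
        System.out.println(loadsWhenPlannerLoads(100, "myschema.dfdl"));    // prints 1
    }
}
```

The saving grows with the number of files, which is why the plan-time approach looks attractive for the 100-file scenario described above.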
> 
> 
> Further, Drill would know the column types, which avoids the ambiguities 
> that occur when types are unknown.
> 
> 
> Arina recently added schema support via a "provided schema." We passed this 
> information to the CSV reader so it can operate with a schema. Perhaps we can 
> look at what Arina did and figure out something similar for this use case. 
> Or, maybe even use the DFDL schema in place of the "provided" schema. Someone 
> will need to poke around a bit to figure out the best answer.
> 
> 
> Thanks,
> 
> - Paul
> 
> 
> 
> On Thursday, November 7, 2019, 10:40:39 AM PST, Charles Givre 
> <cgi...@gmail.com> wrote:
> 
> 
> @Paul, 
> Do you think a format plugin is the right way to integrate this?  My thought 
> was that we could create a folder for dfdl schemata, then the format plugin 
> could specify which schema would be used during read.  For example:
> 
> "dfdl" :{
>   "type":"dfdl",
>   "file":"myschema.dfdl",
>   "extensions":["xml"]
> }
> 
> I was envisioning this working in much the same way as other format plugins 
> that use an external parser.
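As a rough illustration of what such a configuration might carry, here is a minimal plain-Java sketch. `DfdlFormatConfigSketch`, its fields, and the idea of resolving the schema inside a dedicated folder are assumptions taken from the JSON snippet above; this is not Drill's actual format plugin configuration API, which has further requirements (such as JSON serialization) that are not shown.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical holder mirroring the JSON config above; not a real Drill class. */
final class DfdlFormatConfigSketch {
    final String file;             // DFDL schema file name, e.g. "myschema.dfdl"
    final List<String> extensions; // data file extensions this format claims

    DfdlFormatConfigSketch(String file, List<String> extensions) {
        this.file = file;
        this.extensions = extensions;
    }

    /** Resolves the schema inside a schema folder (the folder is an assumption). */
    Path schemaPath(Path schemaDir) {
        return schemaDir.resolve(file);
    }

    /** True if the data file's extension is one this format is configured for. */
    boolean matches(String dataFileName) {
        int dot = dataFileName.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String ext = dataFileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return extensions.contains(ext);
    }

    public static void main(String[] args) {
        DfdlFormatConfigSketch config =
            new DfdlFormatConfigSketch("myschema.dfdl", Arrays.asList("xml"));
        System.out.println(config.matches("orders.xml")); // prints true
        System.out.println(config.matches("orders.csv")); // prints false
        System.out.println(config.schemaPath(Paths.get("dfdl")).getFileName());
    }
}
```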
> -- C
> 
> 
> > On Nov 7, 2019, at 1:35 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> > 
> > Hi All,
> > 
> > One thought to add is that if DFDL defines the file schema, then it would 
> > be ideal to use that schema at plan time as well as run time. Drill's 
> > Calcite integration provides means to do this, though I am personally a bit 
> > hazy on the details.
> > 
> > Certainly getting the reader to work is the first step; thanks Charles for 
> > the excellent summary. Then, add the needed Calcite integration to make the 
> > schema available to the planner at plan time.
> > 
> > Thanks,
> > - Paul
> > 
> > 
> > 
> > On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre 
> > <cgi...@gmail.com> wrote:
> > 
> > Hi Steve, 
> > Thanks for responding... Here's how Drill reads a file:
> > 
> > Drill uses what are called "format plugins" which basically read the file 
> > in question and map fields to column vectors.  Note:  Drill supports nested 
> > data structures, so a column could contain a MAP or LIST. 
> > 
> > The basic steps are:
> > 1.  Open the inputstream and read the file
> > 2.  If the schema is known, it is advantageous to define the schema using a 
> > schemaBuilder object in advance and create schemaWriters for each column.  
> > In this case, since we'd be using DFDL, we do know the schema so we could 
> > create the schema BEFORE the data actually gets read.  If the schema is not 
> > known in advance, JSON for instance, Drill can discover the schema as it is 
> > reading the data, by dynamically adding column vectors as data is ingested, 
> > but that's not the case here... 
> > 3.  Once the schema is defined, Drill will then read the file row by row, 
> > parse the data, and assign values to each column vector. 
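The three steps above can be modeled in plain Java. This is a toy sketch, not Drill code: `ColumnarReadSketch` and its methods are hypothetical stand-ins for Drill's schema builder and column writers, the input is assumed to be simple comma-delimited text, and all values are kept as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "declare the schema first, then fill column vectors row by row". */
final class ColumnarReadSketch {
    // One list per column, keyed by column name in schema order.
    private final Map<String, List<Object>> vectors = new LinkedHashMap<>();

    /** Step 2: declare the columns before any data is read. */
    ColumnarReadSketch(List<String> columnNames) {
        for (String name : columnNames) {
            vectors.put(name, new ArrayList<>());
        }
    }

    /** Step 3: parse one delimited row and append each value to its column. */
    void readRow(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != vectors.size()) {
            throw new IllegalArgumentException(
                "expected " + vectors.size() + " fields: " + line);
        }
        int i = 0;
        for (List<Object> vector : vectors.values()) {
            vector.add(fields[i++]);
        }
    }

    List<Object> column(String name) {
        return vectors.get(name);
    }

    public static void main(String[] args) {
        ColumnarReadSketch batch = new ColumnarReadSketch(Arrays.asList("id", "name"));
        batch.readRow("1,alice");
        batch.readRow("2,bob");
        System.out.println(batch.column("name")); // prints [alice, bob]
    }
}
```

Because the columns exist before the first row arrives, a row that does not fit the schema is rejected immediately instead of silently changing the batch layout.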
> > 
> > There are a few more details but that's the essence.  
> > 
> > What would be great is if we could create a function that could map a 
> > DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill 
> > does natively support JSON, however, it would probably be more effective 
> > and efficient if there was an InfosetOutputter custom for Drill.  Ideally, 
> > we need some sort of Iterable object so that Drill can map the parsed 
> > fields to the schema.  
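One way to picture such a mapping function is a lookup from the XSD simple types that DFDL schemas use to Drill type names. The sketch below is an assumption-laden illustration: `DfdlTypeMappingSketch` is a hypothetical name, the specific type pairings are a guess at a sensible mapping rather than anything either project has defined, and the result is a plain map of names instead of a real Drill SchemaBuilder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical XSD-to-Drill type lookup; the pairings are assumptions. */
final class DfdlTypeMappingSketch {
    private static final Map<String, String> XSD_TO_DRILL = new LinkedHashMap<>();
    static {
        XSD_TO_DRILL.put("xs:string", "VARCHAR");
        XSD_TO_DRILL.put("xs:int", "INT");
        XSD_TO_DRILL.put("xs:long", "BIGINT");
        XSD_TO_DRILL.put("xs:float", "FLOAT4");
        XSD_TO_DRILL.put("xs:double", "FLOAT8");
        XSD_TO_DRILL.put("xs:boolean", "BIT");
    }

    /** Returns the Drill type name for one XSD type, or fails loudly. */
    static String drillTypeFor(String xsdType) {
        String drillType = XSD_TO_DRILL.get(xsdType);
        if (drillType == null) {
            throw new IllegalArgumentException("no mapping for " + xsdType);
        }
        return drillType;
    }

    /** Maps (element name -> XSD type) to (column name -> Drill type), in order. */
    static Map<String, String> toDrillColumns(Map<String, String> elementTypes) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : elementTypes.entrySet()) {
            columns.put(e.getKey(), drillTypeFor(e.getValue()));
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, String> elements = new LinkedHashMap<>();
        elements.put("id", "xs:int");
        elements.put("name", "xs:string");
        System.out.println(toDrillColumns(elements)); // prints {id=INT, name=VARCHAR}
    }
}
```

Nested DFDL elements would need to become MAP or LIST columns, which this flat sketch does not attempt.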
> > 
> > If you want to see a relatively simple format plugin, take a look 
> > here: [2]. This file is the BatchReader which is where most of the heavy 
> > lifting takes place.  This plugin is for ESRI Shape files and has a mix of 
> > pre-defined fields, nested fields and fields that are defined after reading 
> > starts.
> > 
> > 
> > [1]: 
> > https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
> > [2]: 
> > https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
> > 
> > 
> > I can start a draft PR on the Drill side over the weekend and will share 
> > the link to this list.
> > Respectfully, 
> > -- C
> > 
> > 
> >> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <stephen.d.lawre...@gmail.com> 
> >> wrote:
> >> 
> >> I definitely agree. Apache Drill seems like a logical place to add
> >> Daffodil support. And I'm sure many of us, including myself, would be
> >> happy to provide some time towards this effort.
> >> 
> >> The Daffodil API is actually fairly simple and is usually fairly
> >> straightforward to integrate--most of the complexity comes from the DFDL
> >> schemas. There's a good "hello world" available [1] that shows more API
> >> functionality/errors/etc., but the gist of it is:
> >> 
> >> 1) Compile a DFDL schema to a data processor:
> >> 
> >>  Compiler c = Daffodil.compiler();
> >>  ProcessorFactory pf = c.compileFile(file);
> >>  DataProcessor dp = pf.onPath("/");
> >> 
> >> 2) Create an input source for the data
> >> 
> >>  InputStream is = ...
> >>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> >> 
> >> 3) Create an infoset outputter (we have a handful of different kinds)
> >> 
> >>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> >> 
> >> 4) Use the DataProcessor to parse the input data to the infoset outputter
> >> 
> >>  ParseResult pr = dp.parse(in, out);
> >> 
> >> So I guess the part where we need more Drill understanding is what
> >> the InfosetOutputter (step 3) needs to look like to integrate well
> >> into Drill. Is there a standard data structure that Drill expects
> >> data to be represented in, which Drill then queries? And is there
> >> some sort of schema that Daffodil would need to create to describe
> >> this structure so that Drill could query it? Perhaps we'd have a
> >> custom Drill InfosetOutputter that creates this data structure,
> >> unless Drill already supports XML or JSON.
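To make the idea of a custom outputter concrete, here is a minimal sketch of an event-driven collector that turns parse events into rows a reader could iterate over. Everything in it is hypothetical: `RowCollectingOutputterSketch` and its three callbacks are simplified stand-ins, and Daffodil's real InfosetOutputter defines its own callbacks and signatures, which this sketch does not reproduce. It also assumes flat records with no nested structure.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical event collector; not Daffodil's InfosetOutputter interface. */
final class RowCollectingOutputterSketch implements Iterable<Map<String, String>> {
    private final String recordElement;             // element that marks one row
    private final List<Map<String, String>> rows = new ArrayList<>();
    private Map<String, String> current;            // non-null while inside a record

    RowCollectingOutputterSketch(String recordElement) {
        this.recordElement = recordElement;
    }

    /** Called when a complex element opens; a record element starts a new row. */
    void startComplex(String name) {
        if (name.equals(recordElement)) {
            current = new LinkedHashMap<>();
        }
    }

    /** Called for each simple element; stored as a field of the open row. */
    void simple(String name, String value) {
        if (current != null) {
            current.put(name, value);
        }
    }

    /** Called when a complex element closes; a record element completes the row. */
    void endComplex(String name) {
        if (name.equals(recordElement) && current != null) {
            rows.add(current);
            current = null;
        }
    }

    @Override
    public Iterator<Map<String, String>> iterator() {
        return rows.iterator();
    }

    public static void main(String[] args) {
        RowCollectingOutputterSketch out = new RowCollectingOutputterSketch("record");
        out.startComplex("record");
        out.simple("id", "1");
        out.simple("name", "alice");
        out.endComplex("record");
        for (Map<String, String> row : out) {
            System.out.println(row); // prints {id=1, name=alice}
        }
    }
}
```

A production version would more likely write each value straight into Drill's column vectors rather than buffering rows in memory.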
> >> 
> >> Or is it completely up to the Storage Plugin (is that the right term?) to
> >> determine how to take a Drill query and find the appropriate data from
> >> the data store?
> >> 
> >> - Steve
> >> 
> >> [1]
> >> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> >> 
> >> 
> >> On 11/3/19 9:31 AM, Charles Givre wrote:
> >>> Hi Julian,
> >>> It seems like there is a beginning of convergence of the minds here.  I
> >>> went to the Apache Roadshow in DC and that was where I learned about DFDL
> >>> and immediately thought this was a really interesting possibility.
> >>> 
> >>> I'd love to see if we could foster some collaboration between the various
> >>> projects on this.  From the Drill side of things, it would make it SO much
> >>> easier to get Drill to read (and by extension query) various data types.
> >>> I'd be willing to contribute time from the Drill side, but I definitely
> >>> will need help understanding how DFDL works.
> >>> 
> >>> --C
> >>> 
> >>> 
> >>> 
> >>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer 
> >>>> <j.feina...@pragmaticminds.de> wrote:
> >>>> 
> >>>> Hi Charles,
> >>>> this is an interesting idea and in fact we also discussed the same matter
> >>>> for Calcite at ApacheCon NA.
> >>>> But, I agree that it would be really powerful together with a complete
> >>>> runtime like Drill.
> >>>> Julian
> >>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>> *Reply-To:*"us...@daffodil.apache.org" <us...@daffodil.apache.org>
> >>>> *Date:*Wednesday, October 30, 2019 at 19:38
> >>>> *To:*"Costello, Roger L." <coste...@mitre.org>
> >>>> *Cc:*"us...@daffodil.apache.org" <us...@daffodil.apache.org>
> >>>> *Subject:*Re: Use cases for DFDL
> >>>> +1
> >>>> 
> >>>> 
> >>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <coste...@mitre.org> 
> >>>>> wrote:
> >>>>> Excellent! Okay, here’s the use case:
> >>>>> A Daffodil extension could be created for Apache Drill so that you could
> >>>>> parse any kind of data with Daffodil using a DFDL schema, and then you
> >>>>> could use ANSI SQL to query the data, join it with other data, do
> >>>>> analysis, etc., just as if it came from a database. So, instead of
> >>>>> parsing data to XML and then using XPath to pull out data, you could
> >>>>> instead parse data to Apache Drill's data representation and then use
> >>>>> ANSI SQL to pull out data, and even combine it with other non-Daffodil
> >>>>> data types. The advantage for this would be that it would make it very
> >>>>> easy to enable Drill to query new data types (IE simply by using a DFDL
> >>>>> schema) and it would enable users to easily query this data without
> >>>>> having to load it into another system.
> >>>>> How’s that Charles?
> >>>>> /Roger
> >>>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
> >>>>> *To:*Costello, Roger L. <coste...@mitre.org>
> >>>>> *Cc:*us...@daffodil.apache.org
> >>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax.
> >>>>> It is regular ANSI SQL.  IMHO, I think this would be a really great
> >>>>> collaboration of the two communities.
> >>>>> --C
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <coste...@mitre.org> 
> >>>>>> wrote:
> >>>>>> Thanks again Charles. Is the following use case description correct?
> >>>>>> A Daffodil extension could be created for Apache Drill so that you
> >>>>>> could parse any kind of data with Daffodil using a DFDL schema, and
> >>>>>> then you could use Apache Drill's query-like syntax and rich
> >>>>>> capabilities to query parts of that data, join it with other data, do
> >>>>>> analysis, etc., just as if it came from a database. So, instead of
> >>>>>> parsing data to XML and then using XPath to pull out data, you could
> >>>>>> instead parse data to Apache Drill's data representation and then use
> >>>>>> Drill's rich data-query capabilities to pull out data, and even combine
> >>>>>> it with other non-Daffodil data types. The advantage for this would be
> >>>>>> that it would make it very easy to enable Drill to query new data types
> >>>>>> (IE simply by using a DFDL schema) and it would enable users to easily
> >>>>>> query this data without having to load it into another system.
> >>>>>> Is that correct?
> >>>>>> /Roger
> >>>>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
> >>>>>> *To:*Costello, Roger L. <coste...@mitre.org>
> >>>>>> *Cc:*us...@daffodil.apache.org
> >>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>> Not exactly...
> >>>>>> I was thinking of using DFDL to enable Drill to create a schema for
> >>>>>> data that Drill cannot read.  If DFDL can be used to describe the
> >>>>>> schema, a plugin could be written for Drill that mirrors this schema
> >>>>>> and ultimately reads the data files.  Drill wouldn't be populating any
> >>>>>> database, but rather directly querying the data.
> >>>>>> The advantage for this would be that it would make it very easy to
> >>>>>> enable Drill to query new data types (IE simply by using a DFDL schema)
> >>>>>> and it would enable users to easily query this data w/o having to load
> >>>>>> it into another system.  Does that make sense?
> >>>>>> -- C
> >>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <coste...@mitre.org> 
> >>>>>>> wrote:
> >>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
> >>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill
> >>>>>>> to query the database.
> >>>>>>> Is that correct?
> >>>>>>> /Roger
> >>>>>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
> >>>>>>> *To:*us...@daffodil.apache.org
> >>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill.  I think
> >>>>>>> a compelling use case for DFDL would be enabling Drill to query data
> >>>>>>> based on a DFDL schema.  This same concept could be applied to other
> >>>>>>> SQL query engines such as Presto and/or Impala.
> >>>>>>> IMHO, this would facilitate the analysis of data sets supported by
> >>>>>>> DFDL.
> >>>>>>> -- C
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <coste...@mitre.org> 
> >>>>>>>> wrote:
> >>>>>>>> Thanks Mike! I updated the slide:
> >>>>>>>> <image002.png>
> >>>>>>>> *From:*Beckerle, Mike <mbecke...@tresys.com>
> >>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
> >>>>>>>> *To:*us...@daffodil.apache.org
> >>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>>> I would not pick on RDF data stores as the target.
> >>>>>>>> Parsing data to populate a database (any variety) is the actual case.
> >>>>>>>> The fact that we did do one project involving RDF is why I cited that
> >>>>>>>> example in particular, but pulling data into any data store/database
> >>>>>>>> begins with the ability to parse the data, and then process it into
> >>>>>>>> suitable form.
> >>>>>>>> This is an incomplete list so perhaps this slide title should be
> >>>>>>>> "Example Use Cases for DFDL"?
> >>>>>>>> ...mikeb
> >>>>>>>> --------------------------------------------------------------------------------
> >>>>>>>> *From:*Costello, Roger L. <coste...@mitre.org>
> >>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
> >>>>>>>> *To:*us...@daffodil.apache.org <us...@daffodil.apache.org>
> >>>>>>>> *Subject:*Use cases for DFDL
> >>>>>>>> Hi Folks,
> >>>>>>>> I created a slide of use cases. See below. Do you agree with the
> >>>>>>>> slide? Anything you would add, delete, or change?  /Roger
> >>>>>>>> <image003.png>
> >>> 
> >>
