@Paul, do you think a format plugin is the right way to integrate this? My thought was that we could create a folder for DFDL schemata; the format plugin could then specify which schema to use during read, i.e.:
"dfdl": { "type": "dfdl", "file": "myschema.dfdl", "extensions": ["xml"] }

I was envisioning this working in much the same way as other format plugins that use an external parser.
-- C

> On Nov 7, 2019, at 1:35 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> Hi All,
>
> One thought to add is that if DFDL defines the file schema, then it would be ideal to use that schema at plan time as well as at run time. Drill's Calcite integration provides the means to do this, though I am personally a bit hazy on the details.
>
> Certainly getting the reader to work is the first step; thanks Charles for the excellent summary. Then, add the needed Calcite integration to make the schema available to the planner at plan time.
>
> Thanks,
> - Paul
>
> On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre <cgi...@gmail.com> wrote:
>
> Hi Steve,
> Thanks for responding... Here's how Drill reads a file:
>
> Drill uses what are called "format plugins", which read the file in question and map fields to column vectors. Note: Drill supports nested data structures, so a column can contain a MAP or LIST.
>
> The basic steps are:
> 1. Open the InputStream and read the file.
> 2. If the schema is known, it is advantageous to define the schema in advance using a SchemaBuilder object and to create schema writers for each column. In this case, since we'd be using DFDL, we do know the schema, so we could create the schema BEFORE the data actually gets read. If the schema is not known in advance (JSON, for instance), Drill can discover the schema as it reads the data by dynamically adding column vectors as data is ingested, but that's not the case here.
> 3. Once the schema is defined, Drill reads the file row by row, parses the data, and assigns values to each column vector.
>
> There are a few more details, but that's the essence.
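[Editor's note: for context, Drill format-plugin configs of the shape sketched above normally sit under the `formats` key of a file-system storage plugin's JSON config. A hypothetical complete entry might look like the following; the `dfdl` type and its `file` property are assumptions drawn from this thread, not an existing Drill option:]

```json
{
  "type": "file",
  "connection": "file:///",
  "workspaces": { "root": { "location": "/data", "writable": false } },
  "formats": {
    "dfdl": {
      "type": "dfdl",
      "file": "myschema.dfdl",
      "extensions": ["xml"]
    }
  }
}
```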
>
> What would be great is if we could create a function that maps a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1].) Drill does natively support JSON; however, it would probably be more effective and efficient if there were an InfosetOutputter custom-built for Drill. Ideally, we need some sort of Iterable object so that Drill can map the parsed fields to the schema.
>
> If you want to take a look at a relatively simple format plugin, take a look here: [2]. This file is the BatchReader, which is where most of the heavy lifting takes place. This plugin is for ESRI Shape files and has a mix of pre-defined fields, nested fields, and fields that are defined after reading starts.
>
> [1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
> [2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
>
> I can start a draft PR on the Drill side over the weekend and will share the link to this list.
> Respectfully,
> -- C
>
>> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <stephen.d.lawre...@gmail.com> wrote:
>>
>> I definitely agree. Apache Drill seems like a logical place to add Daffodil support. And I'm sure many of us, including myself, would be happy to provide some time towards this effort.
>>
>> The Daffodil API is actually fairly simple and is usually fairly straightforward to integrate -- most of the complexity comes from the DFDL schemas.
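[Editor's note: the "DFDL schema to Drill SchemaBuilder" mapping Charles describes could start as a simple walk over the schema's top-level `xs:element` declarations, collecting name/type pairs to feed a schema builder. A minimal stdlib-only sketch follows; a real integration would use Daffodil's compiled schema metadata rather than raw XSD parsing, and the class and method names here are illustrative only:]

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DfdlSchemaWalk {
    // Collect (element name -> declared type) pairs from a DFDL/XSD fragment,
    // in document order, as a stand-in for driving a Drill SchemaBuilder.
    static Map<String, String> columns(String xsd) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for getElementsByTagNameNS
        Document doc = dbf.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xsd.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> cols = new LinkedHashMap<>();
        NodeList els = doc.getElementsByTagNameNS(
            "http://www.w3.org/2001/XMLSchema", "element");
        for (int i = 0; i < els.getLength(); i++) {
            Element e = (Element) els.item(i);
            if (e.hasAttribute("name") && e.hasAttribute("type")) {
                cols.put(e.getAttribute("name"), e.getAttribute("type"));
            }
        }
        return cols;
    }

    public static void main(String[] args) throws Exception {
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
            "  <xs:element name='id' type='xs:int'/>" +
            "  <xs:element name='name' type='xs:string'/>" +
            "</xs:schema>";
        System.out.println(columns(xsd)); // {id=xs:int, name=xs:string}
    }
}
```

In a real plugin, each (name, type) pair would be translated to the corresponding Drill minor type when building the schema before reading begins.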
>> There's a good "hello world" available [1] that shows more API functionality, error handling, etc., but the gist of it is:
>>
>> 1) Compile a DFDL schema to a data processor:
>>
>>     Compiler c = Daffodil.compiler();
>>     ProcessorFactory pf = c.compileFile(file);
>>     DataProcessor dp = pf.onPath("/");
>>
>> 2) Create an input source for the data:
>>
>>     InputStream is = ...
>>     InputSourceDataInputStream in = new InputSourceDataInputStream(is);
>>
>> 3) Create an infoset outputter (we have a handful of different kinds):
>>
>>     JDOMInfosetOutputter out = new JDOMInfosetOutputter();
>>
>> 4) Use the DataProcessor to parse the input data to the infoset outputter:
>>
>>     ParseResult pr = dp.parse(in, out);
>>
>> So I guess the part where we would need more Drill understanding is what the InfosetOutputter (step 3) needs to look like to better integrate into Drill. Is there a standard data structure that Drill expects representations of data to look like, with Drill doing the querying on that data structure? And is there some sort of schema that Daffodil would need to create to describe what this structure looks like so Drill could query it? Perhaps we'd have a custom Drill InfosetOutputter that creates this data structure, unless Drill already supports XML or JSON.
>>
>> Or is it completely up to the Storage Plugin (is that the right term?) to determine how to take a Drill query and find the appropriate data from the data store?
>>
>> - Steve
>>
>> [1] https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
>>
>> On 11/3/19 9:31 AM, Charles Givre wrote:
>>> Hi Julian,
>>> It seems like there is a beginning of a convergence of minds here. I went to the Apache Roadshow in DC, and that was where I learned about DFDL and immediately thought this was a really interesting possibility.
>>>
>>> I'd love to see if we could foster some collaboration between the various projects on this.
>>> From the Drill side of things, it would make it SO much easier to get Drill to read (and by extension query) various data types. I'd be willing to contribute time from the Drill side, but I definitely will need help understanding how DFDL works.
>>>
>>> --C
>>>
>>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feina...@pragmaticminds.de> wrote:
>>>>
>>>> Hi Charles,
>>>> this is an interesting idea, and in fact we also discussed the same matter for Calcite at ApacheCon NA.
>>>> But I agree that it would be really powerful together with a complete runtime like Drill.
>>>> Julian
>>>> *From:* Charles Givre <cgi...@gmail.com>
>>>> *Reply-To:* "us...@daffodil.apache.org" <us...@daffodil.apache.org>
>>>> *Date:* Wednesday, October 30, 2019, 19:38
>>>> *To:* "Costello, Roger L." <coste...@mitre.org>
>>>> *Cc:* "us...@daffodil.apache.org" <us...@daffodil.apache.org>
>>>> *Subject:* Re: Use cases for DFDL
>>>> +1
>>>>
>>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <coste...@mitre.org> wrote:
>>>>> Excellent! Okay, here's the use case:
>>>>> A Daffodil extension could be created for Apache Drill so that you could parse any kind of data with Daffodil using a DFDL schema, and then you could use ANSI SQL to query the data, join it with other data, do analysis, etc., just as if it came from a database. So, instead of parsing data to XML and then using XPath to pull out data, you could instead parse data to Apache Drill's data representation and then use ANSI SQL to pull out data, and even combine it with other non-Daffodil data types.
>>>>> The advantage of this would be that it would make it very easy to enable Drill to query new data types (i.e., simply by using a DFDL schema), and it would enable users to easily query this data without having to load it into another system.
>>>>> How's that, Charles?
>>>>> /Roger
>>>>> *From:* Charles Givre <cgi...@gmail.com>
>>>>> *Sent:* Wednesday, October 30, 2019 2:28 PM
>>>>> *To:* Costello, Roger L. <coste...@mitre.org>
>>>>> *Cc:* us...@daffodil.apache.org
>>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is regular ANSI SQL. IMHO, I think this would be a really great collaboration of the two communities.
>>>>> --C
>>>>>
>>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <coste...@mitre.org> wrote:
>>>>>> Thanks again Charles. Is the following use case description correct?
>>>>>> A Daffodil extension could be created for Apache Drill so that you could parse any kind of data with Daffodil using a DFDL schema, and then you could use Apache Drill's query syntax and rich capabilities to query parts of that data, join it with other data, do analysis, etc., just as if it came from a database. So, instead of parsing data to XML and then using XPath to pull out data, you could instead parse data to Apache Drill's data representation and then use Drill's rich data-query capabilities to pull out data, and even combine it with other non-Daffodil data types.
>>>>>> The advantage of this would be that it would make it very easy to enable Drill to query new data types (i.e., simply by using a DFDL schema), and it would enable users to easily query this data without having to load it into another system.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> *From:* Charles Givre <cgi...@gmail.com>
>>>>>> *Sent:* Wednesday, October 30, 2019 12:19 PM
>>>>>> *To:* Costello, Roger L. <coste...@mitre.org>
>>>>>> *Cc:* us...@daffodil.apache.org
>>>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>>>> Not exactly...
>>>>>> I was thinking of using DFDL to enable Drill to create a schema for data that Drill cannot read. If DFDL can be used to describe the schema, a plugin could be written for Drill that mirrors this schema and ultimately reads the data files. Drill wouldn't be populating any database, but rather directly querying the data.
>>>>>> The advantage of this would be that it would make it very easy to enable Drill to query new data types (i.e., simply by using a DFDL schema), and it would enable users to easily query this data w/o having to load it into another system. Does that make sense?
>>>>>> -- C
>>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <coste...@mitre.org> wrote:
>>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
>>>>>>> Use DFDL to parse data to populate a database, and then use Apache Drill to query the database.
>>>>>>> Is that correct?
>>>>>>> /Roger
>>>>>>> *From:* Charles Givre <cgi...@gmail.com>
>>>>>>> *Sent:* Wednesday, October 30, 2019 12:01 PM
>>>>>>> *To:* us...@daffodil.apache.org
>>>>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill. I think a compelling use case would be enabling Drill to query data based on a DFDL schema. This same concept could be applied to other SQL query engines such as Presto and/or Impala.
>>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>>> -- C
>>>>>>>
>>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <coste...@mitre.org> wrote:
>>>>>>>> Thanks Mike! I updated the slide:
>>>>>>>> <image002.png>
>>>>>>>> *From:* Beckerle, Mike <mbecke...@tresys.com>
>>>>>>>> *Sent:* Wednesday, October 30, 2019 11:45 AM
>>>>>>>> *To:* us...@daffodil.apache.org
>>>>>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>>> Parsing data to populate a database (any variety) is the actual use case. The fact that we did one project involving RDF is why I cited that example in particular, but pulling data into any data store/database begins with the ability to parse the data and then process it into a suitable form.
>>>>>>>> This is an incomplete list, so perhaps this slide title should be "Example Use Cases for DFDL"?
>>>>>>>> ...mikeb
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> *From:* Costello, Roger L.
>>>>>>>> <coste...@mitre.org>
>>>>>>>> *Sent:* Monday, October 28, 2019 10:41 AM
>>>>>>>> *To:* us...@daffodil.apache.org
>>>>>>>> *Subject:* Use cases for DFDL
>>>>>>>> Hi Folks,
>>>>>>>> I created a slide of use cases. See below. Do you agree with the slide? Anything you would add, delete, or change? /Roger
>>>>>>>> <image003.png>