Ok... That makes sense. Do you know if there's some documentation about that new feature?
-- C
> On Nov 7, 2019, at 1:57 PM, Paul Rogers <par0...@yahoo.com> wrote:
>
> Hi Charles,
>
> Your suggestion to read the schema in each reader can work. In this case, the
> planner knows nothing about the schema; it is discovered at scan time, by
> each reader, as the file is read.
>
> Let's take a step back. Drill is designed for big data distributed
> processing. We might imagine having 100+ files of some DFDL format on HDFS,
> with, say, 10+ Drillbits reading those files using, say, 50 scan
> operators in separate threads (minor fragments).
>
> My hunch is that, since the schema is the same for all files, it would be
> more efficient to read the schema at plan time, then pass the schema along as
> part of the "physical plan" to each scan operator. That way, in the scenario
> above, the schema would be read once (by the planner) rather than 100 times
> (by each reader in each scan operator).
>
> Further, Drill would know the types of the columns, which avoids the
> ambiguities that occur when types are unknown.
>
> Arina recently added schema support via a "provided schema." We passed this
> information to the CSV reader so it can operate with a schema. Perhaps we can
> look at what Arina did and figure out something similar for this use case.
> Or, maybe even use the DFDL schema in place of the "provided" schema. Someone
> will need to poke around a bit to figure out the best answer.
>
> Thanks,
>
> - Paul
>
>
> On Thursday, November 7, 2019, 10:40:39 AM PST, Charles Givre
> <cgi...@gmail.com> wrote:
>
> @Paul,
> Do you think a format plugin is the right way to integrate this? My thought
> was that we could create a folder for DFDL schemata, then the format plugin
> could specify which schema would be used during read, i.e.:
>
> "dfdl": {
>   "type": "dfdl",
>   "file": "myschema.dfdl",
>   "extensions": ["xml"]
> }
>
> I was envisioning this working in much the same way as other format plugins
> that use an external parser.
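The difference between scan-time and plan-time schema loading in the scenario above can be sketched as a toy model. This is only an illustration under stated assumptions: `DfdlFormatConfig`, `SchemaLoader`, and `Reader` are hypothetical names invented here, not Drill or Daffodil classes, and the "schema" is just a string standing in for whatever a compiled DFDL schema would be.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these types are Drill or Daffodil classes. */
final class SchemaLoadingSketch {

  /** Hypothetical mirror of the "dfdl" format config quoted above. */
  static final class DfdlFormatConfig {
    final String type;
    final String file;
    final List<String> extensions;

    DfdlFormatConfig(String type, String file, List<String> extensions) {
      this.type = type;
      this.file = file;
      this.extensions = extensions;
    }
  }

  /** Stand-in for the expensive step of reading and compiling a schema file. */
  static final class SchemaLoader {
    int loadCount = 0;

    String load(DfdlFormatConfig config) {
      loadCount++;
      return "compiled:" + config.file;
    }
  }

  /** A reader that is handed an already loaded schema. */
  static final class Reader {
    final String schema;

    Reader(String schema) {
      this.schema = schema;
    }
  }

  /** Scan-time discovery: every reader loads the schema for itself. */
  static int loadsAtScanTime(DfdlFormatConfig config, int fileCount) {
    SchemaLoader loader = new SchemaLoader();
    List<Reader> readers = new ArrayList<>();
    for (int i = 0; i < fileCount; i++) {
      readers.add(new Reader(loader.load(config)));  // one load per file
    }
    return loader.loadCount;
  }

  /** Plan-time loading: one load by the planner, shared by all readers. */
  static int loadsAtPlanTime(DfdlFormatConfig config, int fileCount) {
    SchemaLoader loader = new SchemaLoader();
    String shared = loader.load(config);             // single load up front
    List<Reader> readers = new ArrayList<>();
    for (int i = 0; i < fileCount; i++) {
      readers.add(new Reader(shared));               // passed along with the plan
    }
    return loader.loadCount;
  }
}
```

For 100 files, `loadsAtScanTime` reports 100 loads and `loadsAtPlanTime` reports 1, which is the saving described in the message. How a real plugin would carry the schema inside Drill's physical plan is not shown here and would need to follow whatever the "provided schema" work does.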
> -- C
>
>
> > On Nov 7, 2019, at 1:35 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> >
> > Hi All,
> >
> > One thought to add is that if DFDL defines the file schema, then it would
> > be ideal to use that schema at plan time as well as run time. Drill's
> > Calcite integration provides means to do this, though I am personally a bit
> > hazy on the details.
> >
> > Certainly getting the reader to work is the first step; thanks Charles for
> > the excellent summary. Then, add the needed Calcite integration to make the
> > schema available to the planner at plan time.
> >
> > Thanks,
> > - Paul
> >
> >
> > On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre
> > <cgi...@gmail.com> wrote:
> >
> > Hi Steve,
> > Thanks for responding... Here's how Drill reads a file:
> >
> > Drill uses what are called "format plugins" which basically read the file
> > in question and map fields to column vectors. Note: Drill supports nested
> > data structures, so a column could contain a MAP or LIST.
> >
> > The basic steps are:
> > 1. Open the input stream and read the file.
> > 2. If the schema is known, it is advantageous to define the schema using a
> >    SchemaBuilder object in advance and create schemaWriters for each column.
> >    In this case, since we'd be using DFDL, we do know the schema, so we could
> >    create the schema BEFORE the data actually gets read. If the schema is not
> >    known in advance (JSON, for instance), Drill can discover the schema as it
> >    is reading the data, by dynamically adding column vectors as data is
> >    ingested, but that's not the case here...
> > 3. Once the schema is defined, Drill will then read the file row by row,
> >    parse the data, and assign values to each column vector.
> >
> > There are a few more details but that's the essence.
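Steps 2 and 3 above can be sketched in plain Java. This is a toy model, not Drill code: the "column vectors" are ordinary lists, `defineSchema` and `readRows` are names made up for the illustration, and the comma-separated input is an assumption chosen only to keep the parsing trivial.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: these are not Drill's SchemaBuilder or vector classes. */
final class ColumnarReadSketch {

  /** Step 2: declare the columns before any data is read. */
  static Map<String, List<Object>> defineSchema(String... columnNames) {
    Map<String, List<Object>> columns = new LinkedHashMap<>();
    for (String name : columnNames) {
      columns.put(name, new ArrayList<>());   // one "vector" per column
    }
    return columns;
  }

  /** Step 3: parse each row and append its values to the column vectors. */
  static void readRows(Iterable<String> lines, Map<String, List<Object>> columns) {
    List<String> names = new ArrayList<>(columns.keySet());
    for (String line : lines) {
      String[] fields = line.split(",", -1);
      for (int i = 0; i < names.size(); i++) {
        // A row with too few fields yields null so every vector stays aligned.
        Object value = i < fields.length ? fields[i].trim() : null;
        columns.get(names.get(i)).add(value);
      }
    }
  }
}
```

The point of the sketch is the ordering: the column set is fixed first, then every row writes into the same vectors, so all vectors end up with one entry per row. In the JSON-style case the schema would instead grow while rows are being read.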
> >
> > What would be great is if we could create a function that could
> > map a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1]) Drill
> > does natively support JSON; however, it would probably be more effective
> > and efficient if there was an InfosetOutputter custom for Drill. Ideally,
> > we need some sort of Iterable object so that Drill can map the parsed
> > fields to the schema.
> >
> > If you want to take a look at a relatively simple format plugin, take a look
> > here: [2]. This file is the BatchReader, which is where most of the heavy
> > lifting takes place. This plugin is for ESRI Shape files and has a mix of
> > pre-defined fields, nested fields and fields that are defined after reading
> > starts.
> >
> >
> > [1]:
> > https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
> > [2]:
> > https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
> >
> >
> > I can start a draft PR on the Drill side over the weekend and will share
> > the link to this list.
> > Respectfully,
> > -- C
> >
> >
> >> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <stephen.d.lawre...@gmail.com>
> >> wrote:
> >>
> >> I definitely agree. Apache Drill seems like a logical place to add
> >> Daffodil support. And I'm sure many of us, including myself, would be
> >> happy to provide some time towards this effort.
> >>
> >> The Daffodil API is actually fairly simple and is usually fairly
> >> straightforward to integrate--most of the complexity comes from the DFDL
> >> schemas. There's a good "hello world" available [1] that shows more API
> >> functionality/errors/etc., but the gist of it is:
> >>
> >> 1) Compile a DFDL schema to a data processor:
> >>
> >>     Compiler c = Daffodil.compiler();
> >>     ProcessorFactory pf = c.compileFile(file);
> >>     DataProcessor dp = pf.onPath("/");
> >>
> >> 2) Create an input source for the data:
> >>
> >>     InputStream is = ...
> >>     InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> >>
> >> 3) Create an infoset outputter (we have a handful of different kinds):
> >>
> >>     JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> >>
> >> 4) Use the DataProcessor to parse the input data to the infoset outputter:
> >>
> >>     ParseResult pr = dp.parse(in, out);
> >>
> >> So I guess the part where we would need more Drill understanding is what
> >> the InfosetOutputter (step 3) needs to look like to better integrate
> >> into Drill. Is there a standard data structure that Drill expects
> >> representations of data to look like, with Drill doing the querying on
> >> that data structure? And is there some sort of schema that Daffodil would
> >> need to create to describe what this structure looks like so Drill could
> >> query it? Perhaps we'd have a custom Drill InfosetOutputter that creates
> >> this data structure, unless Drill already supports XML or JSON.
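One way to picture the "custom Drill InfosetOutputter" idea is an event receiver that turns start/end events into a nested map, roughly the shape of a Drill MAP column. The sketch below is hypothetical throughout: it does not extend Daffodil's `InfosetOutputter`, the method names `startComplex`, `simple`, and `endComplex` are simplified stand-ins chosen for the illustration, and repeated elements (arrays) are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: this does not extend Daffodil's InfosetOutputter. */
final class RowBuildingOutputterSketch {
  // Stack of maps currently being filled; the top is the innermost element.
  private final Deque<Map<String, Object>> open = new ArrayDeque<>();
  private Map<String, Object> root;

  /** A complex element begins: open a nested map under its parent. */
  void startComplex(String name) {
    Map<String, Object> map = new LinkedHashMap<>();
    if (open.isEmpty()) {
      root = map;                       // outermost element becomes the row
    } else {
      open.peek().put(name, map);
    }
    open.push(map);
  }

  /** A simple element: store its value in the innermost open map. */
  void simple(String name, Object value) {
    if (open.isEmpty()) {
      throw new IllegalStateException("simple element outside any complex element");
    }
    open.peek().put(name, value);
  }

  /** A complex element ends: close the innermost map. */
  void endComplex() {
    open.pop();
  }

  /** The finished row, or null if no element was ever started. */
  Map<String, Object> result() {
    return root;
  }
}
```

Feeding it `startComplex("record")`, `simple("id", 7)`, `startComplex("position")`, `simple("lat", 1.5)`, and two `endComplex()` calls yields a row whose `position` entry is itself a map. A real integration would write into Drill's column writers rather than building intermediate maps, which is presumably where the efficiency gain over going through JSON or XML would come from.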
> >>
> >> Or is it completely up to the Storage Plugin (is that the right term) to
> >> determine how to take a Drill query and find the appropriate data from
> >> the data store?
> >>
> >> - Steve
> >>
> >> [1]
> >> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> >>
> >>
> >> On 11/3/19 9:31 AM, Charles Givre wrote:
> >>> Hi Julian,
> >>> It seems like there is a beginning of convergence of the minds here. I
> >>> went to the Apache Roadshow in DC and that was where I learned about DFDL
> >>> and immediately thought this was a really interesting possibility.
> >>>
> >>> I'd love to see if we could foster some collaboration between the various
> >>> projects on this. From the Drill side of things, it would make it SO much
> >>> easier to get Drill to read (and by extension query) various data types.
> >>> I'd be willing to contribute time from the Drill side, but I definitely
> >>> will need help understanding how DFDL works.
> >>>
> >>> --C
> >>>
> >>>
> >>>
> >>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer
> >>>> <j.feina...@pragmaticminds.de> wrote:
> >>>>
> >>>> Hi Charles,
> >>>> this is an interesting idea and in fact we also discussed the same
> >>>> matter for Calcite at ApacheCon NA.
> >>>> But, I agree that it would be really powerful together with a complete
> >>>> runtime like Drill.
> >>>> Julian
> >>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>> *Reply-To:*"us...@daffodil.apache.org" <us...@daffodil.apache.org>
> >>>> *Date:*Wednesday, October 30, 2019 at 19:38
> >>>> *To:*"Costello, Roger L." <coste...@mitre.org>
> >>>> *Cc:*"us...@daffodil.apache.org" <us...@daffodil.apache.org>
> >>>> *Subject:*Re: Use cases for DFDL
> >>>> +1
> >>>>
> >>>>
> >>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <coste...@mitre.org>
> >>>>> wrote:
> >>>>> Excellent! Okay, here's the use case:
> >>>>> A Daffodil extension could be created for Apache Drill so that you could
> >>>>> parse any kind of data with Daffodil using a DFDL schema, and then you
> >>>>> could use ANSI SQL to query the data, join it with other data, do
> >>>>> analysis, etc., just as if it came from a database. So, instead of
> >>>>> parsing data to XML and then using XPath to pull out data, you could
> >>>>> instead parse data to Apache Drill's data representation and then use
> >>>>> ANSI SQL to pull out data, and even combine it with other non-Daffodil
> >>>>> data types.
> >>>>> The advantage for this would be that it would make it very easy to
> >>>>> enable Drill to query new data types (i.e., simply by using a DFDL
> >>>>> schema) and it would enable users to easily query this data without
> >>>>> having to load it into another system.
> >>>>> How's that Charles?
> >>>>> /Roger
> >>>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>>> *Sent:*Wednesday, October 30, 2019 2:28 PM
> >>>>> *To:*Costello, Roger L. <coste...@mitre.org>
> >>>>> *Cc:*us...@daffodil.apache.org
> >>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax.
> >>>>> It is regular ANSI SQL. IMHO, I think this would be a really great
> >>>>> collaboration of the two communities.
> >>>>> --C
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <coste...@mitre.org>
> >>>>>> wrote:
> >>>>>> Thanks again Charles. Is the following use case description correct?
> >>>>>> A Daffodil extension could be created for Apache Drill so that you
> >>>>>> could parse any kind of data with Daffodil using a DFDL schema, and
> >>>>>> then you could use Apache Drill's query-like syntax and rich
> >>>>>> capabilities to query parts of that data, join it with other data, do
> >>>>>> analysis, etc., just as if it came from a database.
> >>>>>> So, instead of parsing data to XML and then using XPath to pull out
> >>>>>> data, you could instead parse data to Apache Drill's data
> >>>>>> representation and then use Drill's rich data-query capabilities to
> >>>>>> pull out data, and even combine it with other non-Daffodil data types.
> >>>>>> The advantage for this would be that it would make it very easy to
> >>>>>> enable Drill to query new data types (i.e., simply by using a DFDL
> >>>>>> schema) and it would enable users to easily query this data without
> >>>>>> having to load it into another system.
> >>>>>> Is that correct?
> >>>>>> /Roger
> >>>>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>>>> *Sent:*Wednesday, October 30, 2019 12:19 PM
> >>>>>> *To:*Costello, Roger L. <coste...@mitre.org>
> >>>>>> *Cc:*us...@daffodil.apache.org
> >>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>> Not exactly...
> >>>>>> I was thinking of using DFDL to enable Drill to create a schema for
> >>>>>> data that Drill cannot read. If DFDL can be used to describe the
> >>>>>> schema, a plugin could be written for Drill that mirrors this schema
> >>>>>> and ultimately reads the data files. Drill wouldn't be populating any
> >>>>>> database, but rather directly querying the data.
> >>>>>> The advantage for this would be that it would make it very easy to
> >>>>>> enable Drill to query new data types (i.e., simply by using a DFDL
> >>>>>> schema) and it would enable users to easily query this data without
> >>>>>> having to load it into another system. Does that make sense?
> >>>>>> -- C
> >>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <coste...@mitre.org>
> >>>>>>> wrote:
> >>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
> >>>>>>> Use DFDL to parse data to populate a database and then use Apache
> >>>>>>> Drill to query the database.
> >>>>>>> Is that correct?
> >>>>>>> /Roger
> >>>>>>> *From:*Charles Givre <cgi...@gmail.com>
> >>>>>>> *Sent:*Wednesday, October 30, 2019 12:01 PM
> >>>>>>> *To:*us...@daffodil.apache.org
> >>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill. I
> >>>>>>> think a compelling use case for DFDL would be enabling Drill to query
> >>>>>>> data based on a DFDL schema. This same concept could be applied to
> >>>>>>> other SQL query engines such as Presto and/or Impala.
> >>>>>>> IMHO, this would facilitate the analysis of data sets supported by
> >>>>>>> DFDL.
> >>>>>>> -- C
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <coste...@mitre.org>
> >>>>>>>> wrote:
> >>>>>>>> Thanks Mike!
> >>>>>>>> I updated the slide:
> >>>>>>>> <image002.png>
> >>>>>>>> *From:*Beckerle, Mike <mbecke...@tresys.com>
> >>>>>>>> *Sent:*Wednesday, October 30, 2019 11:45 AM
> >>>>>>>> *To:*us...@daffodil.apache.org
> >>>>>>>> *Subject:*[EXT] Re: Use cases for DFDL
> >>>>>>>> I would not pick on RDF data stores as the target.
> >>>>>>>> Parsing data to populate a database (any variety) is the actual
> >>>>>>>> case. The fact that we did do one project involving RDF is why I
> >>>>>>>> cited that example in particular but pulling data into any data
> >>>>>>>> store/data base begins with the ability to parse the data, and then
> >>>>>>>> process it into suitable form.
> >>>>>>>> This is an incomplete list so perhaps this slide title should be
> >>>>>>>> "Example Use Cases for DFDL" ?
> >>>>>>>> ...mikeb
> >>>>>>>> --------------------------------------------------------------------------------
> >>>>>>>> *From:*Costello, Roger L. <coste...@mitre.org>
> >>>>>>>> *Sent:*Monday, October 28, 2019 10:41 AM
> >>>>>>>> *To:*us...@daffodil.apache.org <us...@daffodil.apache.org>
> >>>>>>>> *Subject:*Use cases for DFDL
> >>>>>>>> Hi Folks,
> >>>>>>>> I created a slide of use cases. See below. Do you agree with the
> >>>>>>>> slide? Anything you would add, delete, or change? /Roger
> >>>>>>>> <image003.png>
> >>>
> >>