Hi Mike, I hope all is well. I remembered one other piece which might be useful for you. Drill has an interface called a PersistentStore which is used for storing artifacts such as tokens. I've used it on two occasions: in the GoogleSheets plugin and the HTTP plugin. In both cases, I used it to store OAuth user tokens, which need to be preserved and shared across drillbits, and are also frequently updated. I was thinking that this might be useful for caching the DFDL schemata. You can see how I used it here:

https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java
https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java
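A rough sketch of how that might look for caching compiled DFDL schemas, based on memory of the store API used in the OAuth code above (the class name DfdlSchemaCache, the store name, and the idea of storing the location of a pre-compiled .bin file are all made up for illustration; verify the store calls against the linked code):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.drill.exec.store.sys.PersistentStore;
import org.apache.drill.exec.store.sys.PersistentStoreConfig;
import org.apache.drill.exec.store.sys.PersistentStoreProvider;

// Sketch only: caches a pointer to a compiled DFDL schema, keyed by the schema URI.
public class DfdlSchemaCache {
  private final PersistentStore<String> store;

  public DfdlSchemaCache(PersistentStoreProvider provider) throws Exception {
    // The value type must be Jackson-serializable; a String (e.g., the location of a
    // pre-compiled .bin file) keeps the sketch simple, but a small POJO would work too.
    PersistentStoreConfig<String> config = PersistentStoreConfig
        .newJacksonBuilder(new ObjectMapper(), String.class)
        .name("dfdl.schema.cache")          // illustrative store name
        .build();
    store = provider.getOrCreateStore(config);
  }

  public String lookup(String schemaUri) {
    return store.get(schemaUri);            // null if nothing cached for this schema yet
  }

  public void save(String schemaUri, String compiledSchemaLocation) {
    store.put(schemaUri, compiledSchemaLocation);
  }
}

The provider-backed store is what lets an entry be shared across drillbits, which is why it fits the token use case and, potentially, schema caching.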
Best,
-- C

> On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>
> Very helpful.
>
> Answers to your questions, and comments, are below:
>
> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgi...@gmail.com> wrote:
>
>> Hi Mike,
>> I hope all is well. I'll take a stab at answering your questions. But I have a few questions as well:
>>
>> 1. Are you writing a storage or format plugin for DFDL? My thinking was that this would be a format plugin, but let me know if you were thinking differently.
>
> Format plugin.
>
>> 2. In traditional deployments, where do people store the DFDL schema files? Are they local or accessible via URL?
>
> Schemas are stored in files, or in jar files created when packaging a schema project, so a URI is the preferred identifier for them. They are not retrieved remotely or anything like that. It's a matter of whether they are in jars on the classpath, in directories on the classpath, or just at a file location.
>
> The source code of a DFDL schema is often created using other schemas as components, so a single "DFDL schema" may have parts that come from 5 jar files on the classpath, e.g., 2 different header schemas, a library schema, and the "main" schema that assembles them all. Inside schemas they refer to each other via xs:include or xs:import, and the schemaLocation attribute takes a URI to the location of the included/imported schema. Those URIs are interpreted the same way we would want Drill to identify the location of a schema.
>
> However, people will really want to pre-compile any real (non-toy/test) DFDL schema into a binary ".bin" file for faster loading. Otherwise Daffodil schema compilation time can be excessive (minutes for large DFDL schemas; for example, the DFDL schema for VMF is 180K lines of DFDL). A compiled schema lives in exactly one file, which is relatively small (the compiled form of the VMF schema is 8 MB). So the schema path given in a Drill SQL query, or in the config, should be allowed to be either a compiled schema or a source-code schema (.xsd), the latter mostly for test, training, and toy examples that we would compile on the fly.
>
>> To get the DFDL schema file or URL we have a few options, all of which revolve around setting a config variable. For now, let's just say that the schema file is contained in the same folder as the data. (We can make this more sophisticated later...)
>
> It would make life difficult if the schemas and test data must be co-resident. Most schema projects keep these in entirely separate sub-trees: schemas under src/main/resources/..../xsd, compiled schemas under target/..., and test data under src/test/resources/.../data.
>
> For now I think the easiest thing is that we just get two URIs, one for the data and one for the schema, and access them via getClass().getResource().
>
> We should not worry about caching or anything for now. Once the above works for a decent scope of tests, we can worry about making it more convenient to have a library of schemas at one's disposal.
>
>> Here's what you have to do.
>>
>> 1. In the formatConfig file, define a String called 'dfdlSchema'. Note that config variables must be private and final; if they aren't, it can cause weird errors that are really difficult to debug. For some reference, take a look at the Excel plugin (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java).
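A minimal sketch of what such a format config class might look like, following the ExcelFormatConfig pattern (the class name DFDLFormatConfig, the "dfdl" type name, and the default extension are placeholders, not existing Drill code):

import java.util.Collections;
import java.util.List;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.FormatPluginConfig;

// Sketch only: modeled on ExcelFormatConfig; adjust names and defaults as needed.
@JsonTypeName("dfdl")
@JsonInclude(JsonInclude.Include.NON_DEFAULT)
public class DFDLFormatConfig implements FormatPluginConfig {

  // Config variables must be private and final; anything else can break the
  // Jackson (de)serialization of plugin configs in hard-to-debug ways.
  private final List<String> extensions;
  private final String dfdlSchema;   // default DFDL schema URI for this plugin instance

  @JsonCreator
  public DFDLFormatConfig(
      @JsonProperty("extensions") List<String> extensions,
      @JsonProperty("dfdlSchema") String dfdlSchema) {
    this.extensions = extensions == null ? Collections.singletonList("dat") : extensions;
    this.dfdlSchema = dfdlSchema == null ? "" : dfdlSchema;
  }

  public List<String> getExtensions() {
    return extensions;
  }

  public String getDfdlSchema() {
    return dfdlSchema;
  }

  // equals(), hashCode(), and toString() over all fields are also expected,
  // as in ExcelFormatConfig; omitted here for brevity.
}

With a field like dfdlSchema in place, the same name becomes the parameter a user can override per workspace or, as described next, per query via the table() function.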
>> Setting a config variable there will allow a user to set a global schema definition. This can also be configured individually for various workspaces. So let's say you had PCAP files in one workspace: you could globally set the DFDL file for that, and for another workspace which has some other kind of file, you could create another DFDL plugin instance.
>
> Ok, so the above lets me play with Drill and one schema by default. Ok for using Drill to explore data, and useful for testing.
>
>> Now, this is all fine and good, but a user might also want to define the schema file at query time. The good news is that Drill allows you to do that via the table() function.
>
> This would allow real data-integration queries against multiple different DFDL-described data sources. Needed for a compelling demo.
>
>> So let's say that we want to use a different schema file than the default; we could do something like this:
>>
>> SELECT ...
>> FROM table(dfs.dfdl_workspace.`myfile` (type => 'dfdl', dfdlSchema => 'path_to_schema'))
>>
>> Take a look at the Excel docs (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md), which demonstrate how to write queries like that. I believe that the parameters in the table() function take higher precedence than the parameters from the config. That would make sense, at least.
>
> Perfect. I'll start with this.
>
>> 2. Now that we have the schema file, the next thing would be to convert that into a Drill schema. Let's say that we have a function called dfdlToDrill that handles the conversion.
>>
>> What you'd have to do is set the schema in the constructor for the BatchReader. So, pseudocode:
>>
>> public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
>>   // Other stuff...
>>
>>   // Get the Drill schema from the DFDL schema
>>   TupleMetadata schema = dfdlToDrill(<dfdl schema file>);
>>
>>   // Here's the important part
>>   negotiator.tableSchema(schema, true);
>> }
>>
>> The negotiator.tableSchema() method accepts two args: a TupleMetadata and a boolean indicating whether the schema is final or not. Once this schema has been added to the negotiator object, you can then create the writers.
>
> That negotiator.tableSchema() is ideal. I was hoping that this was going to be the only place the metadata had to be given to Drill. Excellent.
>
>> Take a look here:
>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>>
>> I see Paul just responded, so I'll leave you with this. If you have additional questions, send them our way. Do take a look at the Excel plugin, as I think it will be helpful.
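As a concrete illustration of what a dfdlToDrill() helper would return, here is a hand-built TupleMetadata using Drill's SchemaBuilder (the element names and types are invented; a real implementation would walk the compiled Daffodil schema to produce this):

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

// Sketch only: hard-codes the schema a dfdlToDrill(<dfdl schema file>) call might
// derive for a hypothetical record with a nested header, a repeating int, and a string.
public class DfdlToDrillSketch {
  public static TupleMetadata exampleSchema() {
    return new SchemaBuilder()
        .addMap("header")                        // DFDL complex element -> Drill MAP
          .addNullable("messageId", MinorType.INT)
          .addNullable("version", MinorType.INT)
          .resumeSchema()
        .addArray("counts", MinorType.INT)       // repeating simple element -> repeated INT
        .addNullable("name", MinorType.VARCHAR)
        .buildSchema();
  }
}

Passing the result to negotiator.tableSchema(schema, true) is what lets Drill treat the DFDL-derived schema as complete up front.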
> Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can work similarly.
>
> This will take me a few more days to get to a pull request. The first one will be for initial review, i.e., not intended to be merged without more tests. It will probably support only integer data fields, but it should support lots of data shapes, including vectors, choices, sequences, nested records, etc.
>
> Thanks for the help.
>
>>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>>
>>> So when a data format is described by a DFDL schema, I can generate an equivalent Drill schema (TupleMetadata). This schema is always complete. I have unit tests working with this.
>>>
>>> To do this for a real SQL query, I need the DFDL schema to be identified on the SQL query by a file path or URI.
>>>
>>> Q: How do I get that DFDL schema file/URI parameter from the SQL query?
>>>
>>> Next, assuming I have the DFDL schema identified, I generate an equivalent Drill TupleMetadata from it (or, hopefully, retrieve it from a cache).
>>>
>>> What objects do I call, or what classes do I have to create, to make this Drill TupleMetadata available to Drill so it uses it in all the ways a static Drill schema can be useful?
>>>
>>> I just need pointers to the code that illustrate how to do this. Thanks.
>>>
>>> -Mike Beckerle
>>>
>>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
>>>
>>>> Mike,
>>>>
>>>> This is a complex question and has two answers.
>>>>
>>>> First, the standard enhanced vector framework (EVF) used by most readers assumes a "pull" model: read each record. This is where next() comes in: readers just implement this to read the next record. But the code under EVF works with a push model: the readers write to vectors and signal the next record. EVF translates the lower-level push model to the higher-level, easier-to-use pull model. The best example of this is the JSON reader, which uses Jackson to parse JSON and responds to the corresponding events.
>>>>
>>>> You can thus take over the task of filling a batch of records. I'd have to poke around the code to refresh my memory. Or, you can take a look at the (quite complex) JSON parser, or at EVF itself, to see what it does. There are many unit tests that show this at various levels of abstraction.
>>>>
>>>> Basically, you have to:
>>>>
>>>> * Start a batch.
>>>> * Ask if you can start the next record (which might be declined if the batch is full).
>>>> * Write each field. For complex fields, such as records, recursively do the start/end record work.
>>>> * Mark the record as complete.
>>>>
>>>> You should be able to map event handlers to EVF actions as a result. Even though DFDL wants to "drive", it still has to give up control once the batch is full. EVF will then handle the (surprisingly complex) task of finishing up the batch and returning it as the output of the Scan operator.
>>>>
>>>> - Paul
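For concreteness, those steps map onto EVF's RowSetLoader roughly as in the sketch below. Only the RowSetLoader calls reflect the real Drill writer API; DfdlRecordSource and its methods are hypothetical stand-ins for the Daffodil side of the bridge:

import org.apache.drill.exec.physical.resultSet.RowSetLoader;

// Sketch only: how a DFDL batch reader's next() might fill one batch of records.
public class DfdlBatchFillSketch {

  /** Hypothetical wrapper around a Daffodil parse that yields one record at a time. */
  interface DfdlRecordSource {
    boolean hasNextRecord();
    int nextMessageId();   // pretend each record carries a single int field
  }

  private final RowSetLoader rowWriter;   // obtained from the negotiator's ResultSetLoader
  private final DfdlRecordSource source;

  DfdlBatchFillSketch(RowSetLoader rowWriter, DfdlRecordSource source) {
    this.rowWriter = rowWriter;
    this.source = source;
  }

  /** Called once per output batch; returns false when the input is exhausted. */
  public boolean next() {
    while (!rowWriter.isFull()) {                // "ask if you can start the next record"
      if (!source.hasNextRecord()) {
        return false;                            // end of input: deliver the final, possibly partial, batch
      }
      rowWriter.start();                         // start the record
      rowWriter.scalar("messageId").setInt(source.nextMessageId());  // write each field
      rowWriter.save();                          // mark the record as complete
    }
    return true;                                 // batch is full; EVF finishes it and hands it to the Scan operator
  }
}

In other words, Daffodil can keep pushing events into the writers inside next(); the batch-full check is the point where control goes back to Drill.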
>>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org> wrote:
>>>>
>>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is analogous to a SAX event handler.
>>>>>
>>>>> Drill is expecting an iterator style of calling next() to advance through the input, i.e., Drill has the control thread and expects to do pull parsing. At least, that's what I saw in the code I studied in the format-xml contrib.
>>>>>
>>>>> Is there any alternative? Before I dig into creating another one of these co-routine-style control inversions (which have proven to be problematic for performance.