One more thought... As a suggestion, I'd recommend getting the batch reader working with the DFDL schema file first. Once that's done, Paul and I can assist with caching, metastores, etc. -- C
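The formatConfig step Charles describes in the quoted message below can be sketched as a plain class. This is a simplified, hypothetical stand-in, not Drill's actual API: a real format config extends FormatPluginConfig and carries Jackson annotations, as in the ExcelFormatConfig linked in the thread; the class and field names here are assumptions.

```java
import java.util.List;

// Simplified, hypothetical sketch of a format config holding a DFDL schema
// path. A real Drill format config extends FormatPluginConfig and uses
// Jackson's @JsonCreator/@JsonProperty; this toy keeps only the shape.
public class DfdlFormatConfig {
    // As noted in the thread: config variables must be private and final.
    private final String dfdlSchema;
    private final List<String> extensions;

    public DfdlFormatConfig(String dfdlSchema, List<String> extensions) {
        this.dfdlSchema = dfdlSchema;
        // Hypothetical default file extension for DFDL-described data.
        this.extensions = extensions == null ? List.of("dat") : extensions;
    }

    public String getDfdlSchema() { return dfdlSchema; }

    public List<String> getExtensions() { return extensions; }
}
```

The immutable private-final fields are the point: Drill serializes and compares config instances, and mutable config state is a known source of the hard-to-debug errors mentioned below.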
> On Oct 12, 2023, at 5:13 PM, Charles Givre <cgi...@gmail.com> wrote:
>
> Hi Mike,
> I hope all is well. I'll take a stab at answering your questions, but I have a few questions as well:
>
> 1. Are you writing a storage or format plugin for DFDL? My thinking was that this would be a format plugin, but let me know if you were thinking differently.
> 2. In traditional deployments, where do people store the DFDL schema files? Are they local or accessible via URL?
>
> To get the DFDL schema file or URL, we have a few options, all of which revolve around setting a config variable. For now, let's just say that the schema file is contained in the same folder as the data. (We can make this more sophisticated later...)
>
> Here's what you have to do.
>
> 1. In the formatConfig file, define a String called 'dfdlSchema'. Note: config variables must be private and final. If they aren't, it can cause weird errors that are really difficult to debug. For reference, take a look at the Excel plugin:
> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
>
> Setting a config variable there allows a user to set a global schema definition. This can also be configured individually for various workspaces. So if you had PCAP files in one workspace, you could globally set the DFDL file for it, and for another workspace containing some other file type, you could create a separate DFDL plugin instance.
>
> Now, this is all fine and good, but a user might also want to define the schema file at query time. The good news is that Drill allows you to do that via the table() function.
>
> So if we wanted to use a different schema file than the default, we could do something like this:
>
> SELECT ....
> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', dfdlSchema=>'path_to_schema'))
>
> Take a look at the Excel docs, which demonstrate how to write queries like that:
> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md
>
> I believe that parameters in the table() function take higher precedence than parameters from the config. That would make sense, at least.
>
> 2. Now that we have the schema file, the next thing is to convert it into a Drill schema. Let's say that we have a function called dfdlToDrill that handles the conversion.
>
> What you'd have to do is set the schema in the constructor for the BatchReader. So, in pseudocode:
>
> public DFDLBatchReader(DFDLReaderConfig config, EasySubScan scan, FileSchemaNegotiator negotiator) {
>   // Other stuff...
>
>   // Get the Drill schema from the DFDL schema
>   TupleMetadata schema = dfdlToDrill(<dfdl schema file>);
>
>   // Here's the important part
>   negotiator.tableSchema(schema, true);
> }
>
> negotiator.tableSchema() accepts two args: a TupleMetadata and a boolean indicating whether the schema is final or not. Once this schema has been added to the negotiator object, you can then create the writers.
>
> Take a look here:
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>
> I see Paul just responded, so I'll leave you with this. If you have additional questions, send them our way. Do take a look at the Excel plugin, as I think it will be helpful.
>
> Best,
> --C
>
>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>
>> So when a data format is described by a DFDL schema, I can generate an equivalent Drill schema (TupleMetadata). This schema is always complete. I have unit tests working with this.
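A dfdlToDrill-style conversion of the kind discussed above could look roughly like the following toy. Everything here is hypothetical: the real version would walk the compiled DFDL schema and build Drill's TupleMetadata (e.g. via SchemaBuilder), not a String-to-String map; the type mappings shown are illustrative assumptions only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of a dfdlToDrill-style conversion. Hypothetical throughout:
// a real implementation would walk the DFDL schema and emit Drill's
// TupleMetadata, not a name->type map of strings.
public class DfdlToDrillSketch {
    // Map a few DFDL/XSD simple types to Drill-like column type names.
    static String mapType(String xsdType) {
        switch (xsdType) {
            case "xs:int":
            case "xs:integer": return "INT";
            case "xs:long":    return "BIGINT";
            case "xs:string":  return "VARCHAR";
            case "xs:double":  return "FLOAT8";
            default:           return "VARBINARY"; // fallback for unmapped types
        }
    }

    // Build an ordered column-name -> type "schema" from (name, xsdType) pairs.
    static Map<String, String> dfdlToDrill(String[][] elements) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (String[] e : elements) {
            schema.put(e[0], mapType(e[1]));
        }
        return schema;
    }
}
```

The ordered map mirrors the fact that a Drill schema is an ordered tuple of named, typed columns; since a DFDL schema is complete, the resulting schema can be declared final when handed to the negotiator.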
>>
>> To do this for a real SQL query, I need the DFDL schema to be identified on the SQL query by a file path or URI.
>>
>> Q: How do I get that DFDL schema file/URI parameter from the SQL query?
>>
>> Next, assuming I have the DFDL schema identified, I generate an equivalent Drill TupleMetadata from it. (Or, hopefully, retrieve it from a cache.)
>>
>> What objects do I call, or what classes do I have to create, to make this Drill TupleMetadata available to Drill so that it uses it in all the ways a static Drill schema can be useful?
>>
>> I just need pointers to the code that illustrate how to do this. Thanks.
>>
>> -Mike Beckerle
>>
>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
>>
>>> Mike,
>>>
>>> This is a complex question and has two answers.
>>>
>>> First, the standard enhanced vector framework (EVF) used by most readers assumes a "pull" model: read each record. This is where next() comes in: readers just implement it to read the next record. But the code under EVF works with a push model: the readers write to vectors and signal the next record. EVF translates the lower-level push model into the higher-level, easier-to-use pull model. The best example of this is the JSON reader, which uses Jackson to parse JSON and responds to the corresponding events.
>>>
>>> You can thus take over the task of filling a batch of records. I'd have to poke around the code to refresh my memory. Or, you can take a look at the (quite complex) JSON parser, or at EVF itself, to see what it does. There are many unit tests that show this at various levels of abstraction.
>>>
>>> Basically, you have to:
>>>
>>> * Start a batch
>>> * Ask if you can start the next record (which might be declined if the batch is full)
>>> * Write each field; for complex fields, such as records, recursively do the start/end record work
>>> * Mark the record as complete
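The four steps Paul lists can be sketched as a loop. This is a self-contained toy, not the real EVF API: RowBatch stands in for EVF's row-set loader, whose start/save calls play the "ask if a record may start" and "mark complete" roles; all names are assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of the batch-filling loop described above. RowBatch is a
// hypothetical stand-in for EVF's row-set loader; start() and save()
// correspond to "ask if a record may start" and "mark the record complete".
public class BatchLoopSketch {
    static class RowBatch {
        final int capacity;
        final List<List<Object>> rows = new ArrayList<>();
        RowBatch(int capacity) { this.capacity = capacity; }
        boolean start() { return rows.size() < capacity; } // may decline when full
        void save(List<Object> row) { rows.add(row); }     // record complete
    }

    // Step 1: start a batch, then pull records until the source is
    // exhausted or the batch declines another record.
    static RowBatch fillBatch(Iterator<List<Object>> records, int capacity) {
        RowBatch batch = new RowBatch(capacity);
        while (records.hasNext() && batch.start()) {
            // Step 3 (writing each field) is folded into saving the whole row.
            batch.save(records.next());
        }
        return batch;
    }
}
```

The key property is that the loop stops when either side is done: the source running dry ends the scan, while a full batch merely pauses it, leaving the iterator positioned for the next batch.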
>>>
>>> You should be able to map event handlers to EVF actions as a result. Even though DFDL wants to "drive", it still has to give up control once the batch is full. EVF will then handle the (surprisingly complex) task of finishing up the batch and returning it as the output of the Scan operator.
>>>
>>> - Paul
>>>
>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org> wrote:
>>>
>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is analogous to a SAX event handler.
>>>>
>>>> Drill expects an iterator style of calling next() to advance through the input, i.e., Drill has the control thread and expects to do pull parsing. At least, that's what I saw in the code I studied in the format-xml contrib.
>>>>
>>>> Is there any alternative, before I dig into creating another one of these coroutine-style control inversions (which have proven to be problematic for performance)?
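The push-side bridge Paul alludes to ("DFDL wants to drive, but must give up control when the batch is full") can be sketched as follows. This is a hypothetical toy, not Daffodil's InfosetOutputter or Drill's EVF API: the sink's record callback returns false when the batch fills, and the parser honors that signal and reports where to resume.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the "push" side: an event sink in the spirit of an
// InfosetOutputter, except that each record callback returns false when
// the current batch is full so the parser yields control. All names are
// hypothetical; the real Daffodil and Drill APIs differ.
public class PushAdapterSketch {
    interface RecordSink {
        // Returns false when the batch is full and the caller must stop.
        boolean onRecord(List<Object> fields);
    }

    static class BatchSink implements RecordSink {
        final int capacity;
        final List<List<Object>> batch = new ArrayList<>();
        BatchSink(int capacity) { this.capacity = capacity; }
        public boolean onRecord(List<Object> fields) {
            batch.add(fields);
            return batch.size() < capacity; // "keep going" vs. "batch full"
        }
    }

    // A parser driving the sink (push), honoring the full signal. Returns
    // the index of the next unparsed record so parsing can resume later.
    static int parse(List<List<Object>> records, int from, RecordSink sink) {
        for (int i = from; i < records.size(); i++) {
            if (!sink.onRecord(records.get(i))) {
                return i + 1; // batch filled; resume here next time
            }
        }
        return records.size();
    }
}
```

The resume index is the cheap alternative to a coroutine-style control inversion: instead of suspending the parser's stack, the adapter records where parsing left off and re-enters on the next batch.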