One more thought... As a suggestion, I'd recommend getting the batch reader working with the DFDL schema file first. Once that's done, Paul and I can assist with caching, metastores, etc. -- C
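The formatConfig step Charles describes in the quoted message below can be sketched as a plain class. This is a simplified, hypothetical stand-in, not Drill's actual API: a real format config extends FormatPluginConfig and carries Jackson annotations, as in the ExcelFormatConfig linked in the thread; the class and field names here are assumptions.

```java
import java.util.List;

// Simplified, hypothetical sketch of a format config holding a DFDL schema
// path. A real Drill format config extends FormatPluginConfig and uses
// Jackson's @JsonCreator/@JsonProperty; this toy keeps only the shape.
public class DfdlFormatConfig {
    // As noted in the thread: config variables must be private and final.
    private final String dfdlSchema;
    private final List<String> extensions;

    public DfdlFormatConfig(String dfdlSchema, List<String> extensions) {
        this.dfdlSchema = dfdlSchema;
        // Hypothetical default file extension for DFDL-described data.
        this.extensions = extensions == null ? List.of("dat") : extensions;
    }

    public String getDfdlSchema() { return dfdlSchema; }

    public List<String> getExtensions() { return extensions; }
}
```

The immutable private-final fields are the point: Drill serializes and compares config instances, and mutable config state is a known source of the hard-to-debug errors mentioned below.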
> On Oct 12, 2023, at 5:13 PM, Charles Givre <cgi...@gmail.com> wrote:
>
> Hi Mike,
> I hope all is well. I'll take a stab at answering your questions, but I have a few questions as well:
>
> 1. Are you writing a storage or format plugin for DFDL? My thinking was that this would be a format plugin, but let me know if you were thinking differently.
> 2. In traditional deployments, where do people store the DFDL schema files? Are they local or accessible via URL?
>
> To get the DFDL schema file or URL, we have a few options, all of which revolve around setting a config variable. For now, let's just say that the schema file is contained in the same folder as the data. (We can make this more sophisticated later...)
>
> Here's what you have to do.
>
> 1. In the formatConfig file, define a String called 'dfdlSchema'. Note: config variables must be private and final. If they aren't, it can cause weird errors that are really difficult to debug. For reference, take a look at the Excel plugin:
> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
>
> Setting a config variable there allows a user to set a global schema definition. This can also be configured individually for various workspaces. So if you had PCAP files in one workspace, you could globally set the DFDL file for it, and for another workspace containing some other file type, you could create a separate DFDL plugin instance.
>
> Now, this is all fine and good, but a user might also want to define the schema file at query time. The good news is that Drill allows you to do that via the table() function.
>
> So if we wanted to use a different schema file than the default, we could do something like this:
>
> SELECT ....
> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', dfdlSchema=>'path_to_schema'))
>
> Take a look at the Excel docs, which demonstrate how to write queries like that:
> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md
>
> I believe that parameters in the table() function take higher precedence than parameters from the config. That would make sense, at least.
>
> 2. Now that we have the schema file, the next thing is to convert it into a Drill schema. Let's say that we have a function called dfdlToDrill that handles the conversion.
>
> What you'd have to do is set the schema in the constructor for the BatchReader. So, in pseudocode:
>
> public DFDLBatchReader(DFDLReaderConfig config, EasySubScan scan, FileSchemaNegotiator negotiator) {
>   // Other stuff...
>
>   // Get the Drill schema from the DFDL schema
>   TupleMetadata schema = dfdlToDrill(<dfdl schema file>);
>
>   // Here's the important part
>   negotiator.tableSchema(schema, true);
> }
>
> negotiator.tableSchema() accepts two args: a TupleMetadata and a boolean indicating whether the schema is final or not. Once this schema has been added to the negotiator object, you can then create the writers.
>
> Take a look here:
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>
> I see Paul just responded, so I'll leave you with this. If you have additional questions, send them our way. Do take a look at the Excel plugin, as I think it will be helpful.
>
> Best,
> --C
>
>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>>
>> So when a data format is described by a DFDL schema, I can generate an equivalent Drill schema (TupleMetadata). This schema is always complete. I have unit tests working with this.
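A dfdlToDrill-style conversion of the kind discussed above could look roughly like the following toy. Everything here is hypothetical: the real version would walk the compiled DFDL schema and build Drill's TupleMetadata (e.g. via SchemaBuilder), not a String-to-String map; the type mappings shown are illustrative assumptions only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of a dfdlToDrill-style conversion. Hypothetical throughout:
// a real implementation would walk the DFDL schema and emit Drill's
// TupleMetadata, not a name->type map of strings.
public class DfdlToDrillSketch {
    // Map a few DFDL/XSD simple types to Drill-like column type names.
    static String mapType(String xsdType) {
        switch (xsdType) {
            case "xs:int":
            case "xs:integer": return "INT";
            case "xs:long":    return "BIGINT";
            case "xs:string":  return "VARCHAR";
            case "xs:double":  return "FLOAT8";
            default:           return "VARBINARY"; // fallback for unmapped types
        }
    }

    // Build an ordered column-name -> type "schema" from (name, xsdType) pairs.
    static Map<String, String> dfdlToDrill(String[][] elements) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (String[] e : elements) {
            schema.put(e[0], mapType(e[1]));
        }
        return schema;
    }
}
```

The ordered map mirrors the fact that a Drill schema is an ordered tuple of named, typed columns; since a DFDL schema is complete, the resulting schema can be declared final when handed to the negotiator.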
>>
>> To do this for a real SQL query, I need the DFDL schema to be identified on the SQL query by a file path or URI.
>>
>> Q: How do I get that DFDL schema file/URI parameter from the SQL query?
>>
>> Next, assuming I have the DFDL schema identified, I generate an equivalent Drill TupleMetadata from it. (Or, hopefully, retrieve it from a cache.)
>>
>> What objects do I call, or what classes do I have to create, to make this Drill TupleMetadata available to Drill so that it uses it in all the ways a static Drill schema can be useful?
>>
>> I just need pointers to the code that illustrate how to do this. Thanks.
>>
>> -Mike Beckerle
>>
>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
>>
>>> Mike,
>>>
>>> This is a complex question and has two answers.
>>>
>>> First, the standard enhanced vector framework (EVF) used by most readers assumes a "pull" model: read each record. This is where next() comes in: readers just implement it to read the next record. But the code under EVF works with a push model: the readers write to vectors and signal the next record. EVF translates the lower-level push model into the higher-level, easier-to-use pull model. The best example of this is the JSON reader, which uses Jackson to parse JSON and responds to the corresponding events.
>>>
>>> You can thus take over the task of filling a batch of records. I'd have to poke around the code to refresh my memory. Or, you can take a look at the (quite complex) JSON parser, or at EVF itself, to see what it does. There are many unit tests that show this at various levels of abstraction.
>>>
>>> Basically, you have to:
>>>
>>> * Start a batch
>>> * Ask if you can start the next record (which might be declined if the batch is full)
>>> * Write each field; for complex fields, such as records, recursively do the start/end record work
>>> * Mark the record as complete
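The four steps Paul lists can be sketched as a loop. This is a self-contained toy, not the real EVF API: RowBatch stands in for EVF's row-set loader, whose start/save calls play the "ask if a record may start" and "mark complete" roles; all names are assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of the batch-filling loop described above. RowBatch is a
// hypothetical stand-in for EVF's row-set loader; start() and save()
// correspond to "ask if a record may start" and "mark the record complete".
public class BatchLoopSketch {
    static class RowBatch {
        final int capacity;
        final List<List<Object>> rows = new ArrayList<>();
        RowBatch(int capacity) { this.capacity = capacity; }
        boolean start() { return rows.size() < capacity; } // may decline when full
        void save(List<Object> row) { rows.add(row); }     // record complete
    }

    // Step 1: start a batch, then pull records until the source is
    // exhausted or the batch declines another record.
    static RowBatch fillBatch(Iterator<List<Object>> records, int capacity) {
        RowBatch batch = new RowBatch(capacity);
        while (records.hasNext() && batch.start()) {
            // Step 3 (writing each field) is folded into saving the whole row.
            batch.save(records.next());
        }
        return batch;
    }
}
```

The key property is that the loop stops when either side is done: the source running dry ends the scan, while a full batch merely pauses it, leaving the iterator positioned for the next batch.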
>>>
>>> You should be able to map event handlers to EVF actions as a result. Even though DFDL wants to "drive", it still has to give up control once the batch is full. EVF will then handle the (surprisingly complex) task of finishing up the batch and returning it as the output of the Scan operator.
>>>
>>> - Paul
>>>
>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org> wrote:
>>>
>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is analogous to a SAX event handler.
>>>>
>>>> Drill expects an iterator style of calling next() to advance through the input, i.e., Drill has the control thread and expects to do pull parsing. At least, that's what I saw in the code I studied in the format-xml contrib.
>>>>
>>>> Is there any alternative, before I dig into creating another one of these coroutine-style control inversions (which have proven to be problematic for performance)?
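The push-side bridge Paul alludes to ("DFDL wants to drive, but must give up control when the batch is full") can be sketched as follows. This is a hypothetical toy, not Daffodil's InfosetOutputter or Drill's EVF API: the sink's record callback returns false when the batch fills, and the parser honors that signal and reports where to resume.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the "push" side: an event sink in the spirit of an
// InfosetOutputter, except that each record callback returns false when
// the current batch is full so the parser yields control. All names are
// hypothetical; the real Daffodil and Drill APIs differ.
public class PushAdapterSketch {
    interface RecordSink {
        // Returns false when the batch is full and the caller must stop.
        boolean onRecord(List<Object> fields);
    }

    static class BatchSink implements RecordSink {
        final int capacity;
        final List<List<Object>> batch = new ArrayList<>();
        BatchSink(int capacity) { this.capacity = capacity; }
        public boolean onRecord(List<Object> fields) {
            batch.add(fields);
            return batch.size() < capacity; // "keep going" vs. "batch full"
        }
    }

    // A parser driving the sink (push), honoring the full signal. Returns
    // the index of the next unparsed record so parsing can resume later.
    static int parse(List<List<Object>> records, int from, RecordSink sink) {
        for (int i = from; i < records.size(); i++) {
            if (!sink.onRecord(records.get(i))) {
                return i + 1; // batch filled; resume here next time
            }
        }
        return records.size();
    }
}
```

The resume index is the cheap alternative to a coroutine-style control inversion: instead of suspending the parser's stack, the adapter records where parsing left off and re-enters on the next batch.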