Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Charles Givre Thu, 12 Oct 2023 14:15:13 -0700

Hi Mike,
I hope all is well.  I'll take a stab at answering your questions.  But I have 
a few questions as well:

1.  Are you writing a storage or format plugin for DFDL?  My thinking was that 
this would be a format plugin, but let me know if you were thinking differently.
2.  In traditional deployments, where do people store the DFDL schemata files?  
Are they local or accessible via URL?

To get the DFDL schema file or URL we have a few options, all of which revolve 
around setting a config variable.  For now, let's just say that the schema file 
is contained in the same folder as the data.  (We can make this more 
sophisticated later...)
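
To make the "same folder as the data" assumption concrete, here is a minimal sketch of locating a schema file next to a data file. The class and method names are hypothetical, and java.nio.file.Path is used only as a stand-in so the example compiles on its own; inside Drill you would be working with the file system and path types the plugin is handed.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Drill: finds a schema file that sits
// in the same directory as the data file being read.
final class SiblingSchemaLocator {
  static Path schemaNextTo(Path dataFile, String schemaFileName) {
    Path dir = dataFile.getParent();
    // A bare file name has no parent; fall back to a relative path.
    return dir == null ? Paths.get(schemaFileName) : dir.resolve(schemaFileName);
  }
}
```

For example, a data file at /data/pcap/capture.pcap with a schema name of pcap.dfdl.xsd would resolve to /data/pcap/pcap.dfdl.xsd.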

Here's what you have to do.

1.  In the formatConfig file, define a String called 'dfdlSchema'.  Note that 
config variables must be private and final.  If they aren't, it can cause weird 
errors that are really difficult to debug.  For reference, take a look at 
the Excel plugin.  
(https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)
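
As a rough illustration of the "private and final" point, here is a minimal, dependency-free sketch of such a config class. The class name DfdlFormatConfigSketch is hypothetical. A real Drill format config would also implement FormatPluginConfig and carry Jackson annotations, as the Excel config linked above does; those are left out here so the example compiles on its own.

```java
import java.util.Objects;

// Hypothetical sketch only; not the actual plugin code.
final class DfdlFormatConfigSketch {
  // Private and final: set once in the constructor and never changed.
  private final String dfdlSchema;

  DfdlFormatConfigSketch(String dfdlSchema) {
    this.dfdlSchema = dfdlSchema;
  }

  String getDfdlSchema() {
    return dfdlSchema;
  }

  // Value-based equality, so two configs naming the same schema compare equal.
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof DfdlFormatConfigSketch)) {
      return false;
    }
    return Objects.equals(dfdlSchema, ((DfdlFormatConfigSketch) o).dfdlSchema);
  }

  @Override
  public int hashCode() {
    return Objects.hash(dfdlSchema);
  }
}
```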

Setting a config variable there will allow a user to set a global schema 
definition.  This can also be configured individually for various workspaces.  
For example, if you had PCAP files in one workspace, you could set the DFDL 
schema for that workspace globally.  For another workspace holding some other 
kind of file, you could create another DFDL plugin instance with its own schema. 

Now, this is all fine and good, but a user might also want to define the schema 
file at query time.  The good news is that Drill allows you to do that via the 
table() function. 

So let's say that we want to use a different schema file than the default, we 
could do something like this:

SELECT ....
FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', 
dfdlSchema=>'path_to_schema'))

Take a look at the Excel docs 
(https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) 
which demonstrate how to write queries like that.  I believe that the 
parameters in the table function take higher precedence than the parameters 
from the config.  That would make sense at least.
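
The precedence I have in mind can be sketched as follows. This only illustrates the rule (table function value first, config value as the fallback); it is not how Drill implements it internally, and the class and method names are hypothetical.

```java
// Hypothetical illustration of the assumed precedence, not Drill code.
final class SchemaPathResolver {
  // Returns the query-time value when one was supplied,
  // otherwise the default from the plugin config.
  static String effectiveSchema(String configValue, String tableFunctionValue) {
    if (tableFunctionValue != null && !tableFunctionValue.isEmpty()) {
      return tableFunctionValue;
    }
    return configValue;
  }
}
```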

2.  Now that we have the schema file, the next thing would be to convert that 
into a Drill schema.  Let's say that we have a function called dfdlToDrill that 
handles the conversion.

In the constructor for the BatchReader, you'd set the schema.  In pseudocode:

public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan,
                       FileSchemaNegotiator negotiator) {
        // Other stuff...

        // Get Drill schema from DFDL
        TupleMetadata schema = dfdlToDrill(<dfdl schema file>);

        // Here's the important part: hand the schema to the negotiator,
        // with true marking it as final
        negotiator.tableSchema(schema, true);
}

The negotiator.tableSchema() accepts two args, a TupleMetadata and a boolean as 
to whether the schema is final or not.  Once this schema has been added to the 
negotiator object, you can then create the writers. 

Take a look here... 

https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199

I see Paul just responded so I'll leave you with this.  If you have additional 
questions, send them our way.  Do take a look at the Excel plugin as I think it 
will be helpful.

Best,
--C

> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <[email protected]> wrote:
> 
> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
> 
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
> 
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
> 
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
> 
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
> 
> I just need pointers to the code that illustrate how to do this. Thanks
> 
> -Mike Beckerle
> 
> 
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <[email protected]> wrote:
> 
>> Mike,
>> 
>> This is a complex question and has two answers.
>> 
>> First, the standard enhanced vector framework (EVF) used by most readers
>> assumes a "pull" model: read each record. This is where the next() comes
>> in: readers just implement this to read the next record. But, the code
>> under EVF works with a push model: the readers write to vectors, and signal
>> the next record. EVF translates the lower-level push model to the
>> higher-level, easier-to-use pull model. The best example of this is the
>> JSON reader which uses Jackson to parse JSON and responds to the
>> corresponding events.
>> 
>> You can thus take over the task of filling a batch of records. I'd have to
>> poke around the code to refresh my memory. Or, you can take a look at the
>> (quite complex) JSON parser, or the EVF itself to see what it does. There
>> are many unit tests that show this at various levels of abstraction.
>> 
>> Basically, you have to:
>> 
>> * Start a batch
>> * Ask if you can start the next record (which might be declined if the
>> batch is full)
>> * Write each field. For complex fields, such as records, recursively do the
>> start/end record work.
>> * Mark the record as complete.
>> 
>> You should be able to map event handlers to EVF actions as a result. Even
>> though DFDL wants to "drive", it still has to give up control once the
>> batch is full. EVF will then handle the (surprisingly complex) task of
>> finishing up the batch and returning it as the output of the Scan operator.
>> 
>> - Paul
>> 
>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <[email protected]>
>> wrote:
>> 
>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is
>>> analogous to a SAX event handler.
>>> 
>>> Drill is expecting an iterator style of calling next() to advance through
>>> the input, i.e., Drill has the control thread and expects to do pull
>>> parsing. At least from the code I studied in the format-xml contrib.
>>> 
>>> Is there any alternative? I'd like to know before I dig into creating another
>>> one of these co-routine-style control inversions (which have proven to be
>>> problematic for performance).
>>> 
>>
