Drill expects pull parsing? Daffodil is event callbacks style

2023-10-11 Thread Mike Beckerle
Daffodil parsing generates event callbacks to an InfosetOutputter, which is
analogous to a SAX event handler.

Drill is expecting an iterator style of calling next() to advance through
the input, i.e., Drill has the control thread and expects to do pull
parsing, at least judging from the code I studied in the format-xml contrib.

Is there any alternative? I'd like to know before I dig into creating
another one of these co-routine-style control inversions (which have proven
to be problematic for performance).
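
For readers unfamiliar with the term, a co-routine-style control inversion
can be sketched roughly as below. This is only an illustration: the class
name PushToPullIterator and its constructor are hypothetical and belong to
neither Drill nor Daffodil. The push-style producer runs on its own thread
and hands each event to the pulling side through a SynchronousQueue, so
every event costs a thread handoff, which is one plausible source of the
performance problems mentioned above.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.SynchronousQueue;
import java.util.function.Consumer;

// Hypothetical sketch: exposes a push-style producer as a pull-style Iterator
// by running the producer on a separate thread. Not Drill or Daffodil code.
final class PushToPullIterator<T> implements Iterator<T> {
    private static final Object END = new Object();  // end-of-input marker
    private final SynchronousQueue<Object> handoff = new SynchronousQueue<>();
    private Object pending;                           // item fetched by hasNext()

    // 'producer' is given a callback and invokes it once per event (non-null).
    PushToPullIterator(Consumer<Consumer<T>> producer) {
        Thread t = new Thread(() -> {
            try {
                producer.accept(event -> put(event));
            } finally {
                put(END);                             // always signal completion
            }
        });
        t.setDaemon(true);
        t.start();
    }

    // Producer side: blocks until the consumer is ready to take the item.
    private void put(Object item) {
        try {
            handoff.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Consumer side: blocks until the producer offers an item.
    private Object take() {
        try {
            return handoff.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            pending = take();
        }
        return pending != END;
    }

    @SuppressWarnings("unchecked")
    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) pending;
        pending = null;
        return item;
    }
}
```

The caller simply loops with hasNext() and next(), while the producer
believes it is driving. Both sides block on each other for every event.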


Re: Drill expects pull parsing? Daffodil is event callbacks style

2023-10-11 Thread Paul Rogers
Mike,

This is a complex question and has two answers.

First, the standard enhanced vector framework (EVF) used by most readers
assumes a "pull" model: read each record. This is where next() comes in:
readers just implement it to read the next record. But the code under EVF
works with a push model: the readers write to vectors and signal the next
record. EVF translates the lower-level push model to the higher-level,
easier-to-use pull model. The best example of this is the JSON reader,
which uses Jackson to parse JSON and responds to the corresponding events.

You can thus take over the task of filling a batch of records. I'd have to
poke around the code to refresh my memory. Or, you can take a look at the
(quite complex) JSON parser, or the EVF itself to see what it does. There
are many unit tests that show this at various levels of abstraction.

Basically, you have to:

* Start a batch
* Ask if you can start the next record (which might be declined if the
batch is full)
* Write each field. For complex fields, such as records, recursively do the
start/end record work.
* Mark the record as complete.
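
The steps above can be sketched roughly as follows. SketchBatch and its
methods are made-up names for illustration, not the real EVF classes, and a
"vector" is reduced here to a list of maps. Nested records are left out to
keep the sketch short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a batch writer; the real EVF API differs.
final class SketchBatch {
    private final int capacity;
    private final List<Map<String, Object>> rows = new ArrayList<>();
    private Map<String, Object> current;   // record being written, if any

    // Step 1: start a batch with room for 'capacity' records.
    SketchBatch(int capacity) {
        this.capacity = capacity;
    }

    // Step 2: ask to start the next record; declined when the batch is full.
    boolean startRecord() {
        if (rows.size() >= capacity) {
            return false;
        }
        current = new LinkedHashMap<>();
        return true;
    }

    // Step 3: write one field of the current record.
    void writeField(String name, Object value) {
        if (current == null) {
            throw new IllegalStateException("no record is open");
        }
        current.put(name, value);
    }

    // Step 4: mark the record as complete.
    void endRecord() {
        if (current == null) {
            throw new IllegalStateException("no record is open");
        }
        rows.add(current);
        current = null;
    }

    List<Map<String, Object>> rows() {
        return rows;
    }
}

final class SketchBatchDemo {
    // Fills one batch from 'input' starting at 'offset' and returns the
    // offset of the first record that did not fit (or input.size()).
    static int fillBatch(SketchBatch batch, List<String[]> input, int offset) {
        int i = offset;
        while (i < input.size() && batch.startRecord()) {
            String[] rec = input.get(i++);
            batch.writeField("id", rec[0]);
            batch.writeField("name", rec[1]);
            batch.endRecord();
        }
        return i;
    }

    public static void main(String[] args) {
        List<String[]> input = new ArrayList<>();
        input.add(new String[] {"1", "a"});
        input.add(new String[] {"2", "b"});
        input.add(new String[] {"3", "c"});
        SketchBatch batch = new SketchBatch(2);
        int next = fillBatch(batch, input, 0);
        System.out.println(batch.rows() + ", resume at " + next);
    }
}
```

The point of the sketch is the return value of startRecord(): the writer of
records keeps going until the batch says no, then hands the batch back and
resumes from the saved offset on the next call.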

You should be able to map event handlers to EVF actions as a result. Even
though DFDL wants to "drive", it still has to give up control once the
batch is full. EVF will then handle the (surprisingly complex) task of
finishing up the batch and returning it as the output of the Scan operator.
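
As a rough illustration of mapping event handlers to writer actions, the
sketch below turns start/end events into nested records using a stack. The
SketchEvents interface is hypothetical; its method names are only loosely
modeled on SAX-style callbacks and are not the actual Daffodil
InfosetOutputter signatures, and the row list stands in for EVF's writers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical SAX-like callbacks; the real InfosetOutputter API differs.
interface SketchEvents {
    void startComplex(String name);
    void simple(String name, Object value);
    void endComplex();
}

// Maps each event to a "writer" action. Completed top-level records are
// collected in 'rows'; nested records become fields of their parent.
final class EventsToRows implements SketchEvents {
    private final List<Map<String, Object>> rows = new ArrayList<>();
    private final Deque<Map<String, Object>> open = new ArrayDeque<>();

    @Override
    public void startComplex(String name) {
        Map<String, Object> record = new LinkedHashMap<>();
        if (!open.isEmpty()) {
            open.peek().put(name, record);   // start a nested record
        }
        open.push(record);                   // start (or descend into) a record
    }

    @Override
    public void simple(String name, Object value) {
        if (open.isEmpty()) {
            throw new IllegalStateException("field outside of any record");
        }
        open.peek().put(name, value);        // write one field
    }

    @Override
    public void endComplex() {
        Map<String, Object> done = open.pop();
        if (open.isEmpty()) {
            rows.add(done);                  // top-level record is complete
        }
    }

    List<Map<String, Object>> rows() {
        return rows;
    }
}
```

A parser would call startComplex("row"), then simple(...) for each field
and startComplex/endComplex around any nested record, and finally
endComplex() to close the row. Checking for a full batch is not shown here.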

- Paul



Re: Drill expects pull parsing? Daffodil is event callbacks style

2023-10-12 Thread Charles Givre
Mike,
I'll add to Paul's comments. While Drill is expecting an iterator-style
reader, that iterator pattern really only applies to batches. This concept
took me a while to wrap my head around, but in any of the batch reader
classes, it's important to remember that the next() method really applies
to the batch of records and not to the data source itself. In other words,
it isn't a line iterator. What that means in practice is that you can do
whatever you want in the next() method until the batch is full.
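
A minimal sketch of that idea, with made-up names (OneRecordParser and
SketchBatchReader are not Drill classes): next() is called once per batch
and, inside it, a push-style parser is invoked repeatedly until the batch
is full or the input runs out. The sketch assumes the parser can be asked
to push the fields of exactly one record per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical push-style parser: each call pushes the fields of exactly
// one record to the callback and returns false at end of input.
interface OneRecordParser {
    boolean parseOne(Consumer<String> onField);
}

// Hypothetical batch reader; the real Drill reader interfaces differ.
final class SketchBatchReader {
    private final OneRecordParser parser;
    private final int batchSize;
    private List<List<String>> batch = new ArrayList<>();

    SketchBatchReader(OneRecordParser parser, int batchSize) {
        this.parser = parser;
        this.batchSize = batchSize;
    }

    // Called once per batch, not once per record. Returns false when the
    // input is exhausted; the final batch may still hold some records.
    boolean next() {
        batch = new ArrayList<>();
        while (batch.size() < batchSize) {
            List<String> row = new ArrayList<>();
            if (!parser.parseOne(row::add)) {
                return false;            // no more input
            }
            batch.add(row);              // record complete
        }
        return true;                     // batch full, more input may remain
    }

    List<List<String>> batch() {
        return batch;
    }
}
```

Within one call to next() the parser drives via callbacks; control returns
to the caller only at batch boundaries, so no extra thread is needed.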
I hope this helps somewhat and doesn't add to the confusion.
Best,
-- C

