Yup, I found a Drill class which works similarly to what we need, but for
JSON.

It is in class JsonLoaderImpl, in the method boolean readBatch().

It fills in a batch of rows (for efficiency), and returns true if more data
may be available, or false at EOF.

The loop is roughly:

    until EOF, or no more Drill rows are available to populate:
        get a Drill rowWriter to populate
        outputter.reset()
        parse - which calls methods of the outputter that populate
            the Drill rowWriter

There is no getResult() on the outputter. parse() calls the rowWriter
directly, so when parse() returns, the row has been completely filled in
and we can move on to the next row, if one is available to us.
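To make that control flow concrete, here is a minimal, self-contained Java sketch of the loop. None of these names (RowWriter, Outputter, Parser, readBatch) are the real Drill or Daffodil APIs; they are stand-ins that only illustrate how parse() writes through the current rowWriter, with no getResult step:

```java
// Hypothetical sketch of the readBatch-style loop described above.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BatchLoopSketch {

    // Stand-in for Drill's row writer: collects column values for one row.
    static class RowWriter {
        final List<String> columns = new ArrayList<>();
        void setColumn(String value) { columns.add(value); }
    }

    // Stand-in for the InfosetOutputter: parse events write straight into
    // the current rowWriter; there is no getResult().
    static class Outputter {
        RowWriter rowWriter;
        void reset(RowWriter writer) { this.rowWriter = writer; }
        void onField(String value) { rowWriter.setColumn(value); }
    }

    // Stand-in parser: each parse() call consumes one record from the
    // input and pushes its fields to the outputter.
    static class Parser {
        final Iterator<String[]> input;
        Parser(Iterator<String[]> input) { this.input = input; }
        boolean atEof() { return !input.hasNext(); }
        void parse(Outputter out) {
            for (String field : input.next()) {
                out.onField(field);
            }
        }
    }

    // The loop from the text: until EOF or the batch is full, get a
    // rowWriter, reset the outputter, and let parse() fill the row.
    static List<RowWriter> readBatch(Parser parser, Outputter out, int maxRows) {
        List<RowWriter> batch = new ArrayList<>();
        while (!parser.atEof() && batch.size() < maxRows) {
            RowWriter row = new RowWriter();
            out.reset(row);
            parser.parse(out);
            batch.add(row);
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String[]> data = List.of(
            new String[] {"a", "b"},
            new String[] {"c", "d"},
            new String[] {"e", "f"});
        Parser parser = new Parser(data.iterator());
        Outputter out = new Outputter();
        List<RowWriter> batch = readBatch(parser, out, 2); // batch limit 2
        System.out.println(batch.size());   // prints 2
        System.out.println(!parser.atEof()); // prints true: more data remains
    }
}
```

The point is that control stays with readBatch per batch, but with parse per row: the outputter is just a bridge between parse events and the current rowWriter.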




On Thu, Oct 12, 2023 at 10:35 AM Steve Lawrence <slawre...@apache.org>
wrote:

> It sounds like we want to implement something very similar to what the
> Daffodil CLI calls "streaming mode". Something along the lines of this
> (making guesses about what the drill API looks like based on reading this):
>
>    val input = new InputSourceDataInputStream(inputStream)
>
>    def hasNext(): Boolean = input.hasData()
>
>    def next(): DrillRecord = {
>      val output = new DrillRecordInfosetOutputter()
>      val pr = dataProcessor.parse(input, output)
>      output.getDrillRecord
>    }
>
> So next() calls parse() once and creates a single infoset that is
> projected into a Drill data structure, with each successful next/parse
> call representing a single Drill record.
>
> Depending on the schema and the input data, it is possible the entire
> file could be projected into a single record. This is likely the
> common case since DFDL schemas are generally written to consume the
> whole file, but that isn't always the case.
>
>
> On 2023-10-12 09:48 AM, Charles Givre wrote:
> > Mike,
> > I'll add to Paul's comments.  While Drill is expecting an iterator-style
> > reader, that iterator pattern really only applies to batches.  This
> > concept took me a while to wrap my head around, but in any of the batch
> > reader classes, it's important to remember that the next() method really
> > applies to the batch of records, not to the data source itself.  In
> > other words, it isn't a line iterator.  What that means in practice is
> > that you can do whatever you want in the next() method until the batch
> > is full.
> > I hope this helps somewhat and doesn't add to the confusion.
> > Best,
> > -- C
> >
> >
> >> On Oct 12, 2023, at 12:13 AM, Paul Rogers <par0...@gmail.com> wrote:
> >>
> >> Mike,
> >>
> >> This is a complex question and has two answers.
> >>
> >> First, the standard enhanced vector framework (EVF) used by most readers
> >> assumes a "pull" model: read each record. This is where the next() comes
> >> in: readers just implement this to read the next record. But, the code
> >> under EVF works with a push model: the readers write to vectors, and
> >> signal the next record. EVF translates the lower-level push model to the
> >> higher-level, easier-to-use pull model. The best example of this is the
> >> JSON reader which uses Jackson to parse JSON and responds to the
> >> corresponding events.
> >>
> >> You can thus take over the task of filling a batch of records. I'd have
> >> to poke around the code to refresh my memory. Or, you can take a look
> >> at the (quite complex) JSON parser, or the EVF itself to see what it
> >> does. There are many unit tests that show this at various levels of
> >> abstraction.
> >>
> >> Basically, you have to:
> >>
> >> * Start a batch
> >> * Ask if you can start the next record (which might be declined if the
> >> batch is full)
> >> * Write each field. For complex fields, such as records, recursively
> >>   do the start/end record work.
> >> * Mark the record as complete.
> >>
> >> You should be able to map event handlers to EVF actions as a result.
> >> Even though DFDL wants to "drive", it still has to give up control once
> >> the batch is full. EVF will then handle the (surprisingly complex) task
> >> of finishing up the batch and returning it as the output of the Scan
> >> operator.
> >>
> >> - Paul
> >>
> >> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org>
> wrote:
> >>
> >>> Daffodil parsing generates event callbacks to an InfosetOutputter,
> >>> which is analogous to a SAX event handler.
> >>>
> >>> Drill is expecting an iterator style of calling next() to advance
> >>> through the input, i.e., Drill has the control thread and expects to
> >>> do pull parsing, at least judging from the code I studied in the
> >>> format-xml contrib.
> >>>
> >>> Is there any alternative, before I dig into creating another one of
> >>> these co-routine-style control inversions (which have proven to be
> >>> problematic for performance)?
> >>>
> >
>
>
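A footnote on Steve's streaming-mode sketch quoted above: it can be rendered in Java along the lines below. Every name here (Input, Outputter, records, the "|" record delimiter) is a hypothetical stand-in for the Daffodil/Drill types, just to show the one-parse()-call-per-next() control flow:

```java
// Hypothetical sketch of "streaming mode": each next() runs one parse()
// over the shared input and returns the record the outputter accumulated.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class StreamingModeSketch {

    // Stand-in for InputSourceDataInputStream: a token stream where "|"
    // marks the end of one record.
    static class Input {
        final Deque<String> tokens;
        Input(List<String> t) { this.tokens = new ArrayDeque<>(t); }
        boolean hasData() { return !tokens.isEmpty(); }
    }

    // Stand-in for a DrillRecordInfosetOutputter: accumulates one record,
    // then hands it over via getDrillRecord().
    static class Outputter {
        final List<String> record = new ArrayList<>();
        void onField(String v) { record.add(v); }
        List<String> getDrillRecord() { return record; }
    }

    // Stand-in for dataProcessor.parse(): consumes tokens up to the record
    // delimiter, pushing each field to the outputter.
    static void parse(Input input, Outputter out) {
        while (input.hasData()) {
            String tok = input.tokens.poll();
            if (tok.equals("|")) break; // end of this record
            out.onField(tok);
        }
    }

    // The iterator Steve sketches: hasNext() mirrors input.hasData(), and
    // each next() is exactly one parse() call producing one record.
    static Iterator<List<String>> records(Input input) {
        return new Iterator<>() {
            public boolean hasNext() { return input.hasData(); }
            public List<String> next() {
                Outputter out = new Outputter(); // fresh outputter per record
                parse(input, out);
                return out.getDrillRecord();
            }
        };
    }

    public static void main(String[] args) {
        Input input = new Input(List.of("a", "b", "|", "c", "|"));
        Iterator<List<String>> it = records(input);
        while (it.hasNext()) {
            System.out.println(it.next()); // prints [a, b] then [c]
        }
    }
}
```

Note the contrast with the readBatch approach above: here a fresh outputter accumulates one record per parse() call and hands it over afterwards, rather than writing through a rowWriter during the parse.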
