Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Mike Beckerle Thu, 12 Oct 2023 11:58:26 -0700

So when a data format is described by a DFDL schema, I can generate an
equivalent Drill schema (TupleMetadata). This schema is always complete. I
have unit tests working with this.


To do this for a real SQL query, I need the DFDL schema to be identified in
the SQL query by a file path or URI.

Q: How do I get that DFDL schema File/URI parameter from the SQL query?

Next, assuming I have the DFDL schema identified, I generate an equivalent
Drill TupleMetadata from it (or, hopefully, retrieve it from a cache).
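
A minimal sketch of the caching idea mentioned above, assuming the schema is keyed by its URI. `SchemaCacheSketch` and its `translator` function are hypothetical names for illustration and are not part of Drill or Daffodil; the type parameter `S` stands in for the generated schema type (for example, Drill's TupleMetadata).

```java
import java.net.URI;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper, not part of Drill or Daffodil.
// S stands in for the generated schema type (for example, Drill's TupleMetadata).
class SchemaCacheSketch<S> {
  private final ConcurrentHashMap<URI, S> cache = new ConcurrentHashMap<>();
  private final Function<URI, S> translator;

  // translator is an assumed function that turns a DFDL schema URI into a schema object.
  SchemaCacheSketch(Function<URI, S> translator) {
    this.translator = translator;
  }

  // Returns the cached schema for this URI, running the translator on first use only.
  S get(URI dfdlSchemaUri) {
    return cache.computeIfAbsent(dfdlSchemaUri, translator);
  }
}
```

With this shape, repeated queries that name the same DFDL schema URI would reuse one translated schema instead of regenerating it each time. Invalidation when the schema file changes is not handled here.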

What objects do I call, or what classes do I have to create to make this
Drill TupleMetadata available to Drill so it uses it in all the ways a
static Drill schema can be useful?

I just need pointers to code that illustrates how to do this. Thanks.

-Mike Beckerle

On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:

> Mike,
>
> This is a complex question and has two answers.
>
> First, the standard enhanced vector framework (EVF) used by most readers
> assumes a "pull" model: read each record. This is where the next() comes
> in: readers just implement this to read the next record. But, the code
> under EVF works with a push model: the readers write to vectors, and signal
> the next record. EVF translates the lower-level push model to the
> higher-level, easier-to-use pull model. The best example of this is the
> JSON reader which uses Jackson to parse JSON and responds to the
> corresponding events.
>
> You can thus take over the task of filling a batch of records. I'd have to
> poke around the code to refresh my memory. Or, you can take a look at the
> (quite complex) JSON parser, or the EVF itself to see what it does. There
> are many unit tests that show this at various levels of abstraction.
>
> Basically, you have to:
>
> * Start a batch
> * Ask if you can start the next record (which might be declined if the
> batch is full)
> * Write each field. For complex fields, such as records, recursively do the
> start/end record work.
> * Mark the record as complete.
>
> You should be able to map event handlers to EVF actions as a result. Even
> though DFDL wants to "drive", it still has to give up control once the
> batch is full. EVF will then handle the (surprisingly complex) task of
> finishing up the batch and returning it as the output of the Scan operator.
>
> - Paul
>
> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org>
> wrote:
>
> > Daffodil parsing generates event callbacks to an InfosetOutputter, which
> > is analogous to a SAX event handler.
> >
> > Drill is expecting an iterator style of calling next() to advance through
> > the input, i.e., Drill has the control thread and expects to do pull
> > parsing. At least from the code I studied in the format-xml contrib.
> >
> > Is there any alternative? I'd like to know before I dig into creating
> > another one of these co-routine-style control inversions (which have
> > proven to be problematic for performance).
> >
>
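
The batch-filling steps listed in Paul's quoted reply (start a batch, ask whether a record may be started, write each field, mark the record complete) can be sketched roughly as follows. This is an illustration only: `RecordEvents`, `RowBatch`, and `pushRecords` are hypothetical stand-ins and are not EVF or Daffodil classes. The sketch handles flat records only, and an `Iterator` of maps stands in for the parser's input.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; none of these are Drill or Daffodil types.
class BatchFillSketch {

  // Callback interface playing the role of a SAX-like event handler.
  interface RecordEvents {
    boolean startRecord();                  // false means the batch is full
    void field(String name, Object value);
    void endRecord();
  }

  // A fixed-capacity batch that accepts rows through the callbacks.
  static final class RowBatch implements RecordEvents {
    private final int capacity;
    private final List<Map<String, Object>> rows = new ArrayList<>();
    private Map<String, Object> current;

    RowBatch(int capacity) { this.capacity = capacity; }

    @Override public boolean startRecord() {
      if (rows.size() >= capacity) {
        return false;                       // decline: caller must give up control
      }
      current = new LinkedHashMap<>();
      return true;
    }

    @Override public void field(String name, Object value) {
      current.put(name, value);
    }

    @Override public void endRecord() {
      rows.add(current);                    // mark the record as complete
      current = null;
    }

    List<Map<String, Object>> rows() { return rows; }
  }

  // Push-style driver: feeds records to the handler until the input ends or the
  // handler declines a new record. Returns the number of records delivered.
  static int pushRecords(Iterator<Map<String, Object>> input, RecordEvents handler) {
    int delivered = 0;
    while (input.hasNext()) {
      if (!handler.startRecord()) {
        break;                              // batch full: stop without consuming input
      }
      for (Map.Entry<String, Object> e : input.next().entrySet()) {
        handler.field(e.getKey(), e.getValue());
      }
      handler.endRecord();
      delivered++;
    }
    return delivered;
  }
}
```

Because the driver checks `startRecord()` before consuming the next input record, a declined record is not lost; calling `pushRecords` again with a fresh `RowBatch` resumes where the previous batch stopped.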
