Mike,

Excellent progress! Very impressive.

So now you are talking about the planning side of things. There are
multiple ways this could be done. Let's start with some basics. Recall that
Drill is distributed: a file can be in S3 or old-school HDFS (along with
other variations). When Drill is run as intended, you'll have a cluster of
10, 20 or more Drillbits, all on distinct nodes or K8s pods, any of which
can be asked to plan the query, and so all of them need visibility into the
shared pool of metadata.

I would argue that the simplest solution from the user's perspective is for
Drill, not the user, to associate a Daffodil schema with a file. That is, I
set up the definition once, then Drill uses that for each of my dozens (or
hundreds) of queries against that file. The alternative is to remember to
include the schema information in every query, which will get old quickly.

The simplest option is to put a schema file in the same "directory"
(however that is defined in the target distributed file system) as the
data: something like "filename.dfdl" or ".schema.dfdl", depending on
whether the schema describes a single file or (more likely) a collection of
files. The planner simply looks for the schema file based on the file name
(filename.dfdl) or location (/path/to/data/.schema.dfdl). And there is a
precedent: this is how Drill finds its old-school Parquet metadata cache.
This works, but is clunky: mixing data (which is likely generated and
expired by a pipeline) with metadata (which changes slowly and is managed
by hand) is somewhat awkward from an operations perspective. (This was one
of the many issues with that first-gen Parquet metadata cache solution.)
Still, it is the simplest option to get going.
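
If you go this route, the plan-time lookup is just a few calls against the
Hadoop FileSystem API that Drill already uses for DFS access. A rough sketch
(class, method, and naming conventions are all made up, just to show the idea):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical plan-time helper for the "schema lives next to the data"
// convention. Nothing below exists in Drill today.
public class DfdlSchemaLocator {

  // Per-file schema:      /path/to/data/foo.dat -> /path/to/data/foo.dat.dfdl
  // Per-directory schema: /path/to/data/.schema.dfdl
  public static Path findSchemaFor(FileSystem fs, Path dataFile) throws IOException {
    Path perFile = new Path(dataFile.getParent(), dataFile.getName() + ".dfdl");
    if (fs.exists(perFile)) {
      return perFile;
    }
    Path perDir = new Path(dataFile.getParent(), ".schema.dfdl");
    if (fs.exists(perDir)) {
      return perDir;
    }
    return null;  // No schema found: fall back to schema-on-read, or fail.
  }
}

The nice part is that the Hadoop FileSystem abstraction hides whether
"directory" means HDFS, S3, or a local file system.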

A somewhat more useful way is to integrate schema info not with the file,
but with the storage plugin. The plugin could point to the location of the
schema for the files available from that plugin. This works because
Daffodil really only applies to file-like objects. Things like DBs or APIs
or Kafka streams have their own plugins that typically provide their own
schema. So associating schema with a plugin might make sense. Plugin A
defines our Daffodil-ready files, while plugin B is for ad-hoc data with no
schema. A new property on the plugin might provide the location where the
Daffodil schema files are stored. Matching could be done by name, file
path, file name pattern matching, or whatever. Basically, you'd be adding a
property to the DFS (distributed file system) storage plugin, along with
plan-time code to make use of the property.
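
In current Drill, the most natural home for such a property is probably the
format plugin config nested inside the DFS plugin (the "formats" section of
the plugin JSON), since that is also what table functions can override. A
minimal sketch, with invented class and property names:

import java.util.Objects;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;

import org.apache.drill.common.logical.FormatPluginConfig;

// Hypothetical Daffodil format config with one property pointing at the
// directory (or URI) where the DFDL schema files live.
@JsonTypeName("daffodil")
public class DaffodilFormatConfig implements FormatPluginConfig {

  private final String schemaURI;

  @JsonCreator
  public DaffodilFormatConfig(@JsonProperty("schemaURI") String schemaURI) {
    this.schemaURI = schemaURI;
  }

  @JsonProperty("schemaURI")
  public String getSchemaURI() {
    return schemaURI;
  }

  // Drill compares plugin configs by value, so equals/hashCode matter.
  @Override
  public int hashCode() {
    return Objects.hash(schemaURI);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) { return true; }
    if (o == null || getClass() != o.getClass()) { return false; }
    return Objects.equals(schemaURI, ((DaffodilFormatConfig) o).schemaURI);
  }
}

The plan-time code would read getSchemaURI(), resolve the schema that matches
the file being scanned, convert it to a TupleMetadata, and hand that to the
reader as a provided schema.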

Once you have the property, it could be set for the plugin, or provided in
each query. Use table functions to provide the schema:

SELECT *
FROM myFile.json (`schema` => '/path/to/schema/myFile.dfdl')

(Don't quote me on the syntax. Instead, find the unit tests that exercise
this feature to override plugin options.) The implementation of this
feature starts with the properties from the storage plugin, then allows you
to add (or overwrite) properties per-query. You should get this behavior
for free after adding a property to the plugin as described above.

You can then build on this approach by encouraging people to wrap the
"base query" above in a view. However, the view approach is a bit
clunky because it mixes query considerations with schema considerations.
Schema is about a file, not a particular base query on that file. Still,
you'd get views for free, which is a strong argument for going this
route.

There is a completely different approach: create a "Daffodil metastore":
something that applies to all plugins, a bit like Drill's own (seldom-used)
metastore. That is, implement a metastore, using Drill's metastore API
(which I hope we have), that stores schemas as Daffodil files rather than
the standard DB implementation. The underlying storage could be a (shared)
directory, a document store or whatever. Once you convert the Daffodil
format to Drill's internal format, you would leverage the large amount of
existing plan-time code. That is, you trade off having to become a planner
expert against becoming a metastore API expert. The advantage is that
schema is completely separated from storage: Drill "just knows" which
schema to use (as defined by the admin), and the users (of which we hope there
will be many) don't care: they just get the right results without needing
to fiddle with the details.
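
I won't pretend to remember the exact shape of the metastore API, but the
contract you'd end up implementing amounts to something like this (an
entirely made-up interface, just to show the separation of concerns):

import org.apache.drill.exec.record.metadata.TupleMetadata;

// Not the real Metastore API -- a hypothetical sketch of what a "Daffodil
// metastore" buys you: the planner asks for a schema by table identity and
// never cares where the DFDL files live or how they were converted.
public interface DaffodilSchemaRegistry {

  // Plan-time lookup: the Drill schema registered for this table, or null
  // if none has been registered.
  TupleMetadata schemaFor(String storagePlugin, String workspace, String tableName);

  // Admin-time registration: compile the DFDL schema at schemaUri, convert
  // it to a TupleMetadata (the part you already have working), and persist
  // it in the backing store (shared directory, document store, or the Drill
  // metastore itself).
  void register(String storagePlugin, String workspace, String tableName, String schemaUri);
}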

There may be other alternatives that I've missed. You'll find that
this area of the product will be a bit more of a challenge than the runtime
portion. The team has added many implementations of storage and format
plugins, so that area is well understood. There have been only a couple of
metadata implementations, so that area is at an earlier stage of evolution.
Still, you can look at the Parquet metadata and Drill metastore
implementations (neither of which is simple) for ideas about how to
approach the implementation.

I hope this provides a few hints to get you started.

- Paul


On Thu, Oct 12, 2023 at 11:58 AM Mike Beckerle <mbecke...@apache.org> wrote:

> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
>
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
>
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
>
> I just need pointers to the code that illustrate how to do this. Thanks
>
> -Mike Beckerle
>
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0...@gmail.com> wrote:
>
> > Mike,
> >
> > This is a complex question and has two answers.
> >
> > First, the standard enhanced vector framework (EVF) used by most readers
> > assumes a "pull" model: read each record. This is where the next() comes
> > in: readers just implement this to read the next record. But, the code
> > under EVF works with a push model: the readers write to vectors, and signal
> > the next record. EVF translates the lower-level push model to the
> > higher-level, easier-to-use pull model. The best example of this is the
> > JSON reader which uses Jackson to parse JSON and responds to the
> > corresponding events.
> >
> > You can thus take over the task of filling a batch of records. I'd have to
> > poke around the code to refresh my memory. Or, you can take a look at the
> > (quite complex) JSON parser, or the EVF itself to see what it does. There
> > are many unit tests that show this at various levels of abstraction.
> >
> > Basically, you have to:
> >
> > * Start a batch
> > * Ask if you can start the next record (which might be declined if the
> > batch is full)
> > * Write each field. For complex fields, such as records, recursively do the
> > start/end record work.
> > * Mark the record as complete.
> >
> > You should be able to map event handlers to EVF actions as a result. Even
> > though DFDL wants to "drive", it still has to give up control once the
> > batch is full. EVF will then handle the (surprisingly complex) task of
> > finishing up the batch and returning it as the output of the Scan operator.
> >
> > - Paul
> >
> > On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbecke...@apache.org>
> > wrote:
> >
> > > Daffodil parsing generates event callbacks to an InfosetOutputter, which
> > > is analogous to a SAX event handler.
> > >
> > > Drill is expecting an iterator style of calling next() to advance through
> > > the input, i.e., Drill has the control thread and expects to do pull
> > > parsing. At least from the code I studied in the format-xml contrib.
> > >
> > > Is there any alternative, before I dig into creating another one of these
> > > co-routine-style control inversions (which have proven to be problematic
> > > for performance)?
> > >
> >
>
