Hi Mike,

Congrats on the PR. I'll take a look soon.

You asked about initialization. Initialization is a bit tricky in a
distributed system such as Drill. There are a number of things
"initialization" could mean:

* Global, one-time initialization (per Drillbit): Unlike Druid, Drill has
no "lifecycle" that you can plug into, sadly. Instead, you can use a
singleton created on demand. Drill is multi-threaded, so singleton creation
must be protected by a lock.
* Per-query initialization: there is no such thing in Drill, since queries
are distributed. In particular, a query will, in general, execute on a node
different from the one that did the planning.
* Per-fragment initialization: this is also hard. There is no code you can
provide that connects to the fragment lifecycle unless you create your own
operator (by extending the existing one), and that is not at all easy (nor
the best approach).
* Per-reader (i.e., per-file) initialization: Such work is done in the
open() call for each reader (which, in EVF2, is actually done in the reader
constructor). De-initialization can be done in close() for the reader. Of
course, these two methods could refer to a per-Drillbit global singleton,
if that is what is wanted.
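
To make the first bullet concrete, here is a minimal sketch of an on-demand
singleton guarded by a lock. PluginGlobalState and its initCount counter are
hypothetical names for illustration, not Drill classes; the sketch is plain
Java and assumes nothing about Drill's APIs.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for per-Drillbit (per-JVM) state; not a Drill class.
final class PluginGlobalState {
  private static final Object LOCK = new Object();
  private static PluginGlobalState instance;

  // Counts constructor calls so we can check that init ran only once.
  static final AtomicInteger initCount = new AtomicInteger();

  private PluginGlobalState() {
    // One-time, expensive setup would go here.
    initCount.incrementAndGet();
  }

  // Creates the instance on first use. The lock ensures that concurrent
  // callers cannot both see a null instance and create two objects.
  static PluginGlobalState get() {
    synchronized (LOCK) {
      if (instance == null) {
        instance = new PluginGlobalState();
      }
      return instance;
    }
  }
}
```

A reader's constructor (or open()) could then call PluginGlobalState.get()
to reach the shared state, and close() would release only what that reader
itself holds.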

One would think that plugins such as those for an RDBMS (i.e. the JDBC
plugin) would maintain state about the target DB so that sequential queries
against the same schema could use cached metadata. Drill wasn't designed
for this, but it can be done. The Drill metastore probably does some
caching, but I'm not as familiar with that code as I'd like.

For Daffodil, the logical approach would be to cache each schema when it is
first needed. In a cluster, each Drillbit would end up caching each schema,
since Drill randomly routes connections to Drillbits. Of course, with a
cache, we'd have to detect when the cache becomes stale (that is, a new
version of the file is created). And, we'd have to handle race conditions
(a new version of the file is written exactly when Drill tries to read it,
and Drill sees a partial file).
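
As a rough sketch of the caching idea, the class below caches one parsed
schema per path and reloads it when a version stamp changes. Everything
here is hypothetical: SchemaCache, the versionOf function (which I assume
would return something like the file's modification time from whatever
file system holds the schema), and the loader function are illustrative
names, not Drill or Daffodil APIs. It does not address the partial-file
race; that would still need separate handling.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.ToLongFunction;

// Hypothetical per-Drillbit schema cache; not part of Drill or Daffodil.
final class SchemaCache<S> {
  // A cached schema plus the version stamp it was loaded under.
  private static final class Entry<S> {
    final long version;
    final S schema;

    Entry(long version, S schema) {
      this.version = version;
      this.schema = schema;
    }
  }

  private final ConcurrentHashMap<String, Entry<S>> entries =
      new ConcurrentHashMap<>();
  private final ToLongFunction<String> versionOf; // e.g. modification time
  private final Function<String, S> loader;       // parses the schema file

  SchemaCache(ToLongFunction<String> versionOf, Function<String, S> loader) {
    this.versionOf = versionOf;
    this.loader = loader;
  }

  // Returns the cached schema, reloading it if the version stamp changed.
  // compute() runs atomically per key, so concurrent callers asking for
  // the same path do not load the schema twice for the same version.
  S get(String schemaPath) {
    long current = versionOf.applyAsLong(schemaPath);
    Entry<S> entry = entries.compute(schemaPath, (path, cached) -> {
      if (cached != null && cached.version == current) {
        return cached;
      }
      return new Entry<S>(current, loader.apply(path));
    });
    return entry.schema;
  }
}
```

Holding one SchemaCache inside a per-Drillbit singleton would give each
Drillbit its own copy of each schema, as described above. Note that the
loader runs while compute() holds the map's lock for that key, so a slow
load blocks other callers that need the same schema.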

In short, it is best to identify exactly what you want to initialize, both
for planning and for execution. Then, we can point you to a good place to
do that work.

- Paul

On Fri, Oct 13, 2023 at 8:17 PM Mike Beckerle <mbecke...@apache.org> wrote:

> My PR needs input from drill developers.
>
> Please look for TODO and FIXME in this PR and help me get to where I can
> initialize this plugin.
>
> In general I copied things from format-xml contrib, but then took ideas
> from Json. I was unable to figure out how initialization works from the
> Excel plugin.
>
> The metadata bridge is here, and a stub of the data bridge - handles only
> simple type "INT" right now, and of course doesn't compile yet.
>
> https://github.com/apache/drill/pull/2836
>
>
