Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven is probably a better fit than GitHub for distribution?  If Drillbits can write to their jars/3rdparty directory then I can imagine Drill gaining the ability to fetch and install plugins itself without too much trouble, at least for Drill clusters with Internet access.  "Sideloading" by downloading from Maven and copying manually would always remain possible.
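To make that concrete, here is a rough sketch, in plain Java with made-up coordinates, of how resolving a plugin jar from a Maven repository could work; the standard repository layout makes the download URL easy to derive from the artifact coordinates:

```java
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: build the download URL for a plugin jar from its Maven
// coordinates and (optionally) copy it into Drill's jars/3rdparty
// directory. The coordinates below are hypothetical.
public class PluginFetcher {

    static URI mavenJarUri(String repo, String groupId, String artifactId, String version) {
        // Standard Maven repository layout: dots in the groupId become slashes.
        String path = groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + ".jar";
        return URI.create(repo.endsWith("/") ? repo + path : repo + "/" + path);
    }

    public static void main(String[] args) throws Exception {
        URI jar = mavenJarUri("https://repo1.maven.org/maven2",
                "org.example", "drill-example-plugin", "1.0.0");
        System.out.println(jar);
        // Installing would then be a plain stream copy, e.g.:
        // try (InputStream in = jar.toURL().openStream()) {
        //     Files.copy(in, Paths.get("jars/3rdparty/drill-example-plugin-1.0.0.jar"));
        // }
    }
}
```

A real implementation would more likely use a library such as Maven Resolver to handle transitive dependencies and checksum verification, rather than constructing URLs by hand.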

@Paul I'll try to get a little time with you to get some ideas about designing a plugin API.

On 2022/01/14 23:20, Paul Rogers wrote:
Hi All,

James raises an important issue. I've noticed that it used to be easy to
build and test Drill; now it is a struggle because of the many odd
external dependencies we have introduced. That acts as a big damper on
contributions: none of us get paid enough to spend more time fighting
builds than developing the code...

Ted is right that we need a good way to install plugins. There are two
parts. Ted is talking about the high-level part: make it easy to point to
some repo and use the plugin. Since Drill is Java, the Maven repo could be
a good mechanism. In-house stuff is often in an internal repo that does
whatever Maven needs.

The reason that plugins are in the Drill project now is that Drill's "API"
is all of Drill. Plugins can (and some do) access all of Drill through the
fragment context. The APIs to Calcite and other parts of Drill are wide, and
tend to be tightly coupled with Drill internals. By contrast, other tools,
such as Presto/Trino, have defined very clean APIs that extensions use. In
Druid, everything is integrated via Google Guice, and an extension can
replace any part of Druid (though I'm not convinced that's actually a good
idea). I'm sure there are others we can learn from.
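For illustration, a Presto-style plugin boundary might look something like the sketch below. These types are hypothetical and do not exist in Drill today; the point is that the plugin sees only a narrow interface, never the engine internals:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a narrow plugin SPI, loosely modeled on the
// Plugin interface in Presto/Trino. None of these types exist in Drill.
interface StoragePluginFactory {
    String name();  // e.g. "dropbox"
}

interface DrillPlugin {
    // A plugin contributes factories through a narrow boundary; it never
    // reaches into engine internals such as the fragment context.
    default List<StoragePluginFactory> storagePlugins() {
        return Collections.emptyList();
    }
}

public class ExamplePlugin implements DrillPlugin {
    @Override
    public List<StoragePluginFactory> storagePlugins() {
        return List.of(() -> "example");
    }

    public static void main(String[] args) {
        System.out.println(new ExamplePlugin().storagePlugins().get(0).name());
        // prints "example"
    }
}
```

Because the engine only ever calls through interfaces like these, plugin jars can be compiled, versioned and distributed independently of the engine's internals.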

So, we need to define a plugin API for Drill. I started down that route a
while back: the first step was to refactor the plugin registry so it is
ready for extensions. The idea was to use the same mechanism for all kinds
of extensions (security, UDFs, metastore, etc.) The next step was to build
something that roughly followed Presto, but that kind of stalled out.
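One standard Java mechanism for the discovery side of such a registry is java.util.ServiceLoader, which picks up implementations declared under META-INF/services in any jar on the classpath. A minimal sketch, with a hypothetical Extension interface standing in for whatever SPI the registry would actually define:

```java
import java.util.ServiceLoader;

// Sketch: discovering extensions dropped into jars/3rdparty via
// java.util.ServiceLoader. The Extension interface is hypothetical;
// it stands in for the real registry SPI.
interface Extension {
    String name();
}

public class ExtensionRegistry {
    public static void main(String[] args) {
        // ServiceLoader scans META-INF/services/<interface-name> entries in
        // every jar on the classpath; each matching jar contributes its
        // implementations without any central wiring.
        for (Extension ext : ServiceLoader.load(Extension.class)) {
            System.out.println("Loaded extension: " + ext.name());
        }
    }
}
```

With this pattern, installing a plugin reduces to dropping its jar into the classpath; no central configuration file has to be edited.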

In terms of ordering, we'd first need to define the plugin API. Then, we
can shift plugins to use that. Once that is done, we can move plugins to
separate projects. (The metastore implementation can also move, if we
want.) Finally, figure out a solution for Ted's suggestion to make it easy
to grab new extensions. Drill is distributed, so adding a new plugin has to
happen on all nodes, which is a bit more complex than the typical
Julia/Python/R kind of extension.

The reason we are where we are is that this was the path of least resistance.
Creating a good extension mechanism is hard, but valuable, as Ted noted.

Thanks,

- Paul

On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <ted.dunn...@gmail.com> wrote:

The bigger reason for a separate plug-in world is the enhancement of
community.

I would recommend looking at the Julia community for examples of
effective ways to drive plug-in structure.

At the core, for any pure Julia package, you can simply add a package by
referring to the GitHub repository where the package is stored. For
packages that are "registered" (i.e. a path and a checksum are recorded in a
well-known data store), you can add a package by simply naming it without
knowing the path.  All such plugins are tested by the authors, and the
project records all dependencies with version constraints so that cascading
additions are easy. The community leaders have made tooling available so
that you can test your package against a range of Julia versions using
GitHub Actions that are pretty simple to use.

The result has been an absolute explosion in the number of pure Julia
packages.

For packages that include C or Fortran (or whatever) code, there is some
amazing tooling available that lets you record a build process on any of
the supported platforms (Linux, Linux on ARM, 32- or 64-bit, Windows, BSD, macOS
and so on). When you register such a package, it is automagically built on
all the platforms you indicate and the binary results are checked into a
central repository known as Yggdrasil.

All of these registration events for different packages are recorded in a
central registry, as I mentioned. That registry is kept in GitHub as
well, which makes it easy to propagate changes.



On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:

Hello dev community

Discussions about reorganising the Drill source code to better position
the project to support plug-ins for the "long tail" of weird and
wonderful systems and data formats have been coming up here and there
for a few months, e.g. in https://github.com/apache/drill/pull/2359.

A view which I personally share is that adding too large a number and
variety of plug-ins to the main tree would create a lethal maintenance
burden for developers working there and lead down a road of accumulating
technical debt.  The Maven tricks we must employ to harmonise the
growing set of dependencies of the main tree to keep it buildable are
already enough, as is the size of our distributable and the count of
open bug reports.


Thus, the idea of splitting "/contrib" out into a new
apache/drill-contrib repo after selecting a subset of plugins to remain
in apache/drill.  I'll now volunteer a set of criteria for deciding whether
a plug-in should live in this notional apache/drill-contrib.

  1. The plug-in queries an unstructured data format (even if it only
     reads metadata fields) e.g. Image format plug-in.
  2. The plug-in queries a data format that was designed for human
     consumption e.g. Excel format plug-in.
  3. The plug-in cannot be expected to run with speed and reliability
     comparable to querying structured data on the local network e.g.
     Dropbox storage plugin.
  4. The plug-in queries an obscure system or format e.g. we receive a
     plug-in for some data format used only on old Cray supercomputers.
  5. The plug-in cannot, for some reason, be well supported by the Drill
     devs, e.g. it has a JNI dependency on some difficult native libs.


Any one of those suggests to me that apache/drill-contrib is the better
home, but what is your view?  Would we apply significantly more
relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
tag, build and test apache/drill-contrib with every release of
apache/drill, or would it run on its own schedule, perhaps with users
downloading builds made continuously from snapshots of HEAD?


Regards
James



