Re: [DISCUSS] Drill 2 and plug-in organisation

Ted Dunning Mon, 17 Jan 2022 09:40:26 -0800

I don't think that Maven is a forced move just because Drill is in Java. It
may be a good move, but it isn't a foregone conclusion. For one thing, the
conventions that Maven uses are pretty hard-wired and it may be difficult
to have a reliable deny-list of known problematic plugins. Publishing to
Maven is more of a pain than simply pushing to GitHub.
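
To make the deny-list point concrete, here is a minimal sketch of the check
itself, assuming plug-ins are identified by Maven-style groupId:artifactId
coordinates plus a version. The class name and the coordinates are
hypothetical; nothing here is an existing Drill or Maven API. The check is
the easy part. Keeping the list complete and authoritative is where the
difficulty lies.

```java
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical deny-list keyed by "groupId:artifactId" coordinates.
 * Assumption for illustration only; not part of Drill or Maven.
 */
final class PluginDenyList {
  // coordinate -> versions known to be problematic
  private final Map<String, Set<String>> denied;

  PluginDenyList(Map<String, Set<String>> denied) {
    this.denied = Map.copyOf(denied);
  }

  /** Returns true when the given plug-in version is not on the deny-list. */
  boolean isAllowed(String groupId, String artifactId, String version) {
    Set<String> badVersions = denied.get(groupId + ":" + artifactId);
    return badVersions == null || !badVersions.contains(version);
  }
}
```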


Usability here is paramount, both for the ultimate Drill user and for the
writer of plugins.



On Mon, Jan 17, 2022 at 5:06 AM James Turton <dz...@apache.org> wrote:

> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
> is probably a better fit than GitHub for distribution?  If Drillbits can
> write to their jars/3rdparty directory then I can imagine Drill gaining
> the ability to fetch and install plugins itself without too much
> trouble, at least for Drill clusters with Internet access.
> "Sideloading" by downloading from Maven and copying manually would
> always remain possible.
>
> @Paul I'll try to get a little time with you to get some ideas about
> designing a plugin API.
>
> On 2022/01/14 23:20, Paul Rogers wrote:
> > Hi All,
> >
> > James raises an important issue. I've noticed that it used to be easy to
> > build and test Drill, now it is a struggle, because of the many odd
> > external dependencies we have introduced. That acts as a big damper on
> > contributions: none of us get paid enough to spend more time fighting
> > builds than developing the code...
> >
> > Ted is right that we need a good way to install plugins. There are two
> > parts. Ted is talking about the high-level part: make it easy to point to
> > some repo and use the plugin. Since Drill is Java, the Maven repo could be
> > a good mechanism. In-house stuff is often in an internal repo that does
> > whatever Maven needs.
> >
> > The reason that plugins are in the Drill project now is that Drill's "API"
> > is all of Drill. Plugins can (and some do) access all of Drill through the
> > fragment context. The APIs to Calcite and other parts of Drill are wide, and
> > tend to be tightly coupled with Drill internals. By contrast, other tools,
> > such as Presto/Trino, have defined very clean APIs that extensions use. In
> > Druid, everything is integrated via Google Guice and an extension can
> > replace any part of Druid (though I'm not convinced that's actually a good
> > idea). I'm sure there are others we can learn from.
> >
> > So, we need to define a plugin API for Drill. I started down that route a
> > while back: the first step was to refactor the plugin registry so it is
> > ready for extensions. The idea was to use the same mechanism for all kinds
> > of extensions (security, UDFs, metastore, etc.). The next step was to build
> > something that roughly followed Presto, but that kind of stalled out.
> >
> > In terms of ordering, we'd first need to define the plugin API. Then, we
> > can shift plugins to use that. Once that is done, we can move plugins to
> > separate projects. (The metastore implementation can also move, if we
> > want.) Finally, figure out a solution for Ted's suggestion to make it easy
> > to grab new extensions. Drill is distributed, so adding a new plugin has to
> > happen on all nodes, which is a bit more complex than the typical
> > Julia/Python/R kind of extension.
> >
> > The reason we're where we're at is that it is the path of least resistance.
> > Creating a good extension mechanism is hard, but valuable, as Ted noted.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> The bigger reason for a separate plug-in world is the enhancement of
> >> community.
> >>
> >> I would recommend looking at the Julia community for examples of
> >> effective ways to drive plug-in structure.
> >>
> >> At the core, for any pure Julia package, you can simply add a package by
> >> referring to the GitHub repository where the package is stored. For
> >> packages that are "registered" (i.e. a path and a checksum are recorded in
> >> a well-known data store), you can add a package by simply naming it without
> >> knowing the path.  All such plugins are tested by the authors and the
> >> project records all dependencies with version constraints so that cascading
> >> additions are easy. The community leaders have made tooling available so
> >> that you can test your package against a range of versions of Julia by
> >> pretty simple (to use) GitHub Actions.
> >>
> >> The result has been an absolute explosion in the number of pure Julia
> >> packages.
> >>
> >> For packages that include C or Fortran (or whatever) code, there is some
> >> amazing tooling available that lets you record a build process on any of
> >> the supported platforms (Linux, LinuxArm, 32 or 64 bit, Windows, BSD, OSX
> >> and so on). When you register such a package, it is automagically built on
> >> all the platforms you indicate and the binary results are checked into a
> >> central repository known as Yggdrasil.
> >>
> >> All of these registration events for different packages are recorded in a
> >> central registry as I mentioned. That registry is recorded in GitHub as
> >> well, which makes it easy to propagate changes.
> >>
> >>
> >>
> >> On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:
> >>
> >>> Hello dev community
> >>>
> >>> Discussions about reorganising the Drill source code to better position
> >>> the project to support plug-ins for the "long tail" of weird and
> >>> wonderful systems and data formats have been coming up here and there
> >>> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
> >>>
> >>> A view which I personally share is that adding too large a number and
> >>> variety of plug-ins to the main tree would create a lethal maintenance
> >>> burden for developers working there and lead down a road of accumulating
> >>> technical debt.  The Maven tricks we must employ to harmonise the
> >>> growing set of dependencies of the main tree to keep it buildable are
> >>> already enough, as is the size of our distributable and the count of
> >>> open bug reports.
> >>>
> >>>
> >>> Thus, the idea of splitting out "/contrib" into a new
> >>> apache/drill-contrib repo after selecting a subset of plugins to remain
> >>> in apache/drill.  I'll now volunteer a set of criteria to decide whether
> >>> a plug-in should live in this notional apache/drill-contrib.
> >>>
> >>>   1. The plug-in queries an unstructured data format (even if it only
> >>>      reads metadata fields) e.g. Image format plug-in.
> >>>   2. The plug-in queries a data format that was designed for human
> >>>      consumption e.g. Excel format plug-in.
> >>>   3. The plug-in cannot be expected to run with speed and reliability
> >>>      comparable to querying structured data on the local network e.g.
> >>>      Dropbox storage plugin.
> >>>   4. The plug-in queries an obscure system or format e.g. we receive a
> >>>      plug-in for some data format used only on old Cray supercomputers.
> >>>   5. The plug-in can for some reason not be well supported by the Drill
> >>>      devs e.g. it has a JNI dependency on some difficult native libs.
> >>>
> >>>
> >>> Any one of those suggests that an apache/drill-contrib is the better
> >>> home to me, but what is your view?  Would we apply significantly more
> >>> relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
> >>> tag, build and test apache/drill-contrib with every release of
> >>> apache/drill, or would it run on its own schedule, perhaps with users
> >>> downloading builds made continuously from snapshots of HEAD?
> >>>
> >>>
> >>> Regards
> >>> James
> >>>
> >>>
> >>>
>
>
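
The registry idea described in the thread (a path and a checksum recorded in
a well-known data store, so that a package can be added by name alone) can be
sketched roughly as follows. This is only an illustration under stated
assumptions: the class, its methods, the example URL and the checksum strings
are all hypothetical, and it is neither Julia's registry format nor an
existing Drill API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry mapping a plug-in name to its location and checksum. */
final class PluginRegistry {
  /** One registration record: a repository path plus the expected checksum. */
  static final class Entry {
    final String repoUrl;
    final String sha256;

    Entry(String repoUrl, String sha256) {
      this.repoUrl = repoUrl;
      this.sha256 = sha256;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();

  /** Records where a named plug-in lives and what its artifact should hash to. */
  void register(String name, String repoUrl, String sha256) {
    entries.put(name, new Entry(repoUrl, sha256));
  }

  /** Looks a plug-in up by name; empty means the caller must supply a path. */
  Optional<Entry> resolve(String name) {
    return Optional.ofNullable(entries.get(name));
  }

  /** Compares the checksum of a downloaded artifact with the registered one. */
  boolean verify(String name, String actualSha256) {
    Entry entry = entries.get(name);
    return entry != null && entry.sha256.equalsIgnoreCase(actualSha256);
  }
}
```

A caller would resolve a name first and fall back to an explicit repository
path for unregistered plug-ins. For Drill, the same lookup and verification
would have to happen on every node, as noted in the thread.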
