For my part, I'd forgotten that GitHub does give users the opportunity to attach binary distributables to releases.  So my first thought of "GitHub would mean using Git repositories to host Jar files" was off the mark.

Paul, setting aside the hosting and distribution for a moment, may I ask about the statement "ensure plugins can be built outside of the Drill repo"?  Released versions of Drill's own libs are already published to Maven.  E.g.

https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.19.0

Can a plugin writer not create a new project that lists the required Drill libs among its pom.xml dependencies and proceed to build a plugin away from the main tree?  Interactive debugging without the Drill main tree should even be possible by attaching a debugger to a running embedded Drill with the storage plugin deployed to it, or am I wrong here?
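
For concreteness, here is roughly the pom.xml fragment I have in mind (the "provided" scope is my assumption, on the basis that the Drillbit supplies these jars at runtime):

    <dependencies>
      <dependency>
        <groupId>org.apache.drill.exec</groupId>
        <artifactId>drill-java-exec</artifactId>
        <version>1.19.0</version>
        <scope>provided</scope>
      </dependency>
    </dependencies>

And for the debugging half, launching the embedded Drill JVM with the standard JDWP agent, e.g. -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005, should let an IDE attach without the main tree present.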

On 2022/01/18 00:32, Paul Rogers wrote:
Hi Ted,

Thanks for the explanation, makes sense.

Ideally, the client side would be somewhat agnostic about the repo it pulls from. In a corporate setting, it should pull from the "JFrog Repository" that everyone seems to use (but about which I know basically nothing). Oh, lord, a plugin architecture for the repo for the plugin architecture?

- Paul

On Mon, Jan 17, 2022 at 1:46 PM Ted Dunning <[email protected]> wrote:

Paul,

I understood your suggestion.  My point is that publishing to Maven Central is a bit of a pain while publishing by posting to GitHub is nearly painless. In particular, because GitHub inherently produces a difficult-to-fake hash for each commit, referring to a dependency by that hash is relatively safe, which saves a lot of agony regarding keys and trust.
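
To sketch the trust model (just an illustration of why a pinned hash sidesteps key management; this is not Git's internal mechanism), the client only needs to recompute a digest of what it downloaded and compare it with the recorded value:

    // Sketch only: verify a downloaded plugin jar against a pinned digest.
    // No signatures or key distribution needed, only the recorded hash.
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    public class PinnedHashCheck {
        public static void main(String[] args) throws Exception {
            Path jar = Paths.get(args[0]);  // downloaded plugin jar
            String pinned = args[1];        // hash recorded with the dependency
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(Files.readAllBytes(jar));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            if (!hex.toString().equalsIgnoreCase(pinned)) {
                throw new IllegalStateException("Hash mismatch, refusing " + jar);
            }
            System.out.println("Digest matches pinned value.");
        }
    }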

Further, GitHub or any comparable service provides the same "already exists" benefit as does Maven.



On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers <[email protected]> wrote:

Hi Ted,

Well said. Just to be clear, I wasn't suggesting that we use Maven-the-build-tool to distribute plugins. Rather, I was simply observing that building a global repo is a bit of a project and asked, "what could we use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever Linux repos? Maybe. Maven's repo? Why not?

The idea would be that Drill might have a tool that says, "install the FooBlaster plugin". It downloads from a repo (Maven Central, say) and puts the plugin in the proper plugins directory. In a cluster, either it does that on every node, or the work is done as part of preparing a Docker container which is then pushed to every node.
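
As a rough sketch of the core of such a tool (the plugin coordinates are invented; I'm assuming Maven Central's standard repository layout and the jars/3rdparty directory James mentioned):

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class PluginInstaller {
        private static final String REPO = "https://repo1.maven.org/maven2/";

        // e.g. installPlugin("org.example", "drill-fooblaster-plugin", "1.0.0")
        static void installPlugin(String groupId, String artifactId, String version)
                throws Exception {
            String jarName = artifactId + "-" + version + ".jar";
            String url = REPO + groupId.replace('.', '/') + '/' + artifactId
                    + '/' + version + '/' + jarName;
            Path target = Paths.get(System.getenv("DRILL_HOME"),
                    "jars", "3rdparty", jarName);
            // Download the jar and drop it into the plugins directory.
            try (InputStream in = new URL(url).openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            // In a cluster this must run on every node, or be part of building
            // the Docker image that is pushed everywhere.
        }
    }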

The key thought is just to make the problem simpler by avoiding the need to create and maintain a Drill-specific repo when we barely have enough resources to keep Drill itself afloat.

None of this can happen, however, unless we clean up the plugin APIs and ensure plugins can be built outside of the Drill repo. (That means, say, that Drill needs an API library that resides in Maven.)
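
To make the shape of that concrete (these names are purely illustrative, not an existing Drill API), the published artifact might contain little more than narrow interfaces along these lines:

    // Hypothetical contents of a "drill-plugin-api" artifact: a small, stable
    // surface that plugin authors compile against instead of all of Drill.
    public interface DrillPlugin {

        /** Name used to reference the plugin in storage configuration. */
        String name();

        /**
         * Called once at registration. The context is the plugin's only
         * window into Drill, replacing today's wide fragment context.
         */
        void initialize(PluginContext context);

        /** Deliberately narrow view of the engine. */
        interface PluginContext {
            String drillVersion();
            java.util.Map<String, String> configProperties();
        }
    }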

There are probably many ways this has been done. Anyone know of any good examples we can learn from?

Thanks,

- Paul


On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning <[email protected]> wrote:

I don't think that Maven is a forced move just because Drill is in Java. It may be a good move, but it isn't a foregone conclusion. For one thing, the conventions that Maven uses are pretty hard-wired, and it may be difficult to have a reliable deny-list of known problematic plugins. Publishing to Maven is more of a pain than simply pushing to GitHub.

The usability here is paramount, both for the ultimate Drill user and for the writer of plugins.



On Mon, Jan 17, 2022 at 5:06 AM James Turton <[email protected]> wrote:

Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven is probably a better fit than GitHub for distribution?  If Drillbits can write to their jars/3rdparty directory then I can imagine Drill gaining the ability to fetch and install plugins itself without too much trouble, at least for Drill clusters with Internet access.  "Sideloading" by downloading from Maven and copying manually would always remain possible.

@Paul I'll try to get a little time with you to get some ideas about designing a plugin API.

On 2022/01/14 23:20, Paul Rogers wrote:
Hi All,

James raises an important issue. I've noticed that it used to be easy to build and test Drill; now it is a struggle, because of the many odd external dependencies we have introduced. That acts as a big damper on contributions: none of us get paid enough to spend more time fighting builds than developing the code...

Ted is right that we need a good way to install plugins. There are two parts. Ted is talking about the high-level part: make it easy to point to some repo and use the plugin. Since Drill is Java, the Maven repo could be a good mechanism. In-house stuff is often in an internal repo that does whatever Maven needs.

The reason that plugins are in the Drill project now is that Drill's "API" is all of Drill. Plugins can (and some do) access all of Drill through the fragment context. The APIs to Calcite and other parts of Drill are wide, and tend to be tightly coupled with Drill internals. By contrast, other tools, such as Presto/Trino, have defined very clean APIs that extensions use. In Druid, everything is integrated via Google Guice and an extension can replace any part of Druid (though I'm not convinced that's actually a good idea). I'm sure there are others we can learn from.
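
For reference, the Guice wiring I'm describing in Druid looks roughly like this (a minimal sketch; "MetadataStore" and its binding are invented, though the Guice calls themselves are the real API):

    import com.google.inject.AbstractModule;
    import com.google.inject.Guice;
    import com.google.inject.Injector;

    interface MetadataStore { String describe(); }

    class DefaultMetadataStore implements MetadataStore {
        public String describe() { return "built-in store"; }
    }

    // An extension ships a module like this; by rebinding core interfaces
    // it can replace nearly any part of the host application.
    class ExtensionModule extends AbstractModule {
        @Override protected void configure() {
            bind(MetadataStore.class).to(DefaultMetadataStore.class);
        }
    }

    class Demo {
        public static void main(String[] args) {
            Injector injector = Guice.createInjector(new ExtensionModule());
            System.out.println(injector.getInstance(MetadataStore.class).describe());
        }
    }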

So, we need to define a plugin API for Drill. I started down that route a while back: the first step was to refactor the plugin registry so it is ready for extensions. The idea was to use the same mechanism for all kinds of extensions (security, UDFs, metastore, etc.) The next step was to build something that roughly followed Presto, but that kind of stalled out.

In terms of ordering, we'd first need to define the plugin API. Then, we can shift plugins to use that. Once that is done, we can move plugins to separate projects. (The metastore implementation can also move, if we want.) Finally, figure out a solution for Ted's suggestion to make it easy to grab new extensions. Drill is distributed, so adding a new plugin has to happen on all nodes, which is a bit more complex than the typical Julia/Python/R kind of extension.

The reason we're where we are is that it was the path of least resistance. Creating a good extension mechanism is hard, but valuable, as Ted noted.

Thanks,

- Paul

On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <[email protected]> wrote:

The bigger reason for a separate plug-in world is the enhancement of community.

I would recommend looking at the Julia community for examples of effective ways to drive plug-in structure.

At the core, for any pure Julia package, you can simply add a package by referring to the GitHub repository where the package is stored. For packages that are "registered" (i.e. a path and a checksum are recorded in a well-known data store), you can add a package by simply naming it without knowing the path.  All such plugins are tested by the authors and the project records all dependencies with version constraints so that cascading additions are easy. The community leaders have made tooling available so that you can test your package against a range of versions of Julia via pretty simple (to use) GitHub Actions.

The result has been an absolute explosion in the number of pure Julia packages.

For packages that include C or Fortran (or whatever) code, there is some amazing tooling available that lets you record a build process on any of the supported platforms (Linux, Linux ARM, 32 or 64 bit, Windows, BSD, OSX and so on). When you register such a package, it is automagically built on all the platforms you indicate and the binary results are checked into a central repository known as Yggdrasil.

All of these registration events for different packages are recorded in a central registry as I mentioned. That registry is recorded in GitHub as well, which makes it easy to propagate changes.



On Thu, Jan 13, 2022 at 8:45 PM James Turton <[email protected]> wrote:

Hello dev community

Discussions about reorganising the Drill source code to better position the project to support plug-ins for the "long tail" of weird and wonderful systems and data formats have been coming up here and there for a few months, e.g. in https://github.com/apache/drill/pull/2359.

A view which I personally share is that adding too large a number and variety of plug-ins to the main tree would create a lethal maintenance burden for developers working there and lead down a road of accumulating technical debt.  The Maven tricks we must employ to harmonise the growing set of dependencies of the main tree to keep it buildable are already enough, as is the size of our distributable and the count of open bug reports.


Thus, the idea of splitting out "/contrib" into a new apache/drill-contrib repo after selecting a subset of plugins to remain in apache/drill.  I'll now volunteer a set of criteria to decide whether a plug-in should live in this notional apache/drill-contrib.

   1. The plug-in queries an unstructured data format (even if it only reads metadata fields), e.g. the Image format plug-in.
   2. The plug-in queries a data format that was designed for human consumption, e.g. the Excel format plug-in.
   3. The plug-in cannot be expected to run with speed and reliability comparable to querying structured data on the local network, e.g. the Dropbox storage plugin.
   4. The plug-in queries an obscure system or format, e.g. we receive a plug-in for some data format used only on old Cray supercomputers.
   5. The plug-in can for some reason not be well supported by the Drill devs, e.g. it has a JNI dependency on some difficult native libs.

Any one of those suggests that an apache/drill-contrib is the better home to me, but what is your view?  Would we apply significantly more relaxed standards when reviewing PRs to apache/drill-contrib?  Would we tag, build and test apache/drill-contrib with every release of apache/drill, or would it run on its own schedule, perhaps with users downloading builds made continuously from snapshots of HEAD?


Regards
James




