Paul, I understood your suggestion. My point is that publishing to Maven Central is a bit of a pain, while publishing by pushing to GitHub is nearly painless. In particular, because GitHub inherently produces a difficult-to-fake hash for each commit, referring to a dependency by that hash is relatively safe, which saves a lot of agony regarding keys and trust.

Further, GitHub or any comparable service provides the same "already exists" benefit as Maven does.
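To make the hash idea concrete, here is a minimal sketch (the "owner/repo@sha" coordinate format and all names are invented, not anything Drill supports today). The point is only that the full commit SHA, rather than a signing key, identifies exactly which content you asked for:

    // Invented sketch: resolve a plugin reference of the form
    // "owner/repo@<full commit sha>" to the GitHub source archive for
    // that exact commit. The SHA pins the content; no key exchange needed.
    import java.net.URI;

    public class GitHubPluginRef {

        public static URI archiveUrl(String ref) {
            String[] parts = ref.split("@", 2);
            if (parts.length != 2 || !parts[1].matches("[0-9a-f]{40}")) {
                throw new IllegalArgumentException(
                    "Expected owner/repo@<40-hex commit sha>: " + ref);
            }
            // GitHub serves a source archive for any commit at this path.
            return URI.create("https://github.com/" + parts[0]
                + "/archive/" + parts[1] + ".zip");
        }

        public static void main(String[] args) {
            System.out.println(archiveUrl(
                "example-org/drill-foo-plugin@0123456789abcdef0123456789abcdef01234567"));
        }
    }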
On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers <par0...@gmail.com> wrote:
> Hi Ted,
>
> Well said. Just to be clear, I wasn't suggesting that we use Maven-the-build-tool to distribute plugins. Rather, I was simply observing that building a global repo is a bit of a project, and asked, "What could we use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever Linux repos? Maybe. Maven's repo? Why not?
>
> The idea would be that Drill might have a tool that says, "install the FooBlaster plugin". It downloads from a repo (Maven Central, say) and puts the plugin in the proper plugins directory. In a cluster, either it does that on every node, or the work is done as part of preparing a Docker container which is then pushed to every node.
>
> The key thought is just to make the problem simpler by avoiding the need to create and maintain a Drill-specific repo when we barely have enough resources to keep Drill itself afloat.
>
> None of this can happen, however, unless we clean up the plugin APIs and ensure plugins can be built outside of the Drill repo. (That means, say, that Drill needs an API library that resides in Maven.)
>
> There are probably many ways this has been done. Anyone know of any good examples we can learn from?
>
> Thanks,
>
> - Paul
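For concreteness, the install step you describe could be as small as the sketch below. The coordinates are placeholders and this is not an existing Drill tool; the jars/3rdparty path comes from James's note further down, and on a cluster this same step would have to run on every node or during the Docker image build:

    // Hypothetical installer sketch (no such Drill tool exists today):
    // resolve Maven coordinates to a Maven Central URL and copy the jar
    // into a local plugins directory.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class PluginInstaller {

        public static Path install(String group, String artifact, String version,
                Path pluginDir) throws Exception {
            String jar = artifact + "-" + version + ".jar";
            URL url = new URL("https://repo1.maven.org/maven2/"
                + group.replace('.', '/') + "/" + artifact + "/" + version + "/" + jar);
            Files.createDirectories(pluginDir);
            Path target = pluginDir.resolve(jar);
            try (InputStream in = url.openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return target;
        }

        public static void main(String[] args) throws Exception {
            // Made-up coordinates, purely for illustration.
            System.out.println(install("org.example", "drill-fooblaster-plugin", "1.0.0",
                Path.of("/opt/drill/jars/3rdparty")));
        }
    }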
> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> I don't think that Maven is a forced move just because Drill is in Java. It may be a good move, but it isn't a foregone conclusion. For one thing, the conventions that Maven uses are pretty hard-wired, and it may be difficult to have a reliable deny-list of known problematic plugins. Publishing to Maven is more of a pain than simply pushing to GitHub.
>>
>> The usability here is paramount, both for the ultimate Drill user and for the writer of plugins.
>>
>> On Mon, Jan 17, 2022 at 5:06 AM James Turton <dz...@apache.org> wrote:
>>
>>> Thank you Ted and Paul for the feedback. Since Java is compiled, Maven is probably a better fit than GitHub for distribution? If Drillbits can write to their jars/3rdparty directory, then I can imagine Drill gaining the ability to fetch and install plugins itself without too much trouble, at least for Drill clusters with Internet access. "Sideloading" by downloading from Maven and copying manually would always remain possible.
>>>
>>> @Paul I'll try to get a little time with you to get some ideas about designing a plugin API.
>>>
>>> On 2022/01/14 23:20, Paul Rogers wrote:
>>> > Hi All,
>>> >
>>> > James raises an important issue. I've noticed that it used to be easy to build and test Drill; now it is a struggle, because of the many odd external dependencies we have introduced. That acts as a big damper on contributions: none of us get paid enough to spend more time fighting builds than developing the code...
>>> >
>>> > Ted is right that we need a good way to install plugins. There are two parts. Ted is talking about the high-level part: make it easy to point to some repo and use the plugin. Since Drill is Java, the Maven repo could be a good mechanism. In-house stuff is often in an internal repo that does whatever Maven needs.
>>> >
>>> > The reason that plugins are in the Drill project now is that Drill's "API" is all of Drill. Plugins can (and some do) access all of Drill through the fragment context. The APIs to Calcite and other parts of Drill are wide, and tend to be tightly coupled with Drill internals. By contrast, other tools, such as Presto/Trino, have defined very clean APIs that extensions use. In Druid, everything is integrated via Google Guice, and an extension can replace any part of Druid (though I'm not convinced that's actually a good idea). I'm sure there are others we can learn from.
>>> >
>>> > So, we need to define a plugin API for Drill. I started down that route a while back: the first step was to refactor the plugin registry so it is ready for extensions. The idea was to use the same mechanism for all kinds of extensions (security, UDFs, metastore, etc.). The next step was to build something that roughly followed Presto, but that kind of stalled out.
>>> >
>>> > In terms of ordering, we'd first need to define the plugin API. Then, we can shift plugins to use that. Once that is done, we can move plugins to separate projects. (The metastore implementation can also move, if we want.) Finally, figure out a solution for Ted's suggestion to make it easy to grab new extensions. Drill is distributed, so adding a new plugin has to happen on all nodes, which is a bit more complex than the typical Julia/Python/R kind of extension.
>>> >
>>> > The reason we're where we're at is that it is the path of least resistance. Creating a good extension mechanism is hard, but valuable, as Ted noted.
>>> >
>>> > Thanks,
>>> >
>>> > - Paul
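As a strawman for the kind of narrow surface Paul describes, loosely in the spirit of the Presto/Trino connector approach (every name below is invented; no such interface exists in Drill today):

    // Strawman only: a deliberately small, invented plugin SPI. A plugin built
    // against an artifact like this would see none of Drill's internals.
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    public interface StoragePluginSpi {

        // Stable identity used in storage configuration.
        String name();

        // Called once with plugin-specific configuration.
        void initialize(Map<String, String> config);

        // Enumerate the tables (or files, topics, ...) this plugin exposes.
        List<String> listTables();

        // Return a row reader for one table; the engine drives the iteration.
        Iterator<Map<String, Object>> scan(String table);
    }

The particular methods don't matter; what matters is that a plugin compiled against a small, published API artifact like this could live outside the main tree and be versioned on its own.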
>>> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> >
>>> >> The bigger reason for a separate plug-in world is the enhancement of community.
>>> >>
>>> >> I would recommend looking at the Julia community for examples of effective ways to drive plug-in structure.
>>> >>
>>> >> At the core, for any pure Julia package, you can simply add a package by referring to the GitHub repository where the package is stored. For packages that are "registered" (i.e. a path and a checksum are recorded in a well-known data store), you can add a package by simply naming it, without knowing the path. All such plugins are tested by the authors, and the project records all dependencies with version constraints so that cascading additions are easy. The community leaders have made tooling available so that you can test your package against a range of versions of Julia via pretty simple (to use) GitHub Actions.
>>> >>
>>> >> The result has been an absolute explosion in the number of pure Julia packages.
>>> >>
>>> >> For packages that include C or Fortran (or whatever) code, there is some amazing tooling available that lets you record a build process on any of the supported platforms (Linux, Linux ARM, 32 or 64 bit, Windows, BSD, OSX and so on). When you register such a package, it is automagically built on all the platforms you indicate and the binary results are checked into a central repository known as Yggdrasil.
>>> >>
>>> >> All of these registration events for different packages are recorded in a central registry, as I mentioned. That registry is kept in GitHub as well, which makes it easy to propagate changes.
>>> >>
>>> >> On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:
>>> >>
>>> >>> Hello dev community
>>> >>>
>>> >>> Discussions about reorganising the Drill source code to better position the project to support plug-ins for the "long tail" of weird and wonderful systems and data formats have been coming up here and there for a few months, e.g. in https://github.com/apache/drill/pull/2359.
>>> >>>
>>> >>> A view which I personally share is that adding too large a number and variety of plug-ins to the main tree would create a lethal maintenance burden for developers working there and lead down a road of accumulating technical debt. The Maven tricks we must employ to harmonise the growing set of dependencies of the main tree to keep it buildable are already enough, as is the size of our distributable and the count of open bug reports.
>>> >>>
>>> >>> Thus the idea of splitting out "/contrib" into a new apache/drill-contrib repo after selecting a subset of plug-ins to remain in apache/drill. I'll now volunteer a set of criteria to decide whether a plug-in should live in this notional apache/drill-contrib.
>>> >>>
>>> >>> 1. The plug-in queries an unstructured data format (even if it only reads metadata fields), e.g. the Image format plug-in.
>>> >>> 2. The plug-in queries a data format that was designed for human consumption, e.g. the Excel format plug-in.
>>> >>> 3. The plug-in cannot be expected to run with speed and reliability comparable to querying structured data on the local network, e.g. the Dropbox storage plug-in.
>>> >>> 4. The plug-in queries an obscure system or format, e.g. we receive a plug-in for some data format used only on old Cray supercomputers.
>>> >>> 5. The plug-in cannot for some reason be well supported by the Drill devs, e.g. it has a JNI dependency on some difficult native libs.
>>> >>>
>>> >>> Any one of those suggests to me that apache/drill-contrib is the better home, but what is your view? Would we apply significantly more relaxed standards when reviewing PRs to apache/drill-contrib? Would we tag, build and test apache/drill-contrib with every release of apache/drill, or would it run on its own schedule, perhaps with users downloading builds made continuously from snapshots of HEAD?
>>> >>>
>>> >>> Regards
>>> >>> James