Re: [DISCUSS] Drill 2 and plug-in organisation

Paul Rogers Mon, 17 Jan 2022 14:32:58 -0800

Hi Ted,

Thanks for the explanation, makes sense.


Ideally, the client side would be somewhat agnostic about the repo it pulls
from. In a corporate setting, it should pull from the "JFrog Repository"
that everyone seems to use (but about which I know basically nothing). Oh, lord,
a plugin architecture for the repo for the plugin architecture?
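
To make the "agnostic about the repo" idea a bit more concrete, here is a
minimal Java sketch. Every name in it (PluginSource, InMemoryPluginSource,
PluginInstaller) is hypothetical and is not an existing Drill API. The
in-memory source just stands in for Maven, GitHub or an in-house mirror, and
the SHA-256 check is only one assumed way to get the "pin by hash" safety Ted
describes below.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical abstraction: any repo that can serve a plugin jar. */
interface PluginSource {
    /** Returns the raw bytes of the named plugin artifact. */
    byte[] fetch(String pluginName, String version) throws IOException;
}

/** Stand-in source backed by a map, so the sketch runs without a network. */
class InMemoryPluginSource implements PluginSource {
    private final Map<String, byte[]> artifacts = new HashMap<>();

    void publish(String pluginName, String version, byte[] jarBytes) {
        artifacts.put(pluginName + ":" + version, jarBytes);
    }

    @Override
    public byte[] fetch(String pluginName, String version) throws IOException {
        byte[] bytes = artifacts.get(pluginName + ":" + version);
        if (bytes == null) {
            throw new IOException("No such plugin: " + pluginName + ":" + version);
        }
        return bytes;
    }
}

/** Hypothetical installer: fetches from whichever source it is handed. */
class PluginInstaller {
    private final PluginSource source;
    private final Path pluginsDir;

    PluginInstaller(PluginSource source, Path pluginsDir) {
        this.source = source;
        this.pluginsDir = pluginsDir;
    }

    /** Writes the jar into the plugins directory only if its digest matches. */
    Path install(String pluginName, String version, String expectedSha256)
            throws IOException {
        byte[] jarBytes = source.fetch(pluginName, version);
        String actual = sha256Hex(jarBytes);
        if (!actual.equalsIgnoreCase(expectedSha256)) {
            throw new IOException("Checksum mismatch for " + pluginName + ": " + actual);
        }
        Files.createDirectories(pluginsDir);
        Path target = pluginsDir.resolve(pluginName + "-" + version + ".jar");
        return Files.write(target, jarBytes);
    }

    /** Hex-encoded SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should always be available", e);
        }
    }
}

/** Tiny demo: publish a fake jar, then install it into a temp directory. */
class PluginInstallDemo {
    public static void main(String[] args) throws IOException {
        InMemoryPluginSource source = new InMemoryPluginSource();
        byte[] jar = "fake jar contents".getBytes(StandardCharsets.UTF_8);
        source.publish("FooBlaster", "1.0", jar);

        Path dir = Files.createTempDirectory("drill-plugins");
        PluginInstaller installer = new PluginInstaller(source, dir);
        Path installed = installer.install("FooBlaster", "1.0",
                PluginInstaller.sha256Hex(jar));
        System.out.println("Installed " + installed.getFileName());
    }
}
```

A Maven-backed or GitHub-backed source would just be another PluginSource
implementation, so the installer itself would not need to know or care which
repo a given site uses.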

- Paul

On Mon, Jan 17, 2022 at 1:46 PM Ted Dunning <ted.dunn...@gmail.com> wrote:

>
> Paul,
>
> I understood your suggestion.  My point is that publishing to Maven
> central is a bit of a pain while publishing by posting to Github is nearly
> painless.  In particular, because Github inherently produces a relatively
> difficult to fake hash for each commit, referring to a dependency using
> that hash is relatively safe which saves a lot of agony regarding keys and
> trust.
>
> Further, Github or any comparable service provides the same "already
> exists" benefit as does Maven.
>
>
>
> On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers <par0...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> Well said. Just to be clear, I wasn't suggesting that we use
>> Maven-the-build-tool to distribute plugins. Rather, I was simply observing
>> that building a global repo is a bit of a project and asked, "what could we
>> use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
>> Linux repos? Maybe. Maven's repo? Why not?
>>
>> The idea would be that Drill might have a tool that says, "install the
>> FooBlaster plugin." It downloads from a repo (Maven central, say) and puts
>> the plugin in the proper plugins directory. In a cluster, either it does
>> that on every node, or the work is done as part of preparing a Docker
>> container which is then pushed to every node.
>>
>> The key thought is just to make the problem simpler by avoiding the need
>> to create and maintain a Drill-specific repo when we barely have enough
>> resources to keep Drill itself afloat.
>>
>> None of this can happen, however, unless we clean up the plugin APIs and
>> ensure plugins can be built outside of the Drill repo. (That means, say,
>> that Drill needs an API library that resides in Maven.)
>>
>> There are probably many ways this has been done. Anyone know of any good
>> examples we can learn from?
>>
>> Thanks,
>>
>> - Paul
>>
>>
>> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>>
>>>
>>> I don't think that Maven is a forced move just because Drill is in Java.
>>> It may be a good move, but it isn't a foregone conclusion. For one thing,
>>> the conventions that Maven uses are pretty hard-wired and it may be
>>> difficult to have a reliable deny-list of known problematic plugins.
>>> Publishing to Maven is more of a pain than simply pushing to github.
>>>
>>> The usability here is paramount both for the ultimate Drill user and
>>> for the writer of plugins.
>>>
>>>
>>>
>>> On Mon, Jan 17, 2022 at 5:06 AM James Turton <dz...@apache.org> wrote:
>>>
>>>> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
>>>> is probably a better fit than GitHub for distribution?  If Drillbits can
>>>> write to their jars/3rdparty directory then I can imagine Drill gaining
>>>> the ability to fetch and install plugins itself without too much
>>>> trouble, at least for Drill clusters with Internet access.
>>>> "Sideloading" by downloading from Maven and copying manually would
>>>> always remain possible.
>>>>
>>>> @Paul I'll try to get a little time with you to get some ideas about
>>>> designing a plugin API.
>>>>
>>>> On 2022/01/14 23:20, Paul Rogers wrote:
>>>> > Hi All,
>>>> >
>>>> > James raises an important issue. I've noticed that it used to be easy to
>>>> > build and test Drill; now it is a struggle because of the many odd
>>>> > external dependencies we have introduced. That acts as a big damper on
>>>> > contributions: none of us get paid enough to spend more time fighting
>>>> > builds than developing the code...
>>>> >
>>>> > Ted is right that we need a good way to install plugins. There are two
>>>> > parts. Ted is talking about the high-level part: make it easy to point
>>>> > to some repo and use the plugin. Since Drill is Java, the Maven repo
>>>> > could be a good mechanism. In-house stuff is often in an internal repo
>>>> > that does whatever Maven needs.
>>>> >
>>>> > The reason that plugins are in the Drill project now is that Drill's
>>>> > "API" is all of Drill. Plugins can (and some do) access all of Drill
>>>> > through the fragment context. The APIs to Calcite and other parts of
>>>> > Drill are wide, and tend to be tightly coupled with Drill internals. By
>>>> > contrast, other tools, such as Presto/Trino, have defined very clean APIs
>>>> > that extensions use. In Druid, everything is integrated via Google Guice
>>>> > and an extension can replace any part of Druid (though I'm not convinced
>>>> > that's actually a good idea). I'm sure there are others we can learn from.
>>>> >
>>>> > So, we need to define a plugin API for Drill. I started down that route
>>>> > a while back: the first step was to refactor the plugin registry so it
>>>> > is ready for extensions. The idea was to use the same mechanism for all
>>>> > kinds of extensions (security, UDFs, metastore, etc.). The next step was
>>>> > to build something that roughly followed Presto, but that kind of
>>>> > stalled out.
>>>> >
>>>> > In terms of ordering, we'd first need to define the plugin API. Then, we
>>>> > can shift plugins to use that. Once that is done, we can move plugins to
>>>> > separate projects. (The metastore implementation can also move, if we
>>>> > want.) Finally, figure out a solution for Ted's suggestion to make it
>>>> > easy to grab new extensions. Drill is distributed, so adding a new plugin
>>>> > has to happen on all nodes, which is a bit more complex than the typical
>>>> > Julia/Python/R kind of extension.
>>>> >
>>>> > The reason we're where we're at is that it is the path of least
>>>> > resistance. Creating a good extension mechanism is hard, but valuable,
>>>> > as Ted noted.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > - Paul
>>>> >
>>>> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>> >
>>>> >> The bigger reason for a separate plug-in world is the enhancement of
>>>> >> community.
>>>> >>
>>>> >> I would recommend looking at the Julia community for examples of
>>>> >> effective ways to drive plug-in structure.
>>>> >>
>>>> >> At the core, for any pure Julia package, you can simply add a package
>>>> >> by referring to the github repository where the package is stored. For
>>>> >> packages that are "registered" (i.e. a path and a checksum are recorded
>>>> >> in a well-known data store), you can add a package by simply naming it
>>>> >> without knowing the path.  All such plugins are tested by the authors
>>>> >> and the project records all dependencies with version constraints so
>>>> >> that cascading additions are easy. The community leaders have made
>>>> >> tooling available so that you can test your package against a range of
>>>> >> versions of Julia by pretty simple (to use) Github actions.
>>>> >>
>>>> >> The result has been an absolute explosion in the number of pure Julia
>>>> >> packages.
>>>> >>
>>>> >> For packages that include C or Fortran (or whatever) code, there is
>>>> >> some amazing tooling available that lets you record a build process on
>>>> >> any of the supported platforms (Linux, LinuxArm, 32 or 64 bit, Windows,
>>>> >> BSD, OSX and so on). When you register such a package, it is
>>>> >> automagically built on all the platforms you indicate and the binary
>>>> >> results are checked into a central repository known as Yggdrasil.
>>>> >>
>>>> >> All of these registration events for different packages are recorded in
>>>> >> a central registry as I mentioned. That registry is recorded in Github
>>>> >> as well, which makes it easy to propagate changes.
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:
>>>> >>
>>>> >>> Hello dev community
>>>> >>>
>>>> >>> Discussions about reorganising the Drill source code to better position
>>>> >>> the project to support plug-ins for the "long tail" of weird and
>>>> >>> wonderful systems and data formats have been coming up here and there
>>>> >>> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
>>>> >>>
>>>> >>> A view which I personally share is that adding too large a number and
>>>> >>> variety of plug-ins to the main tree would create a lethal maintenance
>>>> >>> burden for developers working there and lead down a road of
>>>> >>> accumulating technical debt.  The Maven tricks we must employ to
>>>> >>> harmonise the growing set of dependencies of the main tree to keep it
>>>> >>> buildable are already enough, as is the size of our distributable and
>>>> >>> the count of open bug reports.
>>>> >>>
>>>> >>>
>>>> >>> Thus, the idea of splitting out "/contrib" into a new
>>>> >>> apache/drill-contrib repo after selecting a subset of plugins to remain
>>>> >>> in apache/drill.  I'll now volunteer a set of criteria to decide
>>>> >>> whether a plug-in should live in this notional apache/drill-contrib.
>>>> >>>
>>>> >>>   1. The plug-in queries an unstructured data format (even if it only
>>>> >>>      reads metadata fields) e.g. Image format plug-in.
>>>> >>>   2. The plug-in queries a data format that was designed for human
>>>> >>>      consumption e.g. Excel format plug-in.
>>>> >>>   3. The plug-in cannot be expected to run with speed and reliability
>>>> >>>      comparable to querying structured data on the local network e.g.
>>>> >>>      Dropbox storage plugin.
>>>> >>>   4. The plug-in queries an obscure system or format e.g. we receive a
>>>> >>>      plug-in for some data format used only on old Cray supercomputers.
>>>> >>>   5. The plug-in can for some reason not be well supported by the Drill
>>>> >>>      devs e.g. it has a JNI dependency on some difficult native libs.
>>>> >>>
>>>> >>>
>>>> >>> Any one of those suggests that an apache/drill-contrib is the better
>>>> >>> home to me, but what is your view?  Would we apply significantly more
>>>> >>> relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
>>>> >>> tag, build and test apache/drill-contrib with every release of
>>>> >>> apache/drill, or would it run on its own schedule, perhaps with users
>>>> >>> downloading builds made continuously from snapshots of HEAD?
>>>> >>>
>>>> >>>
>>>> >>> Regards
>>>> >>> James
>>>> >>>
>>>> >>>
>>>> >>>
>>>>
>>>>
