I don't think that Maven is a forced move just because Drill is in Java. It may be a good move, but it isn't a foregone conclusion. For one thing, the conventions that Maven uses are pretty hard-wired and it may be difficult to have a reliable deny-list of known problematic plugins. Publishing to Maven is more of a pain than simply pushing to GitHub.
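
To make the deny-list idea concrete, here is a minimal sketch of what a check might look like if plugins were identified by Maven coordinates. Everything in it is an assumption for illustration: the class name `PluginDenyList`, the `groupId:artifactId:version` entry format, the `*` version wildcard and the example coordinates are hypothetical and are not an existing Drill or Maven API.

```java
import java.util.Set;

/**
 * Hypothetical sketch only: a deny-list keyed on Maven-style coordinates.
 * Not part of Drill; names and entry format are assumptions.
 */
final class PluginDenyList {
    // Entries are "groupId:artifactId:version"; a version of "*" denies all versions.
    private final Set<String> denied;

    PluginDenyList(Set<String> denied) {
        // Defensive, immutable copy so later changes to the caller's set have no effect.
        this.denied = Set.copyOf(denied);
    }

    /** Returns true if this exact version, or every version, of the plugin is listed. */
    boolean isDenied(String groupId, String artifactId, String version) {
        String exact = groupId + ":" + artifactId + ":" + version;
        String anyVersion = groupId + ":" + artifactId + ":*";
        return denied.contains(exact) || denied.contains(anyVersion);
    }
}
```

The lookup itself is the easy part. The concern above is presumably about keeping such a list reliable, which this sketch does not address.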

The usability here is paramount both for the ultimate Drill user and for
the writer of plugins.

On Mon, Jan 17, 2022 at 5:06 AM James Turton <dz...@apache.org> wrote:

> Thank you Ted and Paul for the feedback. Since Java is compiled, Maven
> is probably a better fit than GitHub for distribution? If Drillbits can
> write to their jars/3rdparty directory then I can imagine Drill gaining
> the ability to fetch and install plugins itself without too much
> trouble, at least for Drill clusters with Internet access.
> "Sideloading" by downloading from Maven and copying manually would
> always remain possible.
>
> @Paul I'll try to get a little time with you to get some ideas about
> designing a plugin API.
>
> On 2022/01/14 23:20, Paul Rogers wrote:
> > Hi All,
> >
> > James raises an important issue. I've noticed that it used to be easy to
> > build and test Drill, now it is a struggle, because of the many odd
> > external dependencies we have introduced. That acts as a big damper on
> > contributions: none of us get paid enough to spend more time fighting
> > builds than developing the code...
> >
> > Ted is right that we need a good way to install plugins. There are two
> > parts. Ted is talking about the high-level part: make it easy to point to
> > some repo and use the plugin. Since Drill is Java, the Maven repo could be
> > a good mechanism. In-house stuff is often in an internal repo that does
> > whatever Maven needs.
> >
> > The reason that plugins are in the Drill project now is that Drill's "API"
> > is all of Drill. Plugins can (and some do) access all of Drill through the
> > fragment context. The APIs to Calcite and other parts of Drill are wide, and
> > tend to be tightly coupled with Drill internals. By contrast, other tools,
> > such as Presto/Trino, have defined very clean APIs that extensions use.
> > In Druid, everything is integrated via Google Guice and an extension can
> > replace any part of Druid (though, I'm not convinced that's actually a good
> > idea.) I'm sure there are others we can learn from.
> >
> > So, we need to define a plugin API for Drill. I started down that route a
> > while back: the first step was to refactor the plugin registry so it is
> > ready for extensions. The idea was to use the same mechanism for all kinds
> > of extensions (security, UDFs, metastore, etc.) The next step was to build
> > something that roughly followed Presto, but that kind of stalled out.
> >
> > In terms of ordering, we'd first need to define the plugin API. Then, we
> > can shift plugins to use that. Once that is done, we can move plugins to
> > separate projects. (The metastore implementation can also move, if we
> > want.) Finally, figure out a solution for Ted's suggestion to make it easy
> > to grab new extensions. Drill is distributed, so adding a new plugin has to
> > happen on all nodes, which is a bit more complex than the typical
> > Julia/Python/R kind of extension.
> >
> > The reason we're where we're at is that it is the path of least resistance.
> > Creating a good extension mechanism is hard, but valuable, as Ted noted.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> The bigger reason for a separate plug-in world is the enhancement of
> >> community.
> >>
> >> I would recommend looking at the Julia community for examples of
> >> effective ways to drive plug-in structure.
> >>
> >> At the core, for any pure Julia package, you can simply add a package by
> >> referring to the GitHub repository where the package is stored. For
> >> packages that are "registered" (i.e. a path and a checksum are recorded in a
> >> well known data store), you can add a package by simply naming it without
> >> knowing the path.
> >> All such plugins are tested by the authors and the
> >> project records all dependencies with version constraints so that cascading
> >> additions are easy. The community leaders have made tooling available so
> >> that you can test your package against a range of versions of Julia by
> >> pretty simple (to use) GitHub Actions.
> >>
> >> The result has been an absolute explosion in the number of pure Julia
> >> packages.
> >>
> >> For packages that include C or Fortran (or whatever) code, there is some
> >> amazing tooling available that lets you record a build process on any of
> >> the supported platforms (Linux, LinuxArm, 32 or 64 bit, Windows, BSD, OSX
> >> and so on). When you register such a package, it is automagically built on
> >> all the platforms you indicate and the binary results are checked into a
> >> central repository known as Yggdrasil.
> >>
> >> All of these registration events for different packages are recorded in a
> >> central registry as I mentioned. That registry is recorded in GitHub as
> >> well which makes it easy to propagate changes.
> >>
> >> On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:
> >>
> >>> Hello dev community
> >>>
> >>> Discussions about reorganising the Drill source code to better position
> >>> the project to support plug-ins for the "long tail" of weird and
> >>> wonderful systems and data formats have been coming up here and there
> >>> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
> >>>
> >>> A view which I personally share is that adding too large a number and
> >>> variety of plug-ins to the main tree would create a lethal maintenance
> >>> burden for developers working there and lead down a road of accumulating
> >>> technical debt. The Maven tricks we must employ to harmonise the
> >>> growing set of dependencies of the main tree to keep it buildable are
> >>> already enough, as is the size of our distributable and the count of
> >>> open bug reports.
> >>>
> >>> Thus, the idea of splitting out "/contrib" into a new
> >>> apache/drill-contrib repo after selecting a subset of plugins to remain
> >>> in apache/drill. I'll now volunteer a set of criteria to decide whether
> >>> a plug-in should live in this notional apache/drill-contrib.
> >>>
> >>> 1. The plug-in queries an unstructured data format (even if it only
> >>>    reads metadata fields) e.g. Image format plug-in.
> >>> 2. The plug-in queries a data format that was designed for human
> >>>    consumption e.g. Excel format plug-in.
> >>> 3. The plug-in cannot be expected to run with speed and reliability
> >>>    comparable to querying structured data on the local network e.g.
> >>>    Dropbox storage plugin.
> >>> 4. The plug-in queries an obscure system or format e.g. we receive a
> >>>    plug-in for some data format used only on old Cray supercomputers.
> >>> 5. The plug-in can for some reason not be well supported by the Drill
> >>>    devs e.g. it has a JNI dependency on some difficult native libs.
> >>>
> >>> Any one of those suggests that an apache/drill-contrib is the better
> >>> home to me, but what is your view? Would we apply significantly more
> >>> relaxed standards when reviewing PRs to apache/drill-contrib? Would we
> >>> tag, build and test apache/drill-contrib with every release of
> >>> apache/drill, or would it run on its own schedule, perhaps with users
> >>> downloading builds made continuously from snapshots of HEAD?
> >>>
> >>> Regards
> >>> James