Paul, I understood your suggestion. My point is that publishing to Maven Central is a bit of a pain, while publishing by pushing to GitHub is nearly painless. In particular, because GitHub inherently produces a difficult-to-fake hash for each commit, referring to a dependency by that hash is relatively safe, which saves a lot of agony regarding keys and trust.

Further, GitHub or any comparable service provides the same "already exists" benefit as Maven does.
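To make the hash idea concrete, here is a minimal sketch (the "owner/repo@sha" coordinate format and all names are invented, not anything Drill supports today). The point is only that the full commit SHA, rather than a signing key, identifies exactly which content you asked for:

    // Invented sketch: resolve a plugin reference of the form
    // "owner/repo@<full commit sha>" to the GitHub source archive for
    // that exact commit. The SHA pins the content; no key exchange needed.
    import java.net.URI;

    public class GitHubPluginRef {

        public static URI archiveUrl(String ref) {
            String[] parts = ref.split("@", 2);
            if (parts.length != 2 || !parts[1].matches("[0-9a-f]{40}")) {
                throw new IllegalArgumentException(
                    "Expected owner/repo@<40-hex commit sha>: " + ref);
            }
            // GitHub serves a source archive for any commit at this path.
            return URI.create("https://github.com/" + parts[0]
                + "/archive/" + parts[1] + ".zip");
        }

        public static void main(String[] args) {
            System.out.println(archiveUrl(
                "example-org/drill-foo-plugin@0123456789abcdef0123456789abcdef01234567"));
        }
    }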
On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers <par0...@gmail.com> wrote:
> Hi Ted,
>
> Well said. Just to be clear, I wasn't suggesting that we use Maven-the-build-tool to distribute plugins. Rather, I was simply observing that building a global repo is a bit of a project, and asked, "What could we use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever Linux repos? Maybe. Maven's repo? Why not?
>
> The idea would be that Drill might have a tool that says, "install the FooBlaster plugin". It downloads from a repo (Maven Central, say) and puts the plugin in the proper plugins directory. In a cluster, either it does that on every node, or the work is done as part of preparing a Docker container which is then pushed to every node.
>
> The key thought is just to make the problem simpler by avoiding the need to create and maintain a Drill-specific repo when we barely have enough resources to keep Drill itself afloat.
>
> None of this can happen, however, unless we clean up the plugin APIs and ensure plugins can be built outside of the Drill repo. (That means, say, that Drill needs an API library that resides in Maven.)
>
> There are probably many ways this has been done. Anyone know of any good examples we can learn from?
>
> Thanks,
>
> - Paul
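For concreteness, the install step you describe could be as small as the sketch below. The coordinates are placeholders and this is not an existing Drill tool; the jars/3rdparty path comes from James's note further down, and on a cluster this same step would have to run on every node or during the Docker image build:

    // Hypothetical installer sketch (no such Drill tool exists today):
    // resolve Maven coordinates to a Maven Central URL and copy the jar
    // into a local plugins directory.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class PluginInstaller {

        public static Path install(String group, String artifact, String version,
                Path pluginDir) throws Exception {
            String jar = artifact + "-" + version + ".jar";
            URL url = new URL("https://repo1.maven.org/maven2/"
                + group.replace('.', '/') + "/" + artifact + "/" + version + "/" + jar);
            Files.createDirectories(pluginDir);
            Path target = pluginDir.resolve(jar);
            try (InputStream in = url.openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return target;
        }

        public static void main(String[] args) throws Exception {
            // Made-up coordinates, purely for illustration.
            System.out.println(install("org.example", "drill-fooblaster-plugin", "1.0.0",
                Path.of("/opt/drill/jars/3rdparty")));
        }
    }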
> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> I don't think that Maven is a forced move just because Drill is in Java. It may be a good move, but it isn't a foregone conclusion. For one thing, the conventions that Maven uses are pretty hard-wired, and it may be difficult to have a reliable deny-list of known problematic plugins. Publishing to Maven is more of a pain than simply pushing to GitHub.
>>
>> The usability here is paramount, both for the ultimate Drill user and for the writer of plugins.
>>
>> On Mon, Jan 17, 2022 at 5:06 AM James Turton <dz...@apache.org> wrote:
>>
>>> Thank you Ted and Paul for the feedback. Since Java is compiled, Maven is probably a better fit than GitHub for distribution? If Drillbits can write to their jars/3rdparty directory, then I can imagine Drill gaining the ability to fetch and install plugins itself without too much trouble, at least for Drill clusters with Internet access. "Sideloading" by downloading from Maven and copying manually would always remain possible.
>>>
>>> @Paul I'll try to get a little time with you to get some ideas about designing a plugin API.
>>>
>>> On 2022/01/14 23:20, Paul Rogers wrote:
>>> > Hi All,
>>> >
>>> > James raises an important issue. I've noticed that it used to be easy to build and test Drill; now it is a struggle, because of the many odd external dependencies we have introduced. That acts as a big damper on contributions: none of us get paid enough to spend more time fighting builds than developing the code...
>>> >
>>> > Ted is right that we need a good way to install plugins. There are two parts. Ted is talking about the high-level part: make it easy to point to some repo and use the plugin. Since Drill is Java, the Maven repo could be a good mechanism. In-house stuff is often in an internal repo that does whatever Maven needs.
>>> >
>>> > The reason that plugins are in the Drill project now is that Drill's "API" is all of Drill. Plugins can (and some do) access all of Drill through the fragment context. The APIs to Calcite and other parts of Drill are wide, and tend to be tightly coupled with Drill internals. By contrast, other tools, such as Presto/Trino, have defined very clean APIs that extensions use. In Druid, everything is integrated via Google Guice, and an extension can replace any part of Druid (though I'm not convinced that's actually a good idea). I'm sure there are others we can learn from.
>>> >
>>> > So, we need to define a plugin API for Drill. I started down that route a while back: the first step was to refactor the plugin registry so it is ready for extensions. The idea was to use the same mechanism for all kinds of extensions (security, UDFs, metastore, etc.). The next step was to build something that roughly followed Presto, but that kind of stalled out.
>>> >
>>> > In terms of ordering, we'd first need to define the plugin API. Then, we can shift plugins to use that. Once that is done, we can move plugins to separate projects. (The metastore implementation can also move, if we want.) Finally, figure out a solution for Ted's suggestion to make it easy to grab new extensions. Drill is distributed, so adding a new plugin has to happen on all nodes, which is a bit more complex than the typical Julia/Python/R kind of extension.
>>> >
>>> > The reason we're where we're at is that it is the path of least resistance. Creating a good extension mechanism is hard, but valuable, as Ted noted.
>>> >
>>> > Thanks,
>>> >
>>> > - Paul
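As a strawman for the kind of narrow surface Paul describes, loosely in the spirit of the Presto/Trino connector approach (every name below is invented; no such interface exists in Drill today):

    // Strawman only: a deliberately small, invented plugin SPI. A plugin built
    // against an artifact like this would see none of Drill's internals.
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    public interface StoragePluginSpi {

        // Stable identity used in storage configuration.
        String name();

        // Called once with plugin-specific configuration.
        void initialize(Map<String, String> config);

        // Enumerate the tables (or files, topics, ...) this plugin exposes.
        List<String> listTables();

        // Return a row reader for one table; the engine drives the iteration.
        Iterator<Map<String, Object>> scan(String table);
    }

The particular methods don't matter; what matters is that a plugin compiled against a small, published API artifact like this could live outside the main tree and be versioned on its own.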
>>> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> >
>>> >> The bigger reason for a separate plug-in world is the enhancement of community.
>>> >>
>>> >> I would recommend looking at the Julia community for examples of effective ways to drive plug-in structure.
>>> >>
>>> >> At the core, for any pure Julia package, you can simply add a package by referring to the GitHub repository where the package is stored. For packages that are "registered" (i.e. a path and a checksum are recorded in a well-known data store), you can add a package by simply naming it, without knowing the path. All such plugins are tested by the authors, and the project records all dependencies with version constraints so that cascading additions are easy. The community leaders have made tooling available so that you can test your package against a range of versions of Julia via pretty simple (to use) GitHub Actions.
>>> >>
>>> >> The result has been an absolute explosion in the number of pure Julia packages.
>>> >>
>>> >> For packages that include C or Fortran (or whatever) code, there is some amazing tooling available that lets you record a build process on any of the supported platforms (Linux, Linux ARM, 32 or 64 bit, Windows, BSD, OSX and so on). When you register such a package, it is automagically built on all the platforms you indicate and the binary results are checked into a central repository known as Yggdrasil.
>>> >>
>>> >> All of these registration events for different packages are recorded in a central registry, as I mentioned. That registry is kept in GitHub as well, which makes it easy to propagate changes.
>>> >>
>>> >> On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:
>>> >>
>>> >>> Hello dev community
>>> >>>
>>> >>> Discussions about reorganising the Drill source code to better position the project to support plug-ins for the "long tail" of weird and wonderful systems and data formats have been coming up here and there for a few months, e.g. in https://github.com/apache/drill/pull/2359.
>>> >>>
>>> >>> A view which I personally share is that adding too large a number and variety of plug-ins to the main tree would create a lethal maintenance burden for developers working there and lead down a road of accumulating technical debt. The Maven tricks we must employ to harmonise the growing set of dependencies of the main tree to keep it buildable are already enough, as is the size of our distributable and the count of open bug reports.
>>> >>>
>>> >>> Thus the idea of splitting out "/contrib" into a new apache/drill-contrib repo after selecting a subset of plug-ins to remain in apache/drill. I'll now volunteer a set of criteria to decide whether a plug-in should live in this notional apache/drill-contrib.
>>> >>>
>>> >>> 1. The plug-in queries an unstructured data format (even if it only reads metadata fields), e.g. the Image format plug-in.
>>> >>> 2. The plug-in queries a data format that was designed for human consumption, e.g. the Excel format plug-in.
>>> >>> 3. The plug-in cannot be expected to run with speed and reliability comparable to querying structured data on the local network, e.g. the Dropbox storage plug-in.
>>> >>> 4. The plug-in queries an obscure system or format, e.g. we receive a plug-in for some data format used only on old Cray supercomputers.
>>> >>> 5. The plug-in cannot for some reason be well supported by the Drill devs, e.g. it has a JNI dependency on some difficult native libs.
>>> >>>
>>> >>> Any one of those suggests to me that apache/drill-contrib is the better home, but what is your view? Would we apply significantly more relaxed standards when reviewing PRs to apache/drill-contrib? Would we tag, build and test apache/drill-contrib with every release of apache/drill, or would it run on its own schedule, perhaps with users downloading builds made continuously from snapshots of HEAD?
>>> >>>
>>> >>> Regards
>>> >>> James