As the project grows, I think we should definitely make room for the "contrib" concept. For example, PostgreSQL has a contrib package consisting of an assortment of extensions within the PostgreSQL ecosystem [1].
Of course, for any "contrib" component, there should be some basic expectations that enable other developers/maintainers to maintain the components, verify that they still work, and so forth. Having "ossified" components in the project does not necessarily do harm unless they create a lot of regular maintenance burden. Plasma is one example of something that has ended up in the ossified/unmaintained bucket but is not causing much active harm.

The exact structure of where add-on / contrib components go, and how we manage and document their existence in the codebase, can be a subject of iteration based on what seems to make sense in terms of maintainability. For example, if something is written in C++, does it live in the cpp/src tree or in a contrib/$thing subdirectory where you build the add-on separately, assuming you've already built the main project?

If there are components which are causing a burden (e.g. an every-commit CI/CD burden) such that having them in the core monorepo is undesirable, then they could be moved outside. That seems like something to consider on a case-by-case basis, though, so we aren't creating too many avenues for bitrot in the project.

[1]: https://packages.debian.org/jessie/postgresql-contrib-9.4

On Thu, Jul 29, 2021 at 4:01 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> In reviewing the RADOS PR (which I think is very cool) I am running
> into some interesting questions that might be good to flesh out here.
>
> The first of which is related to the scope of the GitHub repo. For
> context the RADOS PR introduces a Ceph object class (a plugin for
> CephFS, a cloud based file system) called Skyhook which is a
> standalone artifact that depends on Arrow and is installed into CephFS
> servers. An argument could be made that such an artifact does not
> belong in the Arrow repo since it could conceivably be hosted in its
> own repository.
>
> On the other hand, the current description for the repo is "Apache
> Arrow is a multi-language toolbox for accelerated data interchange and
> in-memory processing". This extension doesn't necessarily have a home
> elsewhere (i.e. I don't think Ceph hosts object classes) and it is
> needed by the datasets module (the topic of a later email) so I think
> it could be considered a tool. Also, there is some precedent with
> tools like crossbow, plasma or extensions with 3rd party libraries such
> as pandas, orc, etc.
>
> So noodling on this I would think a good starting point for criteria
> to be eligible for the Git repo is:
>
> * It doesn't have a good home elsewhere
> * The authors are willing to have it Apache licensed and be subject
>   to Apache Arrow's ownership
> * There are integration tests ensuring the tool is functioning
> * Someone is maintaining the tool and the integration tests
> * One of:
>   * The tool integrates Arrow with a 3rd party library
>   * The tool is used by Arrow (e.g. crossbow)
>
> If repository size starts to become a problem then it shouldn't be too
> difficult to split the arrow repo (into arrow, arrow-tools, etc) in
> the future.
>
> -Weston