Re: [DISCUSS] What tools go in the Arrow toolbox (i.e. git repo)

Wes McKinney Thu, 29 Jul 2021 14:31:26 -0700

As the project grows bigger and bigger, I think we should definitely
make room for the "contrib" concept to exist. For example, postgresql
has a contrib package consisting of an assortment of extensions within
the postgresql ecosystem [1].

Of course, for any "contrib" component, there should be some basic
expectations of enabling other developers/maintainers to maintain the
components, verify that they still work, and so forth. Having
"ossified" components in the project does not necessarily do harm
unless they create a lot of regular maintenance burden -- Plasma is
one example of something that has ended up in the
ossified/unmaintained bucket without causing much active harm.

The exact structure of where add-on / contrib components go and how we
manage and document their existence in the codebase (e.g. if something
is written in C++, does it live in the cpp/src tree or in a
contrib/$thing subdirectory where you build the add-on separately
assuming you've already built the main project) can be a subject of
iteration based on what seems to make sense in terms of
maintainability.

If there are components which are causing a burden (e.g. an
every-commit CI/CD burden) such that having them in the core monorepo
is undesirable, then they could be moved outside, but that seems like
something to consider on a case-by-case basis so we aren't creating
too many avenues for bitrot to enter the project.

[1]: https://packages.debian.org/jessie/postgresql-contrib-9.4

On Thu, Jul 29, 2021 at 4:01 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> In reviewing the RADOS PR (which I think is very cool) I am running
> into some interesting questions that might be good to flesh out here.
>
> The first of which is related to the scope of the GitHub repo.  For
> context the RADOS PR introduces a Ceph object class (a plugin for
> CephFS, a cloud based file system) called Skyhook which is a
> standalone artifact that depends on Arrow and is installed into CephFS
> servers.  An argument could be made that such an artifact does not
> belong in the Arrow repo since it could conceivably be hosted in its
> own repository.
>
> On the other hand, the current description for the repo is "Apache
> Arrow is a multi-language toolbox for accelerated data interchange and
> in-memory processing".  This extension doesn't necessarily have a home
> elsewhere (i.e. I don't think Ceph hosts object classes) and it is
> needed by the datasets module (the topic of a later email) so I think
> it could be considered a tool.  Also, there is some precedent with
> tools like crossbow, plasma or extensions with 3rd party libraries such
> as pandas, orc, etc.
>
> So noodling on this I would think a good starting point for criteria
> to be eligible for the Git repo is:
>
>  * It doesn't have a good home elsewhere
>  * The authors are willing to have it Apache licensed and be subject
> to Apache Arrow's ownership
>  * There are integration tests ensuring the tool is functioning
>  * Someone is maintaining the tool and the integration tests
>  * One of:
>     * The tool integrates Arrow with a 3rd party library
>     * The tool is used by Arrow (e.g. crossbow)
>
> If repository size starts to become a problem then it shouldn't be too
> difficult to split the arrow repo (into arrow, arrow-tools, etc) in
> the future.
>
> -Weston
