In reviewing the RADOS PR (which I think is very cool) I am running
into some interesting questions that might be good to flesh out here.

The first is related to the scope of the GitHub repo.  For
context, the RADOS PR introduces a Ceph object class (a plugin for
CephFS, a cloud based file system) called Skyhook which is a
standalone artifact that depends on Arrow and is installed into CephFS
servers.  An argument could be made that such an artifact does not
belong in the Arrow repo since it could conceivably be hosted in its
own repository.

On the other hand, the current description for the repo is "Apache
Arrow is a multi-language toolbox for accelerated data interchange and
in-memory processing".  This extension doesn't necessarily have a home
elsewhere (i.e. I don't think Ceph hosts object classes) and it is
needed by the datasets module (the topic of a later email) so I think
it could be considered a tool.  Also, there is some precedent with
tools like crossbow and plasma, or extensions for 3rd party libraries
such as pandas, orc, etc.

So, noodling on this, I think a good starting point for the criteria
for inclusion in the Git repo is:

 * It doesn't have a good home elsewhere
 * The authors are willing to have it Apache licensed and be subject
   to Apache Arrow's ownership
 * There are integration tests ensuring the tool is functioning
 * Someone is maintaining the tool and the integration tests
 * One of:
    * The tool integrates Arrow with a 3rd party library
    * The tool is used by Arrow (e.g. crossbow)

If repository size starts to become a problem then it shouldn't be too
difficult to split the arrow repo (into arrow, arrow-tools, etc) in
the future.

-Weston
