In reviewing the RADOS PR (which I think is very cool) I am running into some interesting questions that might be good to flesh out here.
The first is related to the scope of the GitHub repo. For context, the RADOS PR introduces a Ceph object class (a plugin for CephFS, a cloud-based file system) called Skyhook, which is a standalone artifact that depends on Arrow and is installed into CephFS servers. An argument could be made that such an artifact does not belong in the Arrow repo since it could conceivably be hosted in its own repository.

On the other hand, the current description for the repo is "Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing". This extension doesn't necessarily have a home elsewhere (i.e. I don't think Ceph hosts object classes) and it is needed by the datasets module (the topic of a later email), so I think it could be considered a tool. Also, there is some precedent with tools like crossbow and plasma, and with extensions for 3rd party libraries such as pandas, orc, etc.

So, noodling on this, I would think a good starting point for criteria to be eligible for the Git repo is:

* It doesn't have a good home elsewhere
* The authors are willing to have it Apache licensed and be subject to Apache Arrow's ownership
* There are integration tests ensuring the tool is functioning
* Someone is maintaining the tool and the integration tests
* One of:
  * The tool integrates Arrow with a 3rd party library
  * The tool is used by Arrow (e.g. crossbow)

If repository size starts to become a problem, then it shouldn't be too difficult to split the arrow repo (into arrow, arrow-tools, etc.) in the future.

-Weston