On 29/07/2021 at 23:01, Weston Pace wrote:
In reviewing the RADOS PR (which I think is very cool) I am running into some interesting questions that might be good to flesh out here. The first of which is related to the scope of the GitHub repo.

For context the RADOS PR introduces a Ceph object class (a plugin for CephFS, a cloud-based file system) called Skyhook which is a standalone artifact that depends on Arrow and is installed into CephFS servers. An argument could be made that such an artifact does not belong in the Arrow repo since it could conceivably be hosted in its own repository.

On the other hand, the current description for the repo is "Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing". This extension doesn't necessarily have a home elsewhere (i.e. I don't think Ceph hosts object classes) and it is needed by the datasets module (the topic of a later email) so I think it could be considered a tool. Also, there is some precedent with tools like crossbow, plasma or extensions with 3rd party libraries such as pandas, orc, etc.

So noodling on this I would think a good starting point for criteria to be eligible for the Git repo is:

* It doesn't have a good home elsewhere
* The authors are willing to have it Apache licensed and be subject to Apache Arrow's ownership
* There are integration tests ensuring the tool is functioning
* Someone is maintaining the tool and the integration tests
* One of:
  * The tool integrates Arrow with a 3rd party library
  * The tool is used by Arrow (e.g. crossbow)
I think these criteria must also include an evaluation of the maintenance and packaging burden, and a real commitment from the original authors to participate in in-tree maintenance (with an emphasis on *in-tree*, because in other contexts I've seen people integrate a sizable contribution into a large open source project, only to continue maintaining it in the original repo and disregard the social dynamics of the "large open source project").
Intuitively, I'd say adding a Ceph integration layer in the Arrow repo pushes maintenance and expertise requirements beyond the capabilities of our current team. But I may be mistaken.
Regards,

Antoine.