I'd like to add some other criteria here (I agree with the ones already
proposed):

- Large net-new contributions are discussed first (if the authors aren't
familiar with the mailing list, concurrently with the PR), and someone
has to be willing to review them.
- The original authors are willing to split contributions into digestible
reviews.

I think we also need to be willing to delete code if maintenance or other
issues are raised and no maintainers are willing to address them on a
reasonable timeline.

In the past, I've also been skeptical of large one-off contributions, but
I'm starting to lean towards a policy of being more open and willing to
delete, as long as someone is willing to review the code.

-Micah

On Friday, July 30, 2021, Wes McKinney <wesmck...@gmail.com> wrote:

> On Fri, Jul 30, 2021 at 4:58 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Le 29/07/2021 à 23:01, Weston Pace a écrit :
> > > In reviewing the RADOS PR (which I think is very cool) I am running
> > > into some interesting questions that might be good to flesh out here.
> > >
> > > The first of which is related to the scope of the Github repo.  For
> > > context the RADOS PR introduces a Ceph object class (a plugin for
> > > CephFS, a cloud based file system) called Skyhook which is a
> > > standalone artifact that depends on Arrow and is installed into CephFS
> > > servers.  An argument could be made that such an artifact does not
> > > belong in the Arrow repo since it could conceivably be hosted in its
> > > own repository.
> > >
> > > On the other hand, the current description for the repo is "Apache
> > > Arrow is a multi-language toolbox for accelerated data interchange and
> > > in-memory processing".  This extension doesn't necessarily have a home
> > > elsewhere (i.e. I don't think Ceph hosts object classes) and it is
> > > needed by the datasets module (the topic of a later email) so I think
> > > it could be considered a tool.  Also, there is some precedent with
> > > > tools like crossbow, plasma or extensions with 3rd party libraries such
> > > as pandas, orc, etc.
> > >
> > > So noodling on this I would think a good starting point for criteria
> > > to be eligible for the Git repo is:
> > >
> > >   * It doesn't have a good home elsewhere
> > >   * The authors are willing to have it Apache licensed and be subject
> > > to Apache Arrow's ownership
> > >   * There are integration tests ensuring the tool is functioning
> > >   * Someone is maintaining the tool and the integration tests
> > >   * One of:
> > >      * The tool integrates Arrow with a 3rd party library
> > >      * The tool is used by Arrow (e.g. crossbow)
> >
> > I think these criteria must also include an evaluation of the
> > maintenance and packaging burden, and a real commitment from the
> > original authors to participate in in-tree maintenance (with an emphasis
> > on *in-tree*, because in other contexts I've seen people integrate a
> > sizable contribution into a large open source project, only to continue
> > maintaining it in the original repo and disregard the social dynamics of
> > the "large open source project").
>
> I do agree with this. We don't want people to "throw code over the
> wall". I see this as a strategy for community development / growth.
>
> > Intuitively, I'd say adding a Ceph integration layer in the Arrow repo
> > pushes maintenance and expertise requirements beyond the capabilities of
> > our current team.  But I may be mistaken.
>
> You could be right, but it can also serve as a test-run for these
> principles. If it doesn't work out, or it's causing issues, we aren't
> committing ourselves to maintaining it without the involvement of the
> new code authors.
>
> > Regards
> >
> > Antoine.
>
