Hi Remi,

It's no problem; it's a common question we get. Some developers
believe as a matter of principle that large projects should be broken
up into many smaller repositories.

Arrow is different from many open source projects. Maintaining
protocol-level interoperability (although note that Rust does not yet
participate in the integration tests) has been a great deal of effort,
and the community has felt that trying to coordinate changes that
impact interoperability is substantially simpler in a monorepo
arrangement on GitHub: we always know with relative certainty
whether any pull request may break interoperability between one
component and another. It's very easy to get into a situation where
you have a mess of cross-repository (or even circular) build and
runtime dependencies -- the monorepo makes all of this pain go away.
If you have a change that affects multiple repositories, CI tools
don't make it easy to test those PRs together; generally you'll just
see that a PR on one repo is breaking against the master of the other
repository.

In some cases, components may not have integrations with other
languages but that may not always be the case in the future. We have
just developed the C interface, for example, which would enable
DataFusion to be built as a shared library and imported in Python (if
someone wanted to do that).

Another dimension is that all of the languages and components have benefited
greatly from the community's investment in CI and packaging
infrastructure.

I also believe that the project's common PR queue helps create a sense
of community awareness and solidarity amongst the project's contributors.
If Rust were working off in their own corner of GitHub, I think it
would be easy for people who are not working on Rust to ignore them. I
think the net result of the way that we currently operate is that
we're producing higher quality software and have a healthier community
than we would otherwise with a more fragmented approach.

Lastly, the shared release cycle creates social pressure to get
patches finished and merged. Anecdotally, this seems to be effective.

On the governance questions, see the roles section on
https://www.apache.org/foundation/how-it-works.html#roles

If a part of apache/arrow truly believed that they were being hindered
by being part of the monorepo, we could create a new repository under
apache/ on GitHub for the part that wants to split into a standalone
GitHub repository. That wouldn't change the governance of that code.

- Wes

On Tue, Apr 14, 2020 at 1:26 PM Rémi Dettai <rdet...@gmail.com> wrote:
>
> This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451.
>
> First thanks for your answer!
>
> It's true that I was also surprised to see all implementations of Arrow
> mixed up in a single repository!
>
> I was really considering the separation of the repositories as a mean to
> separate concerns. I am not 100% sure to understand how it would fragment
> the community but I think I get the point, even though I still believe that
> it is at the cost of extra complexity.
>
> As for the legal protection, I did not take that aspect into consideration,
> and I find it very interesting! What is the PMC exactly and why would
> Datafusion be more exposed in a separate repository?
