Hi Remi,

It's no problem; this is a common question. Some developers believe as a matter of principle that large projects should be broken up into many smaller repositories.
Arrow is different from many open source projects. Maintaining protocol-level interoperability (although note that Rust does not yet participate in the integration tests) has been a great deal of effort, and the community has felt that coordinating changes that impact interoperability is substantially simpler in a monorepo arrangement on GitHub. With a monorepo, we always know with relative certainty whether a pull request may break interoperability between one component and another.

With separate repositories, it's very easy to end up with a mess of cross-repository (or even circular) build and runtime dependencies. The monorepo makes all of this pain go away. If you have a change that affects multiple repositories, CI tools don't make it easy to test those PRs together. Generally you'll just see that a PR on one repo is breaking against the master of the other repository.

Some components do not currently have integrations with other languages, but that may change in the future. We have just developed the C interface, for example, which would enable DataFusion to be built as a shared library and imported in Python (if someone wanted to do that).

Another dimension is that all of the languages and components have benefited greatly from the community's investment in CI and packaging infrastructure.

I also believe that the project's common PR queue helps create a sense of community awareness and solidarity among the project's contributors. If the Rust developers were working off in their own corner of GitHub, I think it would be easy for people who are not working on Rust to ignore them. I think the net result of the way we currently operate is that we're producing higher quality software and have a healthier community than we would with a more fragmented approach.

Lastly, the shared release cycle creates social pressure to get patches finished and merged. Anecdotally this seems to be effective.
On the governance questions, see the roles section on
https://www.apache.org/foundation/how-it-works.html#roles

If a part of apache/arrow truly believed that it was being hindered by being part of the monorepo, we could create a new repository under apache/ on GitHub for the part that wants to split into a standalone GitHub repository. That wouldn't change the governance of that code.

- Wes

On Tue, Apr 14, 2020 at 1:26 PM Rémi Dettai <rdet...@gmail.com> wrote:
>
> This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451.
>
> First thanks for your answer!
>
> It's true that I was also surprised to see all implementations of Arrow
> mixed up in a single repository!
>
> I was really considering the separation of the repositories as a mean to
> separate concerns. I am not 100% sure to understand how it would fragment
> the community but I think I get the point, even though I still believe that
> it is at the cost of extra complexity.
>
> As for the legal protection, I did not take that aspect into consideration,
> and I find it very interesting! What is the PMC exactly and why would
> Datafusion be more exposed in a separate repository?