Re: Follow up on ARROW-8451, datafusion part of Arrow

2020-04-15 Thread Rémi Dettai
Hi Wes!

Thanks for your reply, it is all much clearer now. I guess it is just a
question of getting used to it :-)

Remi



Re: Follow up on ARROW-8451, datafusion part of Arrow

2020-04-14 Thread Wes McKinney
hi Remi,

It's no problem, it's a common question we get. Some developers
believe as a matter of principle that large projects should be broken
up into many smaller repositories.

Arrow is different from many open source projects. Maintaining
protocol-level interoperability (although note that Rust does not yet
participate in the integration tests) has been a great deal of effort,
and the community has felt that trying to coordinate changes that
impact interoperability is substantially simpler in a monorepo
arrangement on GitHub. That way we always know with relative certainty
whether any pull request may break interoperability between one
component and another. It's very easy to get into a situation where
you have a mess of cross-repository (or even circular) build and
runtime dependencies -- the monorepo makes all of this pain go away.
If you have a change that affects multiple repositories, CI tools
don't make it easy to test those PRs together; generally you'll just
see that a PR on one repo is breaking against the master of the other
repository.

In some cases, components may not have integrations with other
languages but that may not always be the case in the future. We have
just developed the C interface, for example, which would enable
DataFusion to be built as a shared library and imported in Python (if
someone wanted to do that).

Another dimension is that all of the languages and components have benefited
greatly from the community's investment in CI and packaging
infrastructure.

I also believe that the project's common PR queue helps create a sense
of community awareness and solidarity amongst the project's contributors.
If Rust were working off in their own corner of GitHub, I think it
would be easy for people who are not working on Rust to ignore them. I
think the net result of the way that we currently operate is that
we're producing higher quality software and have a healthier community
than we would otherwise with a more fragmented approach.

Lastly, the shared release cycle creates social pressure to get
patches finished and merged. Anecdotally this seems to be effective.

On the governance questions, see the roles section on
https://www.apache.org/foundation/how-it-works.html#roles

If a part of apache/arrow truly believed that they were being hindered
by being part of the monorepo, we could create a new repository under
apache/ on GitHub for the part that wants to split into a standalone
GitHub repository. That wouldn't change the governance of that code.

- Wes



Follow up on ARROW-8451, datafusion part of Arrow

2020-04-14 Thread Rémi Dettai
This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451.

First, thanks for your answer!

It's true that I was also surprised to see all implementations of Arrow
mixed up in a single repository!

I was really considering the separation of the repositories as a means to
separate concerns. I am not 100% sure I understand how it would fragment
the community but I think I get the point, even though I still believe that
it is at the cost of extra complexity.

As for the legal protection, I did not take that aspect into consideration,
and I find it very interesting! What is the PMC exactly and why would
DataFusion be more exposed in a separate repository?