Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

Jorge Cardoso Leitão Wed, 10 Mar 2021 14:13:43 -0800

Hi,

Wes, thanks a lot for your reply. Let me try to answer:


> 1. If the purpose of Ballista is to support multiple language
> executors, what does segregating it from the other PL's (where
> executors are being developed, too) serve to facilitate this goal?
>

It facilitates this goal because the stronger the coupling, the more
entropic the setup, and the more energy is required to develop and
maintain it. In this particular case, I imagine that each executor would
depend on specific versions of each implementation, just like any other
dependent that is not maintained by Apache Arrow does.

Or is the idea that every dependent should be on the mono-repo? If we need
to control our dependents like that, that usually indicates that we can't
guarantee a stable API (which IMO is the root cause).

> 2. Use of the monorepo does not require a synchronized release cycle,
> just as Rust does not require it now either. The only reason there
> have not been independent Rust releases is because someone has not
> volunteered to do it. Likewise, if DataFusion and Ballista are in the
> same git repository, they don't have to release at the same time as
> the core arrow / parquet crates.
>

I thought that Rust releases needed to be synchronized with the major
releases of the repo. Is this no longer the case?

> 3. On an incremental basis, I do not believe the increased complexity
> is significant. A multi-repository setup can be actively worse when
> development work involves both repositories at the same time. This can
> be mitigated by pinning the arrow / parquet crates as you point out,
> but that creates other issues.

Could you enumerate parts of DataFusion or Ballista that would require
work on Arrow at the same time? I proposed that division because I am
reasonably confident they will not need to be developed at the same time.
I am confident of this because a) the APIs used by DataFusion are written
to minimize public surfaces, so that arrow can mutate without affecting
those APIs; b) I designed and implemented most of the DataFusion code
around built-in functions, aggregate functions, UDFs and UDAFs.

But maybe we can validate this here: Andy, during the development of
Ballista, the project for which the largest changes to the Arrow repo were
needed, did you have to change anything in the arrow crate or the parquet
crate, or was everything done in DataFusion? If yes to any, was there a
significant burden in doing so?

> 4. Even without Jira, there is still the expectation for contributors
> to communicate in a way that is compatible with the Apache Way. So
> even without Jira, PMCs have an obligation to establish an alternative
> structure to have consistently open dialogue / planning about what
> people are working on or planning to work on in the future. If
> contributors are extensively discussing / planning privately, these
> discussions must be moved into the open, whether with design documents
> or issues or e-mail discussions. This was discussed ad nauseam in the
> other thread so I won't rehash those arguments.
>

I fully agree, even though I think it is a bit difficult to operationalize.
So let me put it this way: would you consider, under the definition used
above, discussions happening on GitHub PRs and issues, such as what Airflow
does <https://github.com/apache/airflow/issues>, as open?

> Aside from these issues, the biggest lost opportunity I see if
> DF/Ballista are "cast away" as it were, is that it becomes unattractive
> for the rest of us to build anything on top of these platforms
> (because at that point we have a circular dependency, which is the
> hellscape we escaped from with Parquet C++). I used the
> datafusion-python project as an example — if that were in the Arrow
> project I might consider using it in various ways or contribute to it,
> but as an external project it's less interesting to me as something to
> build on.
>

My feelings about transferring datafusion-python to arrow are the same as
above: I see picking something that is well encapsulated and decoupled
from the rest and blending it into something large and less decoupled as
an entropy-generating activity, which requires more energy to maintain.
Operationally, the way I would merge a project like datafusion-python into
Apache would be by transferring ownership of the repo on GitHub,
transferring ownership of the PyPI project, and creating some secrets on
GitHub to keep twine working, just as I mentioned for Ballista. If people
lose interest in the project, then deprecating it would be trivial
(archive the repo). If people gain interest in it, growth is also trivial
(there is already a house in place and the goals are well defined). The
interfaces are the API contracts declared as pinned dependencies (in
Cargo.toml / setup.py).
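
As a rough sketch of what such a pin could look like in a dependent
crate's Cargo.toml (the revision below is a placeholder rather than a real
commit, and the exact layout is an assumption for illustration only):

```toml
# Hypothetical fragment for a crate that depends on the arrow and parquet
# crates at one specific commit of the Arrow repo.
# "<commit-sha>" is a placeholder, not a real revision.
[dependencies]
arrow = { git = "https://github.com/apache/arrow", rev = "<commit-sha>" }
parquet = { git = "https://github.com/apache/arrow", rev = "<commit-sha>" }
```

With a pin like this, picking up a newer arrow becomes an explicit change
to the `rev` value in the dependent repo.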

Best,
Jorge




On Wed, Mar 10, 2021 at 7:50 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Jorge,
>
> I have some thoughts / questions on your arguments against use of the
> monorepo:
>
> 1. If the purpose of Ballista is to support multiple language
> executors, what does segregating it from the other PL's (where
> executors are being developed, too) serve to facilitate this goal?
>
> 2. Use of the monorepo does not require a synchronized release cycle,
> just as Rust does not require it now either. The only reason there
> have not been independent Rust releases is because someone has not
> volunteered to do it. Likewise, if DataFusion and Ballista are in the
> same git repository, they don't have to release at the same time as
> the core arrow / parquet crates.
>
> 3. On an incremental basis, I do not believe the increased complexity
> is significant. A multi-repository setup can be actively worse when
> development work involves both repositories at the same time. This can
> be mitigated by pinning the arrow / parquet crates as you point out,
> but that creates other issues.
>
> 4. Even without Jira, there is still the expectation for contributors
> to communicate in a way that is compatible with the Apache Way. So
> even without Jira, PMCs have an obligation to establish an alternative
> structure to have consistently open dialogue / planning about what
> people are working on or planning to work on in the future. If
> contributors are extensively discussing / planning privately, these
> discussions must be moved into the open, whether with design documents
> or issues or e-mail discussions. This was discussed ad nauseam in the
> other thread so I won't rehash those arguments.
>
> Aside from these issues, the biggest lost opportunity I see if
> DF/Ballista are "cast away" as it were, is that it becomes unattractive
> for the rest of us to build anything on top of these platforms
> (because at that point we have a circular dependency, which is the
> hellscape we escaped from with Parquet C++). I used the
> datafusion-python project as an example — if that were in the Arrow
> project I might consider using it in various ways or contribute to it,
> but as an external project it's less interesting to me as something to
> build on.
>
> On Wed, Mar 10, 2021 at 12:13 PM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
> >
> > Hi,
> >
> > First of all, I want to thank you very much for your work on Ballista and
> > for doing it in an open source environment. It is something that should be
> > emphasised and celebrated.
> >
> > Secondly, with respect to considering donating it to the Apache Foundation
> > and the Apache Arrow project in particular, I would say that we should be
> > honored by such consideration. In this context, my immediate reaction is:
> > how can we best support Ballista's community?
> >
> > My initial thoughts in this direction are:
> >
> > * create a new git repo for DataFusion and Ballista to reside in (e.g.
> > arrow/ballista)
> > * do not require the release cycle and versioning to be aligned with
> > arrow's release cycle
> > * do not require the usage of JIRA
> > * pin the dependency of DataFusion on the arrow and parquet crates (e.g. to
> > a specific commit)
> >
> > I feel that this setup would keep Ballista under the Foundation and Apache
> > Arrow's umbrella and aligned with its goals, while at the same time putting
> > the least amount of burden on its community in terms of keeping a strict
> > release schedule, tooling, and CI.
> >
> > The rationale for the above is that whenever something is released on
> > DataFusion (which hosts most of the physical ops), people will also want it
> > quickly available on Ballista. Thus, having the two release cycles more
> > closely related and independent of the arrow implementation's cycle is
> > good. DataFusion does not have integration tests against other arrow
> > implementations, and thus the integration tests are not relevant.
> >
> > There are 4 main reasons I would not recommend placing it in the
> > mono-repo:
> >
> > 1. It would not add much
> > 2. It would place Ballista on the same release schedule and git system as
> > the rest of Arrow's implementation, which may not suit Ballista's own
> > development pace (in either direction)
> > 3. It further increases the complexity of the current repo
> > 4. It would force its community to use JIRA, the merge process,
> > components, etc., which may not be what its community wishes for
> >
> > The main risk I see is that because arrow's release cycle is slow and major
> > releases only, DataFusion risks missing arrow features from time to time.
> > We can mitigate this with cargo and pins to commit hashes. IMO this risk
> > exists in any dependency relationship and is usually a sign that there is
> > an API contract and thus a trust relationship involved, which is a good
> > thing.
> >
> > Best,
> > Jorge
> >
> > On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > > As many of you know, the reason that I got involved in Arrow back in 2018
> > > was that I wanted to build a distributed compute platform in Rust, with
> > > capabilities similar to Apache Spark. This led to the creation of the
> > > DataFusion query engine, which is an in-memory query engine and is now
> > > part of the Arrow repo.
> > >
> > > Over the past couple of years, I have been working outside of Arrow on a
> > > project named “Ballista” [1] to continue the journey of trying to build a
> > > distributed version. Due to the pandemic, I have had time over the winter
> > > to put more effort into this project and have managed to build a small
> > > community around it over the past few months. The project has now
> > > reached a point where the basic architecture has been proven and it is
> > > getting a lot of attention (more than 2k stars on GitHub just recently).
> > > I think that it would now make sense to donate some or all of the project
> > > to Apache Arrow and continue its growth here.
> > >
> > > For an overview of the project, please see the talk I recently gave at
> > > the New York Open Statistical Programming Meetup [2].
> > >
> > > Some of the benefits that I see in donating the project to Arrow are:
> > >
> > >
> > >    - DataFusion also needs a scheduler and it would probably make sense to
> > >      push some parts of the Ballista scheduler down a level in the stack so
> > >      that the same approach is used to scale across cores in DataFusion and
> > >      to scale across nodes in Ballista.
> > >    - Ballista provides preliminary support for spill-to-disk functionality
> > >      (in Arrow IPC format) which could also benefit DataFusion and provide
> > >      better scalability through out-of-core processing.
> > >    - Although the Ballista scheduler is implemented in Rust, it is possible
> > >      to implement executors in other languages due to the use of Flight,
> > >      gRPC, and protobuf, so this may be of interest to other language
> > >      implementations of Arrow as well.
> > >    - There is already some overlap between Arrow and Ballista contributors.
> > >    - Ballista unit tests will be part of Arrow CI which means that any
> > >      changes to Arrow or DataFusion APIs that Ballista depends on will also
> > >      require that the corresponding Ballista code is updated as part of the
> > >      same PR.
> > >
> > >
> > > My main goal with this email thread is to gauge interest in donating this
> > > code. If there is interest in doing so then we can have a more detailed
> > > follow-up conversation on exactly what would be donated and where it
> > > would go.
> > >
> > >
> > > I have also filed a GitHub issue in Ballista to get feedback from
> > > current contributors [3].
> > >
> > >
> > > I'm looking forward to hearing opinions on this!
> > >
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > [1] https://github.com/ballista-compute/ballista
> > >
> > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> > >
> > > [3] https://github.com/ballista-compute/ballista/issues/646
> > >
>
