I think having Ballista in Arrow sounds like a good idea in the short
term.  It sounds like there is enough developer pain that bringing it here
makes sense (provided existing Ballista contributors are happy with the
change and current Rust maintainers are open to the work involved).

One longer term concern is CI.  Setting up a good system for distributed
testing requires a lot of investment and compute resources, but I think we
can figure it out when it comes time.  In the short term it seems a
mono-repo reduces the engineering effort to get a sane CI system working.

As a point of reference Flink, Beam and Spark all seem to use mono-repos
(their goals are likely a little different than Arrow's though).

-Micah

P.S.  I do think the tooling/workflow conversation should be discussed more,
but I think it is important to have a more concrete proposal that starts
from requirements and nice-to-haves and then gets to a proposed solution
(i.e. pointing out pain points and problems is useful, but I
think it ignores some of the current value the existing process provides).

On Wed, Mar 10, 2021 at 5:13 PM Andy Grove <andygrov...@gmail.com> wrote:

> Thanks for the feedback so far on this proposal. I really appreciate
> everyone taking the time to put so much thought (and passion!) into this.
>
> So far, I don't think anyone is opposed to the idea of donating Ballista
> but there are clearly concerns about an increased burden on current
> maintainers.
>
> We have also restarted discussions around tooling and release processes.
> It seems that there is no objection to Rust / DataFusion / Ballista
> having more control over the release process, but we have to put in the work
> to make that happen. I am certainly motivated to help with this, but I think
> that is a separate conversation from donating Ballista.
>
> To reduce the burden on existing maintainers, we could consider initially
> adding Ballista in such a way that it doesn't slow down momentum on Arrow &
> DataFusion by adding it as a separate Rust subproject that is not part of
> the Rust workspace, and have it depend on pinned commits initially. This
> would be a lightweight way of incubating the project within the mono-repo
> and at some point, we can add it to the main workspace. This would be no
> worse than the current situation, and it would be better because it is at
> least under Arrow governance.
>
> I would like to talk a bit more specifically about the donation at this
> point now that there is some feedback.
>
> What I propose we donate from Ballista is:
>
>    - The ballista.proto file that defines an encoding for logical and
>      physical query plans as well as cluster meta-data (this protobuf file
>      could eventually be split into separate files for each area)
>    - The Rust source code, which consists of these main areas:
>       - serde code for translating between protobuf and
>         Arrow/DataFusion/Ballista data structures
>       - Distributed query planner
>       - Scheduler process that coordinates query execution across available
>         executors
>       - Executor process that implements Flight protocol and executes query
>         partitions and serializes results in Arrow IPC format
>
> I am proposing that we specifically exclude the following parts of the
> Ballista repo from the donation:
>
>    - The work-in-progress JDBC driver which is not currently functional
>    - The Spark benchmark code that I have been using for comparing
>      performance
>    - The Python bindings, which as far as I know are pretty much a fork of
>      Jorge's datafusion-python project.
>
> I think it is also worth mentioning that Ballista is currently only ~8k
> lines of code, which is pretty small in contrast to the >100k lines of code
> in the Arrow Rust project currently.
>
> Let's keep the conversation going and see what other feedback there is
> regarding the merits of donating Ballista, or not.
>
> Thanks,
>
> Andy.
>
> On Wed, Mar 10, 2021 at 3:13 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Wes, thanks a lot for your reply. Let me try to answer:
> >
> > > 1. If the purpose of Ballista is to support multiple language
> > > executors, what does segregating it from the other PL's (where
> > > executors are being developed, too) serve to facilitate this goal?
> > >
> >
> > It facilitates this because the stronger the coupling is, the more entropic
> > the setup is, and the more energy is required to develop and maintain it.
> > In this particular case, I imagine that each executor would depend on
> > specific versions of each implementation, just like any other dependent
> > that is not maintained by Apache Arrow does.
> >
> > Or is the idea that every dependent should be on the mono-repo? If we need
> > to control our dependents like that, that usually indicates that we can't
> > guarantee a stable API (which IMO is the root cause).
> >
> > > 2. Use of the monorepo does not require a synchronized release cycle,
> > > just as Rust does not require it now either. The only reason there
> > > have not been independent Rust releases is because someone has not
> > > volunteered to do it. Likewise, if DataFusion and Ballista are in the
> > > same git repository, they don't have to release at the same time as
> > > the core arrow / parquet crates.
> > >
> >
> > I thought that Rust needed to be synchronized with the major release of
> > the repo. Isn't this the case anymore?
> >
> > > 3. On an incremental basis, I do not believe the increased complexity
> > > is significant. A multi-repository setup can be actively worse when
> > > development work involves both repositories at the same time. This can
> > > be mitigated by pinning the arrow / parquet crates as you point out,
> > > but that creates other issues.
> >
> > Could you enumerate parts from DataFusion or Ballista that would require
> > work on Arrow at the same time? I proposed that division because I am
> > reasonably confident they will not need to be developed at the same time. I
> > am confident of this because a) the APIs used by DataFusion are written to
> > minimize public surfaces, so that arrow can mutate without affecting those
> > APIs; b) I designed and implemented most of the DataFusion code around
> > built-in functions, aggregate functions, UDFs and UDAFs.
> >
> > But maybe we can validate this here: Andy, during the development of
> > Ballista, on which the largest changes on Arrow repo were needed, did you
> > have to change anything on the Arrow crate or parquet crate, or was
> > everything done on DataFusion? If yes to any, was there a significant
> > burden in doing so?
> >
> > > 4. Even without Jira, there is still the expectation for contributors
> > > to communicate in a way that is compatible with the Apache Way. So
> > > even without Jira, PMCs have an obligation to establish an alternative
> > > structure to have consistently open dialogue / planning about what
> > > people are working on or planning to work on in the future. If
> > > contributors are extensively discussing / planning privately, these
> > > discussions must be moved into the open, whether with design documents
> > > or issues or e-mail discussions. This was discussed ad nauseam in the
> > > other thread so I won't rehash those arguments.
> > >
> >
> > I fully agree, even though I think it is a bit difficult to operationalize.
> > Thus, let's try this: would you consider, under the definition used
> > above, discussions happening on GitHub PRs and issues, such as what Airflow
> > does <https://github.com/apache/airflow/issues> , as open?
> >
> > > Aside from these issues, the biggest lost opportunity I see if
> > > DF/Ballista is "cast away" as it were, is that it becomes unattractive
> > > for the rest of us to build anything on top of these platforms
> > > (because at that point we have a circular dependency, which is the
> > > hellscape we escaped from with Parquet C++). I used the
> > > datafusion-python project as an example — if that were in the Arrow
> > > project I might consider using it in various ways or contribute to it,
> > > but as an external project it's less interesting to me as something to
> > > build on.
> > >
> >
> > My feelings about transferring datafusion-python to Arrow are shared above:
> > I see picking something that is well encapsulated and decoupled from the
> > rest and blending it into something large and less decoupled as an
> > entropy-generating activity, which requires more energy to maintain.
> > Operationally, the way I would merge a project like datafusion-python into
> > Apache would be by transferring ownership of the repo on GitHub,
> > transferring ownership of the PyPI project, and creating some secrets on
> > GitHub to keep twine working, just like I mentioned for Ballista. If people
> > lose interest in the project, then deprecating it would be trivial (archive
> > the repo). If people gain interest in it, growth is also trivial (there is
> > already a house in place and the goals are well defined). The interfaces
> > are the API contracts declared as pinned dependencies (in Cargo.toml /
> > setup.py).
> >
> > Best,
> > Jorge
> >
> >
> >
> >
> > On Wed, Mar 10, 2021 at 7:50 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi Jorge,
> > >
> > > I have some thoughts / questions on your arguments against use of the
> > > monorepo:
> > >
> > > 1. If the purpose of Ballista is to support multiple language
> > > executors, what does segregating it from the other PL's (where
> > > executors are being developed, too) serve to facilitate this goal?
> > >
> > > 2. Use of the monorepo does not require a synchronized release cycle,
> > > just as Rust does not require it now either. The only reason there
> > > have not been independent Rust releases is because someone has not
> > > volunteered to do it. Likewise, if DataFusion and Ballista are in the
> > > same git repository, they don't have to release at the same time as
> > > the core arrow / parquet crates.
> > >
> > > 3. On an incremental basis, I do not believe the increased complexity
> > > is significant. A multi-repository setup can be actively worse when
> > > development work involves both repositories at the same time. This can
> > > be mitigated by pinning the arrow / parquet crates as you point out,
> > > but that creates other issues.
> > >
> > > 4. Even without Jira, there is still the expectation for contributors
> > > to communicate in a way that is compatible with the Apache Way. So
> > > even without Jira, PMCs have an obligation to establish an alternative
> > > structure to have consistently open dialogue / planning about what
> > > people are working on or planning to work on in the future. If
> > > contributors are extensively discussing / planning privately, these
> > > discussions must be moved into the open, whether with design documents
> > > or issues or e-mail discussions. This was discussed ad nauseam in the
> > > other thread so I won't rehash those arguments.
> > >
> > > Aside from these issues, the biggest lost opportunity I see if
> > > > DF/Ballista is "cast away" as it were, is that it becomes unattractive
> > > for the rest of us to build anything on top of these platforms
> > > (because at that point we have a circular dependency, which is the
> > > hellscape we escaped from with Parquet C++). I used the
> > > datafusion-python project as an example — if that were in the Arrow
> > > project I might consider using it in various ways or contribute to it,
> > > but as an external project it's less interesting to me as something to
> > > build on.
> > >
> > > On Wed, Mar 10, 2021 at 12:13 PM Jorge Cardoso Leitão
> > > <jorgecarlei...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > First of all, I want to thank you very much for your work on Ballista
> > > > > and for doing it in an open source environment. It is something that
> > > > > should be emphasised and celebrated.
> > > > >
> > > > > Secondly, with respect to considering donating it to the Apache
> > > > > Foundation and Apache project in particular, I would say that we should
> > > > > be honored by such consideration. In this context, my immediate reaction
> > > > > is: how can we best support Ballista's community?
> > > >
> > > > My initial thoughts in this direction are:
> > > >
> > > > > * create a new git repo for DataFusion and Ballista to reside in (e.g.
> > > > > arrow/ballista)
> > > > > * do not require the release cycle and versioning to be aligned with
> > > > > arrow's release cycle
> > > > > * do not require the usage of JIRA
> > > > > * pin the dependency of DataFusion on Arrow and parquet crate (e.g. to a
> > > > > specific commit)
> > > >
> > > > > I feel that this setup would keep Ballista under the Foundation and
> > > > > Apache Arrow's umbrella and aligned with its goals, while at the same
> > > > > time putting the least amount of burden on its community in terms of
> > > > > keeping a strict release schedule, tooling and CI.
> > > >
> > > > > The rationale for the above is that whenever something is released on
> > > > > DataFusion (which hosts most of the physical ops), people will also want
> > > > > it quickly available on Ballista. Thus, having the two release cycles
> > > > > more closely related and independent of the arrow implementation's cycle
> > > > > is good. DataFusion does not have integration tests against other arrow
> > > > > implementations, and thus the integration tests are not relevant.
> > > >
> > > > > There are 4 main reasons I would not recommend placing it in the
> > > > > mono-repo:
> > > > >
> > > > > 1. It would not add much
> > > > > 2. It would place Ballista on the same release schedule and git system
> > > > > as the rest of Arrow's implementation, which may not suit Ballista's own
> > > > > development pace (in either direction)
> > > > > 3. It further increases the complexity of the current repo
> > > > > 4. It would force its community to use JIRA, merge process, components,
> > > > > etc, which may not be what its community wishes for
> > > >
> > > > > The main risk I see is that because arrow's release cycle is slow and
> > > > > major releases only, DataFusion risks missing arrow features from time to
> > > > > time. We can mitigate this with cargo and pins to commit hashes. IMO this
> > > > > risk exists in any dependency relationship and is usually a sign that
> > > > > there is an API contract and thus a trust relationship involved, which is
> > > > > a good thing.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > > On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <andygrov...@gmail.com> wrote:
> > > >
> > > > > > As many of you know, the reason that I got involved in Arrow back in
> > > > > > 2018 was that I wanted to build a distributed compute platform in Rust,
> > > > > > with capabilities similar to Apache Spark. This led to the creation of
> > > > > > the DataFusion query engine, which is an in-memory query engine and is
> > > > > > now part of the Arrow repo.
> > > > >
> > > > > > Over the past couple of years, I have been working outside of Arrow on
> > > > > > a project named “Ballista” [1] to continue the journey of trying to
> > > > > > build a distributed version. Due to the pandemic, I have had time over
> > > > > > the winter to put more effort into this project and have managed to
> > > > > > build a small community around it over the past few months. The project
> > > > > > has now reached a point where the basic architecture has been proven and
> > > > > > it is now getting a lot of attention (more than 2k stars on GitHub just
> > > > > > recently), and I think that it would now make sense to donate some or
> > > > > > all of the project to Apache Arrow and continue its growth here.
> > > > >
> > > > > > For an overview of the project, please see the talk I recently gave at
> > > > > > the New York Open Statistical Programming Meetup [2].
> > > > >
> > > > > > Some of the benefits that I see in donating the project to Arrow are:
> > > > > >
> > > > > >    - DataFusion also needs a scheduler and it would probably make sense
> > > > > >      to push some parts of the Ballista scheduler down a level in the
> > > > > >      stack so that the same approach is used to scale across cores in
> > > > > >      DataFusion and to scale across nodes in Ballista.
> > > > > >    - Ballista provides preliminary support for spill-to-disk
> > > > > >      functionality (in Arrow IPC format) which could also benefit
> > > > > >      DataFusion and provide better scalability through out-of-core
> > > > > >      processing.
> > > > > >    - Although the Ballista scheduler is implemented in Rust, it is
> > > > > >      possible to implement executors in other languages due to the use
> > > > > >      of Flight, gRPC, and protobuf, so this may be of interest to other
> > > > > >      language implementations of Arrow as well.
> > > > > >    - There is already some overlap between Arrow and Ballista
> > > > > >      contributors.
> > > > > >    - Ballista unit tests will be part of Arrow CI which means that any
> > > > > >      changes to Arrow or DataFusion APIs that Ballista depends on will
> > > > > >      also require that the corresponding Ballista code is updated as
> > > > > >      part of the same PR.
> > > > >
> > > > >
> > > > > > My main goal with this email thread is to gauge interest in donating
> > > > > > this code. If there is interest in doing so then we can have a more
> > > > > > detailed follow-up conversation on exactly what would be donated and
> > > > > > where it would go.
> > > > >
> > > > >
> > > > > > I have also filed a GitHub issue in Ballista to get feedback from
> > > > > > current contributors [3].
> > > > >
> > > > >
> > > > > I'm looking forward to hearing opinions on this!
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > > [1] https://github.com/ballista-compute/ballista
> > > > >
> > > > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> > > > >
> > > > > [3] https://github.com/ballista-compute/ballista/issues/646
> > > > >
> > >
> >
>
