> think that the problem of "there are too many PRs in the review
> queue that are not relevant to me" has straightforward solutions

For sure -- I welcome any and all technical assistance in improving
efficiency.

> Andrew - do you have more specific concerns that I am missing here?

I think burden on existing maintainers is my primary concern in adding
another major project to the same repo.

I certainly didn't mean to restart a discussion soliciting opinions about
our current tools / process -- it has all been articulated well in previous
threads :)

On Wed, Mar 10, 2021 at 1:13 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> First of all, I want to thank you very much for your work on Ballista and
> for doing it in an open source environment. It is something that should be
> emphasised and celebrated.
>
> Secondly, with respect to considering donating it to the Apache Foundation and
> Apache project in particular, I would say that we should be honored by such
> consideration. In this context, my immediate reaction is: how can we best
> support Ballista's community?
>
> My initial thoughts in this direction are:
>
> * create a new git repo for DataFusion and Ballista to reside on (e.g.
> arrow/ballista)
> * do not require the release cycle and versioning to be aligned with
> arrow's release cycle
> * do not require the usage of JIRA
> * pin the dependency of DataFusion on the Arrow and parquet crates (e.g. to a
> specific commit)
>
> I feel that this setup would keep Ballista under the Foundation and Apache
> Arrow's umbrella and aligned with its goals, while at the same time putting
> the least amount of burden on its community in terms of keeping a strict
> release schedule, tooling, and CI.
>
> The rationale for the above is that whenever something is released on
> DataFusion (which hosts most of the physical ops), people will also want it
> quickly available on Ballista. Thus, having the two release cycles more
> closely related and independent of the arrow implementation's cycle is
> good. DataFusion does not have integration tests against other arrow
> implementations, and thus the integration tests are not relevant.
>
> There are 4 main reasons I would not recommend placing it in the mono-repo:
>
> 1. It would not add much
> 2. It would place Ballista on the same release schedule and git system as
> the rest of Arrow's implementation, which may not suit Ballista's own
> development pace (in either direction)
> 3. It further increases the complexity of the current repo
> 4. It would force its community to use JIRA, the merge process, components,
> etc., which may not be what its community wishes for
>
> The main risk I see is that because arrow's release cycle is slow and major
> releases only, DataFusion risks missing arrow features from time to time.
> We can mitigate this with cargo and pins to commit hashes. IMO this risk
> exists in any dependency relationship and is usually a sign that there is
> an API contract and thus a trust relationship involved, which is a good
> thing.
>
> Best,
> Jorge
>
> On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > As many of you know, the reason that I got involved in Arrow back in 2018
> > was that I wanted to build a distributed compute platform in Rust, with
> > capabilities similar to Apache Spark. This led to the creation of the
> > DataFusion query engine, which is an in-memory query engine and is now part
> > of the Arrow repo.
> >
> > Over the past couple of years, I have been working outside of Arrow on a
> > project named “Ballista” [1] to continue the journey of trying to build a
> > distributed version. Due to the pandemic, I have had time over the winter
> > to put more effort into this project and have managed to build a small
> > community around it over the past few months. The project has now
> > reached a point where the basic architecture has been proven, and it is now
> > getting a lot of attention (more than 2k stars on GitHub just recently).
> > I think that it would now make sense to donate some or all of the project
> > to Apache Arrow and continue its growth here.
> >
> > For an overview of the project, please see the talk I recently gave at the
> > New York Open Statistical Programming Meetup [2].
> >
> > Some of the benefits that I see in donating the project to Arrow are:
> >
> >    - DataFusion also needs a scheduler and it would probably make sense to
> >      push some parts of the Ballista scheduler down a level in the stack so
> >      that the same approach is used to scale across cores in DataFusion and
> >      to scale across nodes in Ballista.
> >    - Ballista provides preliminary support for spill-to-disk functionality
> >      (in Arrow IPC format) which could also benefit DataFusion and provide
> >      better scalability through out-of-core processing.
> >    - Although the Ballista scheduler is implemented in Rust, it is possible
> >      to implement executors in other languages due to the use of Flight,
> >      gRPC, and protobuf, so this may be of interest to other language
> >      implementations of Arrow as well.
> >    - There is already some overlap between Arrow and Ballista contributors.
> >    - Ballista unit tests will be part of Arrow CI which means that any
> >      changes to Arrow or DataFusion APIs that Ballista depends on will also
> >      require that the corresponding Ballista code is updated as part of the
> >      same PR.
> >
> > My main goal with this email thread is to gauge interest in donating this
> > code. If there is interest in doing so then we can have a more detailed
> > follow-up conversation on exactly what would be donated and where it would
> > go.
> >
> >
> > I have also filed a GitHub issue in Ballista to get feedback from current
> > contributors [3].
> >
> >
> > I'm looking forward to hearing opinions on this!
> >
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/ballista-compute/ballista
> >
> > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >
> > [3] https://github.com/ballista-compute/ballista/issues/646
> >
>
