> think that the problem of "there are too many PRs in the review
> queue that are not relevant to me" has straightforward solutions

For sure -- I welcome any and all technical assistance in improving
efficiency.

> Andrew - do you have more specific concerns that I am missing here?

I think burden on existing maintainers is my primary concern in adding
another major project to the same repo.

I certainly didn't mean to restart a discussion soliciting opinions about
our current tools / process -- it has all been articulated well in previous
threads :)

On Wed, Mar 10, 2021 at 1:13 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> First of all, I want to thank you very much for your work on Ballista and
> for doing it in an open source environment. It is something that should be
> emphasised and celebrated.
>
> Secondly, with respect to considering donating it to the Apache Foundation
> and Apache project in particular, I would say that we should be honored by
> such consideration. In this context, my immediate reaction is: how can we
> best support Ballista's community?
>
> My initial thoughts in this direction are:
>
> * create a new git repo for DataFusion and Ballista to reside on (e.g.
>   arrow/ballista)
> * do not require the release cycle and versioning to be aligned with
>   arrow's release cycle
> * do not require the usage of JIRA
> * pin the dependency of DataFusion on Arrow and parquet crate (e.g. to a
>   specific commit)
>
> I feel that this setup would keep Ballista under the Foundation and Apache
> Arrow's umbrella and aligned with its goals, while at the same time put the
> least amount of burden on its community, both in terms of keeping a strict
> release schedule, tooling and CI.
>
> The rationale for the above is that whenever something is released on
> DataFusion (which hosts most of the physical ops), people will also want it
> quickly available on Ballista. Thus, having the two release cycles more
> closely related and independent of the arrow implementation's cycle is
> good.
> DataFusion does not have integration tests against other arrow
> implementations, and thus the integration tests are not relevant.
>
> There are 4 main reasons I would not recommend placing it in the mono-repo:
>
> 1. It would not add much
> 2. It would place Ballista on the same release schedule and git system as
>    the rest of Arrow's implementation, which may not suit Ballista's own
>    development pace (in either direction)
> 3. It further increases the complexity of the current repo
> 4. It would force its community to use JIRA, merge process, components,
>    etc, which may not be what its community wishes for
>
> The main risk I see is that because arrow's release cycle is slow and major
> releases only, DataFusion risks missing arrow features from time to time.
> We can mitigate this with cargo and pins to commit hashes. IMO this risk
> exists in any dependency relationship and is usually a sign that there is
> an API contract and thus a trust relationship involved, which is a good
> thing.
>
> Best,
> Jorge
>
> On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > As many of you know, the reason that I got involved in Arrow back in 2018
> > was that I wanted to build a distributed compute platform in Rust, with
> > capabilities similar to Apache Spark. This led to the creation of the
> > DataFusion query engine, which is an in-memory query engine and is now
> > part of the Arrow repo.
> >
> > Over the past couple of years, I have been working outside of Arrow on a
> > project named “Ballista” [1] to continue the journey of trying to build a
> > distributed version.
> > Due to the pandemic, I have had time over the winter to put more effort
> > into this project and have managed to build a small community around it
> > over the past few months. The project has now reached a point where the
> > basic architecture has been proven, and it is now getting a lot of
> > attention (more than 2k stars on GitHub just recently). I think that it
> > would now make sense to donate some or all of the project to Apache Arrow
> > and continue its growth here.
> >
> > For an overview of the project, please see the talk I recently gave at
> > the New York Open Statistical Programming Meetup [2].
> >
> > Some of the benefits that I see in donating the project to Arrow are:
> >
> > - DataFusion also needs a scheduler and it would probably make sense to
> >   push some parts of the Ballista scheduler down a level in the stack so
> >   that the same approach is used to scale across cores in DataFusion and
> >   to scale across nodes in Ballista.
> > - Ballista provides preliminary support for spill-to-disk functionality
> >   (in Arrow IPC format) which could also benefit DataFusion and provide
> >   better scalability through out-of-core processing.
> > - Although the Ballista scheduler is implemented in Rust, it is possible
> >   to implement executors in other languages due to the use of Flight,
> >   gRPC, and protobuf, so this may be of interest to other language
> >   implementations of Arrow as well.
> > - There is already some overlap between Arrow and Ballista contributors.
> > - Ballista unit tests will be part of Arrow CI which means that any
> >   changes to Arrow or DataFusion APIs that Ballista depends on will also
> >   require that the corresponding Ballista code is updated as part of the
> >   same PR.
> >
> > My main goal with this email thread is to gauge interest in donating this
> > code.
> > If there is interest in doing so then we can have a more detailed
> > follow-up conversation on exactly what would be donated and where it
> > would go.
> >
> > I have also filed a GitHub issue in Ballista to get feedback from current
> > contributors [3].
> >
> > I'm looking forward to hearing opinions on this!
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/ballista-compute/ballista
> >
> > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >
> > [3] https://github.com/ballista-compute/ballista/issues/646