Thanks, Andrew. I agree with your points and I do see the argument for DataFusion/Ballista being in their own repo. When I first donated DataFusion there was a discussion about the fact that it could be moved back out later on once it was more mature. I will go see if I can find that conversation.
Another option here would be to propose creating a new top-level Apache project, but I don't know if these components would qualify or what the process would be. I imagine they would need to be much more mature before this would be an option.

Thanks,

Andy.

On Wed, Mar 10, 2021 at 4:13 AM Andrew Lamb <al...@influxdata.com> wrote:

> My thoughts are:
>
> 1. The scheduler and spill-to-disk/out of core operations sound very good to bring into DataFusion and many people would benefit
>
> 2. I think the arrow github project and the unified workflow process in particular is reaching its limits. Adding another cool, but non trivial project like Ballista will likely exacerbate the challenges even more.
>
> 3. My sense is that the Rust arrow implementation is nearing feature completion (though we may still have one last big revamp, depending on Jorge's plans) and so I expect breaking API changes there to slow down, lessening the value of keeping everything in the same repo.
>
> 4. What would you think about pulling DataFusion out of the arrow crate in the medium term (2-3 releases from now) and putting it into a new place (alongside Ballista)?
>
> Andrew
>
> On Tue, Mar 9, 2021 at 12:30 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > As many of you know, the reason that I got involved in Arrow back in 2018 was that I wanted to build a distributed compute platform in Rust, with capabilities similar to Apache Spark. This led to the creation of the DataFusion query engine, which is an in-memory query engine and is now part of the Arrow repo.
> >
> > Over the past couple of years, I have been working outside of Arrow on a project named “Ballista” [1] to continue the journey of trying to build a distributed version. Due to the pandemic, I have had time over the winter to put more effort into this project and have managed to build a small community around it over the past few months and the project has now reached a point where the basic architecture has been proven and it is now getting a lot of attention (more than 2k stars on GitHub just recently) and I think that it would now make sense to donate some or all of the project to Apache Arrow and continue its growth here.
> >
> > For an overview of the project, please see the talk I recently gave at the New York Open Statistical Programming Meetup [2].
> >
> > Some of the benefits that I see in donating the project to Arrow are:
> >
> > - DataFusion also needs a scheduler and it would probably make sense to push some parts of the Ballista scheduler down a level in the stack so that the same approach is used to scale across cores in DataFusion and to scale across nodes in Ballista.
> > - Ballista provides preliminary support for spill-to-disk functionality (in Arrow IPC format) which could also benefit DataFusion and provide better scalability through out-of-core processing.
> > - Although the Ballista scheduler is implemented in Rust, it is possible to implement executors in other languages due to the use of Flight, gRPC, and protobuf, so this may be of interest to other language implementations of Arrow as well.
> > - There is already some overlap between Arrow and Ballista contributors.
> > - Ballista unit tests will be part of Arrow CI which means that any changes to Arrow or DataFusion APIs that Ballista depends on will also require that the corresponding Ballista code is updated as part of the same PR.
> >
> > My main goal with this email thread is to gauge interest in donating this code. If there is interest in doing so then we can have a more detailed follow-up conversation on exactly what would be donated and where it would go.
> >
> > I have also filed a GitHub issue in Ballista to get feedback from current contributors [3].
> >
> > I'm looking forward to hearing opinions on this!
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/ballista-compute/ballista
> >
> > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >
> > [3] https://github.com/ballista-compute/ballista/issues/646