Thanks Wes -- I agree. I think moving DataFusion out of the main Arrow repo only makes sense once the interfaces it depends on (in Arrow and Parquet) have stabilized, as that will minimize the mess / pain you describe.

Andrew

On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney <wesmck...@gmail.com> wrote:

> To give you an example of what I’m talking about. Jorge has been building
> this project
>
> https://github.com/jorgecarleitao/datafusion-python
>
> I think it would actually be preferable to build projects like this in the
> monorepo because of the challenges and opportunities that arise in long
> term project interdependence (API changes, integration testing, etc). The
> more you split up interdependent projects into different GitHub
> repositories, the more difficult it becomes to develop and test them -- we
> had this exact problem (it was awful) with Parquet in C++ which is why the
> code lives in this repository now.
>
> On Wed, Mar 10, 2021 at 9:02 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > There is no problem with having multiple code-containing repositories in
> > Apache Arrow, and the project can produce different release artifacts (for
> > example, Parquet has Parquet-format and Parquet-mr and these release
> > separately). I don’t think it’s a good idea to fragment the project
> > governance / set up a new PMC unless you have two distinct groups of people
> > who are moving in different directions.
> >
> > As an example, Arrow was initially split off from Apache Drill. Arrow now
> > has little relationship with Drill. DataFusion and Ballista are not
> > analogous to that.
> >
> > Different releases can come from the same git repository also. I would
> > just want to make sure you have a proper debate about the long term
> > pros/cons of developing within a monorepo (which again are independent from
> > release logistics, so if these concepts are coupled in any person’s mind
> > please decouple them).
> >
> > On Wed, Mar 10, 2021 at 8:42 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> >> Thanks, Andrew.
> >>
> >> I agree with your points and I do see the argument for DataFusion/Ballista
> >> being in their own repo. When I first donated DataFusion there was a
> >> discussion about the fact that it could be moved back out later on once it
> >> was more mature. I will go see if I can find that conversation.
> >>
> >> Another option here would be to propose creating a new top-level Apache
> >> project but I don't know if these components would qualify or what the
> >> process would be. I imagine they would need to be much more mature before
> >> this would be an option.
> >>
> >> Thanks,
> >>
> >> Andy.
> >>
> >> On Wed, Mar 10, 2021 at 4:13 AM Andrew Lamb <al...@influxdata.com> wrote:
> >>
> >> > My thoughts are:
> >> >
> >> > 1. The scheduler and spill-to-disk/out of core operations sound very good
> >> > to bring into DataFusion and many people would benefit
> >> >
> >> > 2. I think the arrow github project and the unified workflow process in
> >> > particular is reaching its limits. Adding another cool, but non trivial
> >> > project like Ballista will likely exacerbate the challenges even more.
> >> >
> >> > 3. My sense is that the Rust arrow implementation is nearing feature
> >> > completion (though we may still have one last big revamp, depending on
> >> > Jorge's plans) and so I expect breaking API changes there to slow down,
> >> > lessening the value of keeping everything in the same rep.
> >> >
> >> > 4. What would you think about pulling DataFusion out of the arrow crate in
> >> > the medium term (2-3 releases from now) and putting it into a new place
> >> > (alongside Ballista)?
> >> >
> >> > Andrew
> >> >
> >> > On Tue, Mar 9, 2021 at 12:30 PM Andy Grove <andygrov...@gmail.com> wrote:
> >> >
> >> > > As many of you know, the reason that I got involved in Arrow back in 2018
> >> > > was that I wanted to build a distributed compute platform in Rust, with
> >> > > capabilities similar to Apache Spark. This led to the creation of the
> >> > > DataFusion query engine, which is an in-memory query engine and is now
> >> > > part of the Arrow repo.
> >> > >
> >> > > Over the past couple of years, I have been working outside of Arrow on a
> >> > > project named “Ballista” [1] to continue the journey of trying to build a
> >> > > distributed version. Due to the pandemic, I have had time over the winter
> >> > > to put more effort into this project and have managed to build a small
> >> > > community around it over the past few months and the project has now
> >> > > reached a point where the basic architecture has been proven and it is
> >> > > now getting a lot of attention (more than 2k stars on GitHub just
> >> > > recently) and I think that it would now make sense to donate some or all
> >> > > of the project to Apache Arrow and continue its growth here.
> >> > >
> >> > > For an overview of the project, please see the talk I recently gave at
> >> > > the New York Open Statistical Programming Meetup [2].
> >> > >
> >> > > Some of the benefits that I see in donating the project to Arrow are:
> >> > >
> >> > > - DataFusion also needs a scheduler and it would probably make sense to
> >> > >   push some parts of the Ballista scheduler down a level in the stack so
> >> > >   that the same approach is used to scale across cores in DataFusion and
> >> > >   to scale across nodes in Ballista.
> >> > >
> >> > > - Ballista provides preliminary support for spill-to-disk functionality
> >> > >   (in Arrow IPC format) which could also benefit DataFusion and provide
> >> > >   better scalability through out-of-core processing.
> >> > >
> >> > > - Although the Ballista scheduler is implemented in Rust, it is possible
> >> > >   to implement executors in other languages due to the use of Flight,
> >> > >   gRPC, and protobuf, so this may be of interest to other language
> >> > >   implementations of Arrow as well.
> >> > >
> >> > > - There is already some overlap between Arrow and Ballista contributors.
> >> > >
> >> > > - Ballista unit tests will be part of Arrow CI which means that any
> >> > >   changes to Arrow or DataFusion APIs that Ballista depends on will also
> >> > >   require that the corresponding Ballista code is updated as part of the
> >> > >   same PR.
> >> > >
> >> > > My main goal with this email thread is to gauge interest in donating this
> >> > > code. If there is interest in doing so then we can have a more detailed
> >> > > follow-up conversation on exactly what would be donated and where it
> >> > > would go.
> >> > >
> >> > > I have also filed a GitHub issue in Ballista to get feedback from current
> >> > > contributors [3].
> >> > >
> >> > > I'm looking forward to hearing opinions on this!
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Andy.
> >> > >
> >> > > [1] https://github.com/ballista-compute/ballista
> >> > >
> >> > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >> > >
> >> > > [3] https://github.com/ballista-compute/ballista/issues/646