Thanks Wes -- I agree. I think moving datafusion out of the main arrow repo
only makes sense when the interfaces it depends on (in arrow and parquet)
have stabilized as that will minimize the mess / pain you describe.

Andrew



On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney <wesmck...@gmail.com> wrote:

> To give you an example of what I’m talking about: Jorge has been building
> this project
>
> https://github.com/jorgecarleitao/datafusion-python
>
> I think it would actually be preferable to build projects like this in the
> monorepo because of the challenges and opportunities that arise in long
> term project interdependence (API changes, integration testing, etc). The
> more you split up interdependent projects into different GitHub
> repositories, the more difficult it becomes to develop and test them — we
> had this exact problem (it was awful) with Parquet in C++ which is why the
> code lives in this repository now.
>
> On Wed, Mar 10, 2021 at 9:02 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > There is no problem with having multiple code-containing repositories in
> > Apache Arrow, and the project can produce different release artifacts
> > (for example, Parquet has Parquet-format and Parquet-mr and these release
> > separately). I don’t think it’s a good idea to fragment the project
> > governance / set up a new PMC unless you have two distinct groups of
> > people who are moving in different directions.
> >
> > As an example, Arrow was initially split off from Apache Drill. Arrow now
> > has little relationship with Drill. DataFusion and Ballista are not
> > analogous to that.
> >
> > Different releases can come from the same git repository also. I would
> > just want to make sure you have a proper debate about the long term
> > pros/cons of developing within a monorepo (which again are independent
> > from release logistics, so if these concepts are coupled in any person’s
> > mind please decouple them).
> >
> > On Wed, Mar 10, 2021 at 8:42 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> >> Thanks, Andrew.
> >>
> >> I agree with your points and I do see the argument for
> >> DataFusion/Ballista being in their own repo. When I first donated
> >> DataFusion there was a discussion about the fact that it could be moved
> >> back out later on once it was more mature. I will go see if I can find
> >> that conversation.
> >>
> >> Another option here would be to propose creating a new top-level Apache
> >> project but I don't know if these components would qualify or what the
> >> process would be. I imagine they would need to be much more mature
> >> before this would be an option.
> >>
> >> Thanks,
> >>
> >> Andy.
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Mar 10, 2021 at 4:13 AM Andrew Lamb <al...@influxdata.com> wrote:
> >>
> >> > My thoughts are:
> >> >
> >> > 1. The scheduler and spill-to-disk/out of core operations sound very
> >> > good to bring into DataFusion and many people would benefit.
> >> >
> >> > 2. I think the arrow github project and the unified workflow process in
> >> > particular is reaching its limits. Adding another cool, but non trivial
> >> > project like Ballista will likely exacerbate the challenges even more.
> >> >
> >> > 3. My sense is that the Rust arrow implementation is nearing feature
> >> > completion (though we may still have one last big revamp, depending on
> >> > Jorge's plans) and so I expect breaking API changes there to slow down,
> >> > lessening the value of keeping everything in the same repo.
> >> >
> >> > 4. What would you think about pulling DataFusion out of the arrow
> >> > crate in the medium term (2-3 releases from now) and putting it into a
> >> > new place (alongside Ballista)?
> >> >
> >> > Andrew
> >> >
> >> > On Tue, Mar 9, 2021 at 12:30 PM Andy Grove <andygrov...@gmail.com> wrote:
> >> >
> >> > > As many of you know, the reason that I got involved in Arrow back in
> >> > > 2018 was that I wanted to build a distributed compute platform in Rust,
> >> > > with capabilities similar to Apache Spark. This led to the creation of
> >> > > the DataFusion query engine, which is an in-memory query engine and is
> >> > > now part of the Arrow repo.
> >> > >
> >> > > Over the past couple of years, I have been working outside of Arrow on
> >> > > a project named “Ballista” [1] to continue the journey of trying to
> >> > > build a distributed version. Due to the pandemic, I have had time over
> >> > > the winter to put more effort into this project and have managed to
> >> > > build a small community around it over the past few months. The project
> >> > > has now reached a point where the basic architecture has been proven
> >> > > and it is getting a lot of attention (more than 2k stars on GitHub just
> >> > > recently), so I think that it would now make sense to donate some or
> >> > > all of the project to Apache Arrow and continue its growth here.
> >> > >
> >> > > For an overview of the project, please see the talk I recently gave at
> >> > > the New York Open Statistical Programming Meetup [2].
> >> > >
> >> > > Some of the benefits that I see in donating the project to Arrow are:
> >> > >
> >> > >    - DataFusion also needs a scheduler and it would probably make sense
> >> > >      to push some parts of the Ballista scheduler down a level in the
> >> > >      stack so that the same approach is used to scale across cores in
> >> > >      DataFusion and to scale across nodes in Ballista.
> >> > >    - Ballista provides preliminary support for spill-to-disk
> >> > >      functionality (in Arrow IPC format) which could also benefit
> >> > >      DataFusion and provide better scalability through out-of-core
> >> > >      processing.
> >> > >    - Although the Ballista scheduler is implemented in Rust, it is
> >> > >      possible to implement executors in other languages due to the use
> >> > >      of Flight, gRPC, and protobuf, so this may be of interest to other
> >> > >      language implementations of Arrow as well.
> >> > >    - There is already some overlap between Arrow and Ballista
> >> > >      contributors.
> >> > >    - Ballista unit tests will be part of Arrow CI which means that any
> >> > >      changes to Arrow or DataFusion APIs that Ballista depends on will
> >> > >      also require that the corresponding Ballista code is updated as
> >> > >      part of the same PR.
> >> > >
> >> > >
> >> > > My main goal with this email thread is to gauge interest in donating
> >> > > this code. If there is interest in doing so then we can have a more
> >> > > detailed follow-up conversation on exactly what would be donated and
> >> > > where it would go.
> >> > >
> >> > >
> >> > > I have also filed a GitHub issue in Ballista to get feedback from
> >> > > current contributors [3].
> >> > >
> >> > >
> >> > > I'm looking forward to hearing opinions on this!
> >> > >
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Andy.
> >> > >
> >> > > [1] https://github.com/ballista-compute/ballista
> >> > >
> >> > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >> > >
> >> > > [3] https://github.com/ballista-compute/ballista/issues/646
> >> > >
> >> >
> >>
> >
>
