Wes - thanks for the clarification around possibilities for having multiple
repositories within Arrow governance. I agree that having separate repos
increases burdens around integration testing and dependency/release
management, and that having a monorepo makes those things much simpler.

I think it is worth digging more into this point from Andrew:

> 2. I think the arrow github project and the unified workflow process in
> particular is reaching its limits. Adding another cool, but non trivial
> project like Ballista will likely exacerbate the challenges even more.

I do see that adding Ballista will increase the burden on current
DataFusion maintainers, and they may not be all that interested in Ballista
itself. Ballista would potentially bring along additional contributors as
well, increasing the burden of reviewing PRs in the short term (hopefully
at some point we would have additional committers that are motivated to
work on Ballista).

I would certainly try to handle most of the Ballista PR reviews to start
with, until we reach a point where more people can do that, and this would
lead to me being more closely involved in DataFusion reviews as well.

Andrew - do you have more specific concerns that I am missing here?

Thanks,

Andy.






On Wed, Mar 10, 2021 at 9:01 AM Andrew Lamb <al...@influxdata.com> wrote:

> Thanks Wes -- I agree. I think moving datafusion out of the main arrow repo
> only makes sense when the interfaces it depends on (in arrow and parquet)
> have stabilized as that will minimize the mess / pain you describe.
>
> Andrew
>
>
>
> On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > To give you an example of what I’m talking about. Jorge has been building
> > this project
> >
> > https://github.com/jorgecarleitao/datafusion-python
> >
> > I think it would actually be preferable to build projects like this in the
> > monorepo because of the challenges and opportunities that arise in long
> > term project interdependence (API changes, integration testing, etc). The
> > more you split up interdependent projects into different GitHub
> > repositories, the more difficult it becomes to develop and test them — we
> > had this exact problem (it was awful) with Parquet in C++ which is why the
> > code lives in this repository now.
> >
> > On Wed, Mar 10, 2021 at 9:02 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > There is no problem with having multiple code-containing repositories in
> > > Apache Arrow, and the project can produce different release artifacts (for
> > > example, Parquet has Parquet-format and Parquet-mr and these release
> > > separately). I don’t think it’s a good idea to fragment the project
> > > governance / set up a new PMC unless you have two distinct groups of people
> > > who are moving in different directions.
> > >
> > > As an example, Arrow was initially split off from Apache Drill. Arrow now
> > > has little relationship with Drill. DataFusion and Ballista are not
> > > analogous to that.
> > >
> > > Different releases can come from the same git repository also. I would
> > > just want to make sure you have a proper debate about the long term
> > > pros/cons of developing within a monorepo (which again are independent from
> > > release logistics, so if these concepts are coupled in any person’s mind
> > > please decouple them).
> > >
> > > On Wed, Mar 10, 2021 at 8:42 AM Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > >> Thanks, Andrew.
> > >>
> > >> I agree with your points and I do see the argument for DataFusion/Ballista
> > >> being in their own repo. When I first donated DataFusion there was a
> > >> discussion about the fact that it could be moved back out later on once it
> > >> was more mature. I will go see if I can find that conversation.
> > >>
> > >> Another option here would be to propose creating a new top-level Apache
> > >> project but I don't know if these components would qualify or what the
> > >> process would be. I imagine they would need to be much more mature before
> > >> this would be an option.
> > >>
> > >> Thanks,
> > >>
> > >> Andy.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Mar 10, 2021 at 4:13 AM Andrew Lamb <al...@influxdata.com> wrote:
> > >>
> > >> > My thoughts are:
> > >> >
> > >> > 1. The scheduler and spill-to-disk/out of core operations sound very good
> > >> > to bring into DataFusion and many people would benefit
> > >> >
> > >> > 2. I think the arrow github project and the unified workflow process in
> > >> > particular is reaching its limits. Adding another cool, but non trivial
> > >> > project like Ballista will likely exacerbate the challenges even more.
> > >> >
> > >> > 3. My sense is that the Rust arrow implementation is nearing feature
> > >> > completion (though we may still have one last big revamp, depending on
> > >> > Jorge's plans) and so I expect breaking API changes there to slow down,
> > >> > lessening the value of keeping everything in the same repo.
> > >> >
> > >> > 4. What would you think about pulling DataFusion out of the arrow crate in
> > >> > the medium term (2-3 releases from now) and putting it into a new place
> > >> > (alongside Ballista)?
> > >> >
> > >> > Andrew
> > >> >
> > >> > On Tue, Mar 9, 2021 at 12:30 PM Andy Grove <andygrov...@gmail.com> wrote:
> > >> >
> > >> > > As many of you know, the reason that I got involved in Arrow back in 2018
> > >> > > was that I wanted to build a distributed compute platform in Rust, with
> > >> > > capabilities similar to Apache Spark. This led to the creation of the
> > >> > > DataFusion query engine, which is an in-memory query engine and is now part
> > >> > > of the Arrow repo.
> > >> > >
> > >> > > Over the past couple of years, I have been working outside of Arrow on a
> > >> > > project named “Ballista” [1] to continue the journey of trying to build a
> > >> > > distributed version. Due to the pandemic, I have had time over the winter
> > >> > > to put more effort into this project and have managed to build a small
> > >> > > community around it over the past few months and the project has now
> > >> > > reached a point where the basic architecture has been proven and it is now
> > >> > > getting a lot of attention (more than 2k stars on GitHub just recently) and
> > >> > > I think that it would now make sense to donate some or all of the project
> > >> > > to Apache Arrow and continue its growth here.
> > >> > >
> > >> > > For an overview of the project, please see the talk I recently gave at the
> > >> > > New York Open Statistical Programming Meetup [2].
> > >> > >
> > >> > > Some of the benefits that I see in donating the project to Arrow are:
> > >> > >
> > >> > >    - DataFusion also needs a scheduler and it would probably make sense to
> > >> > >      push some parts of the Ballista scheduler down a level in the stack so
> > >> > >      that the same approach is used to scale across cores in DataFusion and
> > >> > >      to scale across nodes in Ballista.
> > >> > >    - Ballista provides preliminary support for spill-to-disk functionality
> > >> > >      (in Arrow IPC format) which could also benefit DataFusion and provide
> > >> > >      better scalability through out-of-core processing.
> > >> > >    - Although the Ballista scheduler is implemented in Rust, it is possible
> > >> > >      to implement executors in other languages due to the use of Flight,
> > >> > >      gRPC, and protobuf, so this may be of interest to other language
> > >> > >      implementations of Arrow as well.
> > >> > >    - There is already some overlap between Arrow and Ballista contributors.
> > >> > >    - Ballista unit tests will be part of Arrow CI which means that any
> > >> > >      changes to Arrow or DataFusion APIs that Ballista depends on will also
> > >> > >      require that the corresponding Ballista code is updated as part of the
> > >> > >      same PR.
> > >> > >
> > >> > >
> > >> > > My main goal with this email thread is to gauge interest in donating this
> > >> > > code. If there is interest in doing so then we can have a more detailed
> > >> > > follow-up conversation on exactly what would be donated and where it would
> > >> > > go.
> > >> > >
> > >> > >
> > >> > > I have also filed a GitHub issue in Ballista to get feedback from current
> > >> > > contributors [3].
> > >> > >
> > >> > >
> > >> > > I'm looking forward to hearing opinions on this!
> > >> > >
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Andy.
> > >> > >
> > >> > > [1] https://github.com/ballista-compute/ballista
> > >> > >
> > >> > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> > >> > >
> > >> > > [3] https://github.com/ballista-compute/ballista/issues/646
> > >> > >
> > >> >
> > >>
> > >
> >
>
