Thanks, Andrew.

I agree with your points and I do see the argument for DataFusion/Ballista
being in their own repo. When I first donated DataFusion there was a
discussion about the fact that it could be moved back out later on once it
was more mature. I will go see if I can find that conversation.

Another option here would be to propose creating a new top-level Apache
project, but I don't know whether these components would qualify or what
the process would be. I imagine they would need to be much more mature
before this would be an option.

Thanks,

Andy.





On Wed, Mar 10, 2021 at 4:13 AM Andrew Lamb <al...@influxdata.com> wrote:

> My thoughts are:
>
> 1. The scheduler and spill-to-disk/out-of-core operations sound very good
> to bring into DataFusion, and many people would benefit.
>
> 2. I think the arrow github project, and the unified workflow process in
> particular, is reaching its limits. Adding another cool but non-trivial
> project like Ballista will likely exacerbate the challenges even more.
>
> 3. My sense is that the Rust arrow implementation is nearing feature
> completion (though we may still have one last big revamp, depending on
> Jorge's plans) and so I expect breaking API changes there to slow down,
> lessening the value of keeping everything in the same repo.
>
> 4. What would you think about pulling DataFusion out of the arrow crate in
> the medium term (2-3 releases from now) and putting it into a new place
> (alongside Ballista)?
>
> Andrew
>
> On Tue, Mar 9, 2021 at 12:30 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > As many of you know, the reason that I got involved in Arrow back in 2018
> > was that I wanted to build a distributed compute platform in Rust, with
> > capabilities similar to Apache Spark. This led to the creation of the
> > DataFusion query engine, which is an in-memory query engine and is now
> > part of the Arrow repo.
> >
> > Over the past couple of years, I have been working outside of Arrow on a
> > project named “Ballista” [1] to continue the journey of trying to build a
> > distributed version. Due to the pandemic, I have had time over the winter
> > to put more effort into this project and have managed to build a small
> > community around it over the past few months. The project has now
> > reached a point where the basic architecture has been proven, and it is
> > now getting a lot of attention (more than 2k stars on GitHub just
> > recently). I think that it would now make sense to donate some or all of
> > the project to Apache Arrow and continue its growth here.
> >
> > For an overview of the project, please see the talk I recently gave at
> > the New York Open Statistical Programming Meetup [2].
> >
> > Some of the benefits that I see in donating the project to Arrow are:
> >
> >
> >    - DataFusion also needs a scheduler and it would probably make sense to
> >      push some parts of the Ballista scheduler down a level in the stack so
> >      that the same approach is used to scale across cores in DataFusion and
> >      to scale across nodes in Ballista.
> >
> >    - Ballista provides preliminary support for spill-to-disk functionality
> >      (in Arrow IPC format) which could also benefit DataFusion and provide
> >      better scalability through out-of-core processing.
> >
> >    - Although the Ballista scheduler is implemented in Rust, it is possible
> >      to implement executors in other languages due to the use of Flight,
> >      gRPC, and protobuf, so this may be of interest to other language
> >      implementations of Arrow as well.
> >
> >    - There is already some overlap between Arrow and Ballista contributors.
> >
> >    - Ballista unit tests will be part of Arrow CI which means that any
> >      changes to Arrow or DataFusion APIs that Ballista depends on will also
> >      require that the corresponding Ballista code is updated as part of the
> >      same PR.
> >
> >
> > My main goal with this email thread is to gauge interest in donating this
> > code. If there is interest in doing so then we can have a more detailed
> > follow-up conversation on exactly what would be donated and where it
> > would go.
> >
> >
> > I have also filed a GitHub issue in Ballista to get feedback from current
> > contributors [3].
> >
> >
> > I'm looking forward to hearing opinions on this!
> >
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/ballista-compute/ballista
> >
> > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >
> > [3] https://github.com/ballista-compute/ballista/issues/646
> >
>
