Hi Jack,

Thanks for the input, and there are some interesting ideas there.

If we were looking to break this into separate donations, though, I would
actually consider 2+3 to be the first piece to incorporate into DataFusion,
because it would provide much better scalability than the current model,
where we eagerly try to execute the entire query tree concurrently.
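
To make the contrast concrete, here is a minimal sketch in Rust. It is purely
illustrative and is not the actual Ballista or DataFusion scheduler code: the
function name `plan_waves` and the idea of identifying query stages by index
are assumptions made for this example. Instead of starting every stage at
once, it only releases a stage once all of the stages it reads from have
completed.

```rust
/// Group query stages into "waves". Every stage in a wave has all of its
/// inputs produced by earlier waves, so a wave can be handed to executors
/// while later stages have not been started yet.
///
/// `deps[i]` lists the stages whose output stage `i` reads (hypothetical
/// representation; every index is assumed to be smaller than `deps.len()`).
/// Returns `None` if the dependencies contain a cycle.
pub fn plan_waves(deps: &[Vec<usize>]) -> Option<Vec<Vec<usize>>> {
    let n = deps.len();
    let mut done = vec![false; n];
    let mut waves = Vec::new();
    let mut remaining = n;

    while remaining > 0 {
        // A stage is ready once every stage it depends on has completed.
        let ready: Vec<usize> = (0..n)
            .filter(|&s| !done[s] && deps[s].iter().all(|&d| done[d]))
            .collect();
        if ready.is_empty() {
            // Nothing can make progress, so the dependencies are cyclic.
            return None;
        }
        for &s in &ready {
            done[s] = true;
        }
        remaining -= ready.len();
        waves.push(ready);
    }
    Some(waves)
}
```

For a plan where stages 0 and 1 are scans, stage 2 joins them, and stage 3
aggregates the join, `plan_waves(&[vec![], vec![], vec![0, 1], vec![2]])`
yields three waves: `[0, 1]`, then `[2]`, then `[3]`. An eager model would
start all four stages at the same time.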

I do think having Ballista in the same repo would make it easier to look at
pushing certain pieces down into the DataFusion crate rather than trying to
coordinate this across two projects where only one of them is under Arrow
governance.

Thanks,

Andy.

On Thu, Mar 11, 2021 at 12:47 PM Jack Chan <j4ck....@gmail.com> wrote:

> Hey Andy
>
> I want to discuss the areas of Ballista code that you proposed above to
> move to Arrow. These are:
> 1. serde code for translating between protobuf and
> Arrow/DataFusion/Ballista data structures
> 2. Distributed query planner
> 3. Scheduler process that coordinates query execution across available
> executors
> 4. Executor process that implements Flight protocol and executes query
> partitions and serializes results in Arrow IPC format
>
> So, 1+4 would make DataFusion an application server that can communicate
> through IPC. This is a good thing and easy to maintain. And, 2+3 is the
> distributed computing part that is orthogonal to what DataFusion is doing.
> This is the more architectural and strategic part. Would it make sense to
> separate the discussion into two? i.e. we can move 1+4 into DataFusion
> short-term, and discuss more about 2+3 in longer-term. (This would create
> some extra work in Ballista. And the only thing I am aware of is to
> refactor the executor to not have a hard dependency on scheduler.)
>
>
> Jack
>
> On Thu, Mar 11, 2021 at 9:49 AM Andy Grove <andygrov...@gmail.com> wrote:
>
> > Thanks, Micah.
> >
> > Regarding integration testing, we currently have an integration test
> > script in the repo that spins up multiple processes in docker compose and
> > runs through a series of queries on a data set that can be generated
> > locally. I invested in some modest hardware (a refurbed 12 core proliant
> > rack server with 64 GB RAM) to be able to run these tests via CI (using
> > BuildKite) but have not got this set up yet. I am hopeful that with
> > Ballista in Apache Arrow it will be easier to find companies willing to
> > contribute a more scalable solution than this. In the short term, I can at
> > least run these tests nightly from master and catch regressions quickly.
> >
> > I agree with your views on tooling / workflow and I am going to step up
> > and start working with the Rust community to really dig into this and put
> > together some concrete proposals. The conversation does keep coming up,
> > and not just here on the mailing list. I am hearing many of the same
> > concerns from current Ballista contributors so there are valid concerns
> > here that we need to address, and I believe that we can address them over
> > time with some incremental improvements, but let's not get into that
> > discussion again here. I will follow up hopefully next week with something
> > on this.
> >
> > On Thu, Mar 11, 2021 at 9:49 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > I think having Ballista in Arrow sounds like a good idea in the short
> > > term.  It sounds like there is enough developer pain that bringing it
> > > here makes sense (providing existing Ballista contributors are happy
> > > with the change and current Rust maintainers are open to the work
> > > involved).
> > >
> > > One longer term concern is CI.  Setting up a good system for distributed
> > > testing requires a lot of investment and compute resources, but I think
> > > we can figure it out when it comes time.  In the short term it seems a
> > > mono-repo reduces the engineering effort to get a sane CI system working.
> > >
> > > As a point of reference Flink, Beam and Spark all seem to use mono-repos
> > > (their goals are likely a little different than Arrow's though).
> > >
> > > -Micah
> > >
> > > P.S.  I do think the tooling/workflow conversation should be discussed
> > > more but I think having a more concrete proposal that first starts from
> > > requirements and nice to haves and then gets to a proposed solution is
> > > important (i.e. pointing out pain points and problems is useful, but I
> > > think it ignores some of the current value the existing process
> > > provides).
> > >
> > > On Wed, Mar 10, 2021 at 5:13 PM Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > > > Thanks for the feedback so far on this proposal. I really appreciate
> > > > everyone taking the time to put so much thought (and passion!) into
> > > > this.
> > > >
> > > > So far, I don't think anyone is opposed to the idea of donating
> > > > Ballista but there are clearly concerns about an increased burden on
> > > > current maintainers.
> > > >
> > > > We also have re-started discussions around tooling and release
> > > > processes, but it seems that there is no objection to Rust /
> > > > DataFusion / Ballista having more control over the release process but
> > > > we have to put in the work to make that happen. I am certainly
> > > > motivated to help with this but I think that is a separate
> > > > conversation to donating Ballista.
> > > >
> > > > To reduce the burden on existing maintainers, we could consider
> > > > initially adding Ballista in such a way that it doesn't slow down
> > > > momentum on Arrow & DataFusion by adding it as a separate Rust
> > > > subproject that is not part of the Rust workspace, and have it depend
> > > > on pinned commits initially. This would be a lightweight way of
> > > > incubating the project within the mono-repo and at some point, we can
> > > > add it to the main workspace. This would be no worse than the current
> > > > situation, and it would be better because it is at least under Arrow
> > > > governance.
> > > >
> > > > I would like to talk a bit more specifically about the donation at
> > > > this point now that there is some feedback.
> > > >
> > > > What I propose we donate from Ballista is:
> > > >
> > > >    - The ballista.proto file that defines an encoding for logical and
> > > >      physical query plans as well as cluster meta-data (this protobuf
> > > >      file could eventually be split into separate files for each area)
> > > >    - The Rust source code, which consists of these main areas:
> > > >       - serde code for translating between protobuf and
> > > >         Arrow/DataFusion/Ballista data structures
> > > >       - Distributed query planner
> > > >       - Scheduler process that coordinates query execution across
> > > >         available executors
> > > >       - Executor process that implements Flight protocol and executes
> > > >         query partitions and serializes results in Arrow IPC format
> > > >
> > > > I am proposing that we specifically exclude the following parts of the
> > > > Ballista repo from the donation:
> > > >
> > > >    - The work-in-progress JDBC driver which is not currently functional
> > > >    - The Spark benchmark code that I have been using for comparing
> > > >      performance
> > > >    - The Python bindings, which as far as I know are pretty much a fork
> > > >      of Jorge's datafusion-python project.
> > > >
> > > > I think it is also worth mentioning that Ballista is currently only
> > > > ~8k lines of code, which is pretty small in contrast to the >100k
> > > > lines of code in the Arrow Rust project currently.
> > > >
> > > > Let's keep the conversation going and see what other feedback there is
> > > > regarding the merits of donating Ballista, or not.
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > On Wed, Mar 10, 2021 at 3:13 PM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Wes, thanks a lot for your reply. Let me try to answer:
> > > > >
> > > > > > 1. If the purpose of Ballista is to support multiple language
> > > > > > executors, what does segregating it from the other PL's (where
> > > > > > executors are being developed, too) serve to facilitate this goal?
> > > > >
> > > > > It facilitates because the stronger the coupling is, the more
> > > > > entropic the setup is, and the more energy is required to develop
> > > > > and maintain it. In this particular case, I imagine that each
> > > > > executor would depend on specific versions of each implementation,
> > > > > just like any other dependent that is not maintained by Apache Arrow
> > > > > does.
> > > > >
> > > > > Or is the idea that every dependent should be on the mono-repo? If
> > > > > we need to control our dependents like that, that usually indicates
> > > > > that we can't guarantee a stable API (which IMO is the root cause).
> > > > >
> > > > > > 2. Use of the monorepo does not require a synchronized release
> > > > > > cycle, just as Rust does not require it now either. The only
> > > > > > reason there have not been independent Rust releases is because
> > > > > > someone has not volunteered to do it. Likewise, if DataFusion and
> > > > > > Ballista are in the same git repository, they don't have to
> > > > > > release at the same time as the core arrow / parquet crates.
> > > > >
> > > > > I thought that Rust needed to be synchronized with the major release
> > > > > of the repo. Isn't this the case anymore?
> > > > >
> > > > > > 3. On an incremental basis, I do not believe the increased
> > > > > > complexity is significant. A multi-repository setup can be
> > > > > > actively worse when development work involves both repositories at
> > > > > > the same time. This can be mitigated by pinning the arrow /
> > > > > > parquet crates as you point out, but that creates other issues.
> > > > >
> > > > > Could you enumerate parts from DataFusion or Ballista that would
> > > > > require work on Arrow at the same time? I proposed that division
> > > > > because I am reasonably confident they will not need to be developed
> > > > > at the same time. I am confident of this because a) the APIs used by
> > > > > DataFusion are written to minimize public surfaces, so that arrow
> > > > > can mutate without affecting those APIs; b) I designed and
> > > > > implemented most of the DataFusion code around built-in functions,
> > > > > aggregate functions, UDFs and UDAFs.
> > > > >
> > > > > But maybe we can validate this here: Andy, during the development of
> > > > > Ballista, on which the largest changes on Arrow repo were needed,
> > > > > did you have to change anything on the Arrow crate or parquet crate,
> > > > > or was everything done on DataFusion? If yes to any, was there a
> > > > > significant burden in doing so?
> > > > >
> > > > > > 4. Even without Jira, there is still the expectation for
> > > > > > contributors to communicate in a way that is compatible with the
> > > > > > Apache Way. So even without Jira, PMCs have an obligation to
> > > > > > establish an alternative structure to have consistently open
> > > > > > dialogue / planning about what people are working on or planning
> > > > > > to work on in the future. If contributors are extensively
> > > > > > discussing / planning privately, these discussions must be moved
> > > > > > into the open, whether with design documents or issues or e-mail
> > > > > > discussions. This was discussed ad nauseam in the other thread so
> > > > > > I won't rehash those arguments.
> > > > >
> > > > > I fully agree, even though I think it is a bit difficult to
> > > > > operationalize. Thus, let's try like this: would you consider, under
> > > > > the definition used above, discussions happening on github PRs and
> > > > > issues, such as what airflow does
> > > > > <https://github.com/apache/airflow/issues> , as open?
> > > > >
> > > > > > Aside from these issues, the biggest lost opportunity I see if
> > > > > > DF/Ballista are "cast away" as it were, is that it becomes
> > > > > > unattractive for the rest of us to build anything on top of these
> > > > > > platforms (because at that point we have a circular dependency,
> > > > > > which is the hellscape we escaped from with Parquet C++). I used
> > > > > > the datafusion-python project as an example: if that were in the
> > > > > > Arrow project I might consider using it in various ways or
> > > > > > contribute to it, but as an external project it's less interesting
> > > > > > to me as something to build on.
> > > > >
> > > > > My feelings about transferring datafusion-python to arrow are shared
> > > > > above: I see the idea of picking something that is well encapsulated
> > > > > and decoupled from the rest and blending it into something large and
> > > > > less decoupled as an entropy-generating activity, which requires
> > > > > more energy to maintain. Operationally, the way I would merge a
> > > > > project like datafusion-python into Apache would be by transferring
> > > > > ownership of the repo on github, transferring ownership of the pypi
> > > > > project, and creating some secrets on github to keep twine working,
> > > > > just like I mentioned for Ballista. If people lose interest in the
> > > > > project, then deprecating it would be trivial (archive the repo). If
> > > > > people gain interest in it, growth is also trivial (there is already
> > > > > a house in place and the goals are well defined). The interfaces are
> > > > > the API contracts declared as pinned dependencies (in Cargo.toml /
> > > > > setup.py).
> > > > >
> > > > > Best,
> > > > > Jorge
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 10, 2021 at 7:50 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > > hi Jorge,
> > > > > >
> > > > > > I have some thoughts / questions on your arguments against use of
> > > > > > the monorepo:
> > > > > >
> > > > > > 1. If the purpose of Ballista is to support multiple language
> > > > > > executors, what does segregating it from the other PL's (where
> > > > > > executors are being developed, too) serve to facilitate this goal?
> > > > > >
> > > > > > 2. Use of the monorepo does not require a synchronized release
> > > > > > cycle, just as Rust does not require it now either. The only
> > > > > > reason there have not been independent Rust releases is because
> > > > > > someone has not volunteered to do it. Likewise, if DataFusion and
> > > > > > Ballista are in the same git repository, they don't have to
> > > > > > release at the same time as the core arrow / parquet crates.
> > > > > >
> > > > > > 3. On an incremental basis, I do not believe the increased
> > > > > > complexity is significant. A multi-repository setup can be
> > > > > > actively worse when development work involves both repositories at
> > > > > > the same time. This can be mitigated by pinning the arrow /
> > > > > > parquet crates as you point out, but that creates other issues.
> > > > > >
> > > > > > 4. Even without Jira, there is still the expectation for
> > > > > > contributors to communicate in a way that is compatible with the
> > > > > > Apache Way. So even without Jira, PMCs have an obligation to
> > > > > > establish an alternative structure to have consistently open
> > > > > > dialogue / planning about what people are working on or planning
> > > > > > to work on in the future. If contributors are extensively
> > > > > > discussing / planning privately, these discussions must be moved
> > > > > > into the open, whether with design documents or issues or e-mail
> > > > > > discussions. This was discussed ad nauseam in the other thread so
> > > > > > I won't rehash those arguments.
> > > > > >
> > > > > > Aside from these issues, the biggest lost opportunity I see if
> > > > > > DF/Ballista are "cast away" as it were, is that it becomes
> > > > > > unattractive for the rest of us to build anything on top of these
> > > > > > platforms (because at that point we have a circular dependency,
> > > > > > which is the hellscape we escaped from with Parquet C++). I used
> > > > > > the datafusion-python project as an example: if that were in the
> > > > > > Arrow project I might consider using it in various ways or
> > > > > > contribute to it, but as an external project it's less interesting
> > > > > > to me as something to build on.
> > > > > >
> > > > > > On Wed, Mar 10, 2021 at 12:13 PM Jorge Cardoso Leitão
> > > > > > <jorgecarlei...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > First of all, I want to thank you very much for your work on
> > > > > > > Ballista and for doing it in an open source environment. It is
> > > > > > > something that should be emphasised and celebrated.
> > > > > > >
> > > > > > > Secondly, with respect to considering donating it to the Apache
> > > > > > > Foundation and Apache project in particular, I would say that we
> > > > > > > should be honored by such consideration. In this context, my
> > > > > > > immediate reaction is: how can we best support Ballista's
> > > > > > > community?
> > > > > > >
> > > > > > > My initial thoughts in this direction are:
> > > > > > >
> > > > > > > * create a new git repo for DataFusion and Ballista to reside on
> > > > > > >   (e.g. arrow/ballista)
> > > > > > > * do not require the release cycle and versioning to be aligned
> > > > > > >   with arrow's release cycle
> > > > > > > * do not require the usage of JIRA
> > > > > > > * pin the dependency of Datafusion on Arrow and parquet crate
> > > > > > >   (e.g. to a specific commit)
> > > > > > >
> > > > > > > I feel that this setup would keep Ballista under the Foundation
> > > > > > > and Apache Arrow's umbrella and aligned with its goals, while at
> > > > > > > the same time put the least amount of burden on its community,
> > > > > > > both in terms of keeping a strict release schedule, tooling and
> > > > > > > CI.
> > > > > > >
> > > > > > > The rationale for the above is that whenever something is
> > > > > > > released on DataFusion (which hosts most of the physical ops),
> > > > > > > people will also want it quickly available on Ballista. Thus,
> > > > > > > having the two release cycles more closely related and
> > > > > > > independent of the arrow implementation's cycle is good.
> > > > > > > DataFusion does not have integration tests against other arrow
> > > > > > > implementations, and thus the integration tests are not relevant.
> > > > > > >
> > > > > > > There are 4 main reasons I would not recommend placing it in the
> > > > > > > mono-repo:
> > > > > > >
> > > > > > > 1. It would not add much
> > > > > > > 2. It would place Ballista on the same release schedule and git
> > > > > > > system as the rest of Arrow's implementation, which may not suit
> > > > > > > Ballista's own development pace (in either direction)
> > > > > > > 3. It further increases the complexity of the current repo
> > > > > > > 4. It would force its community to use JIRA, merge process,
> > > > > > > components, etc, which may not be what its community wishes for
> > > > > > >
> > > > > > > The main risk I see is that because arrow's release cycle is slow
> > > > > > > and major releases only, DataFusion risks missing arrow features
> > > > > > > from time to time. We can mitigate this with cargo and pins to
> > > > > > > commit hashes. IMO this risk exists in any dependency
> > > > > > > relationship and is usually a sign that there is an API contract
> > > > > > > and thus a trust relationship involved, which is a good thing.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jorge
> > > > > > >
> > > > > > > On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <andygrov...@gmail.com> wrote:
> > > > > > >
> > > > > > > > As many of you know, the reason that I got involved in Arrow
> > > > > > > > back in 2018 was that I wanted to build a distributed compute
> > > > > > > > platform in Rust, with capabilities similar to Apache Spark.
> > > > > > > > This led to the creation of the DataFusion query engine, which
> > > > > > > > is an in-memory query engine and is now part of the Arrow repo.
> > > > > > > >
> > > > > > > > Over the past couple of years, I have been working outside of
> > > > > > > > Arrow on a project named “Ballista” [1] to continue the journey
> > > > > > > > of trying to build a distributed version. Due to the pandemic,
> > > > > > > > I have had time over the winter to put more effort into this
> > > > > > > > project and have managed to build a small community around it
> > > > > > > > over the past few months. The project has now reached a point
> > > > > > > > where the basic architecture has been proven, and it is getting
> > > > > > > > a lot of attention (more than 2k stars on GitHub just
> > > > > > > > recently), so I think that it would now make sense to donate
> > > > > > > > some or all of the project to Apache Arrow and continue its
> > > > > > > > growth here.
> > > > > > > >
> > > > > > > > For an overview of the project, please see the talk I recently
> > > > > > > > gave at the New York Open Statistical Programming Meetup [2].
> > > > > > > >
> > > > > > > > Some of the benefits that I see in donating the project to
> > > > > > > > Arrow are:
> > > > > > > >
> > > > > > > >
> > > > > > > >    - DataFusion also needs a scheduler and it would probably
> > > > > > > >      make sense to push some parts of the Ballista scheduler
> > > > > > > >      down a level in the stack so that the same approach is
> > > > > > > >      used to scale across cores in DataFusion and to scale
> > > > > > > >      across nodes in Ballista.
> > > > > > > >    - Ballista provides preliminary support for spill-to-disk
> > > > > > > >      functionality (in Arrow IPC format) which could also
> > > > > > > >      benefit DataFusion and provide better scalability through
> > > > > > > >      out-of-core processing.
> > > > > > > >    - Although the Ballista scheduler is implemented in Rust, it
> > > > > > > >      is possible to implement executors in other languages due
> > > > > > > >      to the use of Flight, gRPC, and protobuf, so this may be
> > > > > > > >      of interest to other language implementations of Arrow as
> > > > > > > >      well.
> > > > > > > >    - There is already some overlap between Arrow and Ballista
> > > > > > > >      contributors.
> > > > > > > >    - Ballista unit tests will be part of Arrow CI which means
> > > > > > > >      that any changes to Arrow or DataFusion APIs that Ballista
> > > > > > > >      depends on will also require that the corresponding
> > > > > > > >      Ballista code is updated as part of the same PR.
> > > > > > > >
> > > > > > > >
> > > > > > > > My main goal with this email thread is to gauge interest in
> > > > > > > > donating this code. If there is interest in doing so then we
> > > > > > > > can have a more detailed follow-up conversation on exactly what
> > > > > > > > would be donated and where it would go.
> > > > > > > >
> > > > > > > >
> > > > > > > > I have also filed a GitHub issue in Ballista to get feedback
> > > > > > > > from current contributors [3].
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm looking forward to hearing opinions on this!
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Andy.
> > > > > > > >
> > > > > > > > [1] https://github.com/ballista-compute/ballista
> > > > > > > >
> > > > > > > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> > > > > > > >
> > > > > > > > [3] https://github.com/ballista-compute/ballista/issues/646
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
