This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new 17c53400 Improve top-level README (#33)
17c53400 is described below
commit 17c534007236739a790e8ae0f83cb2989476a07e
Author: Andy Grove <[email protected]>
AuthorDate: Sat May 21 15:03:29 2022 -0600
Improve top-level README (#33)
---
README.md | 69 +++++++++++++++++++------------------------
ballista/docs/architecture.md | 16 ++++++++++
dev/release/README.md | 12 ++++----
3 files changed, 53 insertions(+), 44 deletions(-)
diff --git a/README.md b/README.md
index 28477d12..95c202a2 100644
--- a/README.md
+++ b/README.md
@@ -17,68 +17,61 @@
under the License.
-->
-_Please note that Ballista development is still happening in the
-[DataFusion repository](https://github.com/apache/arrow-datafusion) but we are in the
-process of migrating to this new repository._
-
# Ballista: Distributed Compute with Rust, Apache Arrow, and DataFusion
-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
+Ballista is a distributed SQL query engine primarily implemented in Rust, and powered by Apache Arrow and
DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
Java) to be supported as first-class citizens without paying a penalty for serialization costs.
The foundational technologies in Ballista are:
- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
+- [DataFusion Query Engine](https://github.com/apache/arrow-datafusion) for query execution
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
+- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans, with
+  [plans to eventually use substrait.io here](https://github.com/apache/arrow-ballista/issues/32).
+
+Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:
+
+- The choice of Rust as the main execution language avoids the overhead of GC pauses.
+- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
+  processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
+  largely row-based today.
+- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
+  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
+  distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
+  in any programming language with minimal serialization overhead.
Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
redundancy in the case of a scheduler failing.
-# Getting Started
-
-Refer to the core [Ballista crate README](ballista/rust/client/README.md) for the Getting Started guide.
+# Project Status and Roadmap
-## Distributed Scheduler Overview
+Ballista is currently a proof-of-concept and provides batch execution of SQL queries. Although it is already capable of
+executing complex queries, it is not yet scalable or robust.
-Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
-distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
+There is an excellent discussion in https://github.com/apache/arrow-ballista/issues/30 about the future of the project
+and we encourage you to participate and add your feedback there if you are interested in using or contributing to
+Ballista.
-Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
-of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
+The current initiatives being considered are:
-Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
-and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
-according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
+- Continue to improve the current batch-based execution
+- Add support for low-latency query execution based on a streaming model
+- Adopt [substrait.io](https://substrait.io/) to allow other query engines to be integrated
-The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
-tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
-interface, and streams the shuffle IPC files.
-
-# How does this compare to Apache Spark?
-
-Ballista implements a similar design to Apache Spark, but there are some key differences.
+# Getting Started
-- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of
-  GC pauses.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
-  processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
-  in any programming language with minimal serialization overhead.
+Refer to the core [Ballista crate README](ballista/rust/client/README.md) for the Getting Started guide.
## Architecture Overview
-There is no formal document describing Ballista's architecture yet, but the following presentation offers a good overview of its different components and how they interact together.
-
-- (February 2021): Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+- [Architecture Overview](ballista/docs/architecture.md)
+- [Ballista: Distributed Compute with Rust and Apache Arrow](https://www.youtube.com/watch?v=ZZHQaOap9pQ) talk at
+  the New York Open Statistical Programming Meetup (Feb 2021)
## Contribution Guide
diff --git a/ballista/docs/architecture.md b/ballista/docs/architecture.md
index 2868d52b..bdb45cf1 100644
--- a/ballista/docs/architecture.md
+++ b/ballista/docs/architecture.md
@@ -39,6 +39,22 @@ executors in the cluster. This is the basic unit of scalability in Ballista.
The following diagram shows the flow of requests and responses between the client, scheduler, and executor
processes.
+## Distributed Scheduler Overview
+
+Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
+distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
+
+Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
+of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
+
+Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
+and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
+according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
+
+The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
+tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
+interface, and streams the shuffle IPC files.
+
## Scheduler Process
diff --git a/dev/release/README.md b/dev/release/README.md
index ac07b4f2..500083f7 100644
--- a/dev/release/README.md
+++ b/dev/release/README.md
@@ -62,7 +62,7 @@ See instructions at https://infra.apache.org/release-signing.html#generate for g
Committers can add signing keys in Subversion client with their ASF account.
e.g.:
-``` bash
+```bash
$ svn co https://dist.apache.org/repos/dist/dev/arrow
$ cd arrow
$ editor KEYS
@@ -71,7 +71,7 @@ $ svn ci KEYS
Follow the instructions in the header of the KEYS file to append your key.
Here is an example:
-``` bash
+```bash
(gpg --list-sigs "John Doe" && gpg --armor --export "John Doe") >> KEYS
svn commit KEYS -m "Add key for John Doe"
```
@@ -136,7 +136,7 @@ git commit -a -m 'Update version'
Define release branch (e.g. `master`), base version tag (e.g. `7.0.0`) and
future version tag (e.g. `8.0.0`). Commits between the base version tag and the release branch will be used to
populate the changelog content.
-You will need a GitHub Personal Access Token for the following steps.  Follow
+You will need a GitHub Personal Access Token for the following steps. Follow
[these instructions](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token)
to generate one if you do not already have one.
@@ -148,8 +148,8 @@ CHANGELOG_GITHUB_TOKEN=<TOKEN> ./dev/release/update_change_log-ballista.sh maste
git commit -a -m 'Create changelog for release'
```
-_If you see the error `"You have exceeded a secondary rate limit"` when running this script, try reducing the CPU
-allocation to slow the process down and throttle the number of GitHub requests made per minute, by modifying the
+_If you see the error `"You have exceeded a secondary rate limit"` when running this script, try reducing the CPU
+allocation to slow the process down and throttle the number of GitHub requests made per minute, by modifying the
value of the `--cpus` argument in the `update_change_log.sh` script._
You can add `invalid` or `development-process` label to exclude items from
@@ -321,7 +321,7 @@ following commands. Crates need to be published in the correct order as shown in
_To update this diagram, manually edit the dependencies in [crate-deps.dot](crate-deps.dot) and then run:_
-``` bash
+```bash
dot -Tsvg dev/release/crate-deps.dot > dev/release/crate-deps.svg
```