This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new 17c53400 Improve top-level README (#33)
17c53400 is described below
commit 17c534007236739a790e8ae0f83cb2989476a07e
Author: Andy Grove <[email protected]>
AuthorDate: Sat May 21 15:03:29 2022 -0600
Improve top-level README (#33)
---
README.md | 69 +++++++++++++++++++------------------------
ballista/docs/architecture.md | 16 ++++++++++
dev/release/README.md | 12 ++++----
3 files changed, 53 insertions(+), 44 deletions(-)
diff --git a/README.md b/README.md
index 28477d12..95c202a2 100644
--- a/README.md
+++ b/README.md
@@ -17,68 +17,61 @@
under the License.
-->
-_Please note that Ballista development is still happening in the
-[DataFusion repository](https://github.com/apache/arrow-datafusion) but we are in the
-process of migrating to this new repository._
-
# Ballista: Distributed Compute with Rust, Apache Arrow, and DataFusion
-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
+Ballista is a distributed SQL query engine primarily implemented in Rust, and powered by Apache Arrow and
DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
Java) to be supported as first-class citizens without paying a penalty for serialization costs.
The foundational technologies in Ballista are:
- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
+- [DataFusion Query Engine](https://github.com/apache/arrow-datafusion) for query execution
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
+- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans, with
+  [plans to eventually use substrait.io here](https://github.com/apache/arrow-ballista/issues/32).
+
+Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:
+
+- The choice of Rust as the main execution language avoids the overhead of GC pauses.
+- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
+  processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
+  largely row-based today.
+- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
+  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
+  distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
+  in any programming language with minimal serialization overhead.
Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
redundancy in the case of a scheduler failing.
-# Getting Started
-
-Refer to the core [Ballista crate README](ballista/rust/client/README.md) for the Getting Started guide.
+# Project Status and Roadmap
-## Distributed Scheduler Overview
+Ballista is currently a proof-of-concept and provides batch execution of SQL queries. Although it is already capable of
+executing complex queries, it is not yet scalable or robust.
-Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
-distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
+There is an excellent discussion in https://github.com/apache/arrow-ballista/issues/30 about the future of the project
+and we encourage you to participate and add your feedback there if you are interested in using or contributing to
+Ballista.
-Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
-of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
+The current initiatives being considered are:
-Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
-and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
-according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
+- Continue to improve the current batch-based execution
+- Add support for low-latency query execution based on a streaming model
+- Adopt [substrait.io](https://substrait.io/) to allow other query engines to be integrated
-The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
-tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
-interface, and streams the shuffle IPC files.
-
-# How does this compare to Apache Spark?
-
-Ballista implements a similar design to Apache Spark, but there are some key differences.
+# Getting Started
-- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of
-  GC pauses.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
-  processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
-  in any programming language with minimal serialization overhead.
+Refer to the core [Ballista crate README](ballista/rust/client/README.md) for the Getting Started guide.
## Architecture Overview
-There is no formal document describing Ballista's architecture yet, but the following presentation offers a good overview of its different components and how they interact together.
-
-- (February 2021): Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+- [Architecture Overview](ballista/docs/architecture.md)
+- [Ballista: Distributed Compute with Rust and Apache Arrow](https://www.youtube.com/watch?v=ZZHQaOap9pQ) talk at
+  the New York Open Statistical Programming Meetup (Feb 2021)
## Contribution Guide
diff --git a/ballista/docs/architecture.md b/ballista/docs/architecture.md
index 2868d52b..bdb45cf1 100644
--- a/ballista/docs/architecture.md
+++ b/ballista/docs/architecture.md
@@ -39,6 +39,22 @@ executors in the cluster. This is the basic unit of scalability in Ballista.
The following diagram shows the flow of requests and responses between the client, scheduler, and executor
processes.
+## Distributed Scheduler Overview
+
+Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
+distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
+
+Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
+of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
+
+Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
+and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
+according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
+
+The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
+tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
+interface, and streams the shuffle IPC files.
+
## Scheduler Process
diff --git a/dev/release/README.md b/dev/release/README.md
index ac07b4f2..500083f7 100644
--- a/dev/release/README.md
+++ b/dev/release/README.md
@@ -62,7 +62,7 @@ See instructions at https://infra.apache.org/release-signing.html#generate for g
Committers can add signing keys in Subversion client with their ASF account.
e.g.:
-``` bash
+```bash
$ svn co https://dist.apache.org/repos/dist/dev/arrow
$ cd arrow
$ editor KEYS
@@ -71,7 +71,7 @@ $ svn ci KEYS
Follow the instructions in the header of the KEYS file to append your key.
Here is an example:
-``` bash
+```bash
(gpg --list-sigs "John Doe" && gpg --armor --export "John Doe") >> KEYS
svn commit KEYS -m "Add key for John Doe"
```
@@ -136,7 +136,7 @@ git commit -a -m 'Update version'
Define release branch (e.g. `master`), base version tag (e.g. `7.0.0`) and
future version tag (e.g. `8.0.0`). Commits between the base version tag and the release branch will be used to
populate the changelog content.
-You will need a GitHub Personal Access Token for the following steps.  Follow
+You will need a GitHub Personal Access Token for the following steps. Follow
[these instructions](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token)
to generate one if you do not already have one.
@@ -148,8 +148,8 @@ CHANGELOG_GITHUB_TOKEN=<TOKEN> ./dev/release/update_change_log-ballista.sh maste
git commit -a -m 'Create changelog for release'
```
-_If you see the error `"You have exceeded a secondary rate limit"` when running this script, try reducing the CPU
-allocation to slow the process down and throttle the number of GitHub requests made per minute, by modifying the
+_If you see the error `"You have exceeded a secondary rate limit"` when running this script, try reducing the CPU
+allocation to slow the process down and throttle the number of GitHub requests made per minute, by modifying the
value of the `--cpus` argument in the `update_change_log.sh` script._
You can add `invalid` or `development-process` label to exclude items from
@@ -321,7 +321,7 @@ following commands. Crates need to be published in the correct order as shown in
_To update this diagram, manually edit the dependencies in [crate-deps.dot](crate-deps.dot) and then run:_
-``` bash
+```bash
dot -Tsvg dev/release/crate-deps.dot > dev/release/crate-deps.svg
```