This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git


The following commit(s) were added to refs/heads/master by this push:
     new 57170d81 Replace README with Ballista version (#4)
57170d81 is described below

commit 57170d818b49506dcc5a4525b595bebf92f37bad
Author: Andy Grove <[email protected]>
AuthorDate: Thu May 19 12:35:26 2022 -0600

    Replace README with Ballista version (#4)
---
 README.md                      | 97 +++++++++++++++++++-----------------------
 ballista/README.md             | 71 -------------------------------
 ballista/rust/client/README.md |  2 +-
 3 files changed, 44 insertions(+), 126 deletions(-)

diff --git a/README.md b/README.md
index 58a511da..28477d12 100644
--- a/README.md
+++ b/README.md
@@ -17,79 +17,68 @@
   under the License.
 -->
 
-# DataFusion
+_Please note that Ballista development is still happening in the
+[DataFusion repository](https://github.com/apache/arrow-datafusion) but we are 
in the
+process of migrating to this new repository._
 
-<img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" 
width="256"/>
+# Ballista: Distributed Compute with Rust, Apache Arrow, and DataFusion
 
-DataFusion is an extensible query execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
-in-memory format.
+Ballista is a distributed compute platform primarily implemented in Rust, and 
powered by Apache Arrow and
+DataFusion. It is built on an architecture that allows other programming 
languages (such as Python, C++, and
+Java) to be supported as first-class citizens without paying a penalty for 
serialization costs.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+The foundational technologies in Ballista are:
 
-DataFusion also supports distributed query execution via the
-[Ballista](ballista/README.md) crate.
+- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels 
for efficient processing of data.
+- [Apache Arrow Flight 
Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) 
for efficient
+  data transfer between processes.
+- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) 
for serializing query plans.
+- [Docker](https://www.docker.com/) for packaging up executors along with 
user-defined code.
 
-## Use Cases
+Ballista can be deployed as a standalone cluster and also supports 
[Kubernetes](https://kubernetes.io/). In either
+case, the scheduler can be configured to use [etcd](https://etcd.io/) as a 
backing store to (eventually) provide
+redundancy in the case of a scheduler failing.
 
-DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+# Getting Started
 
-## Why DataFusion?
+Refer to the core [Ballista crate README](ballista/rust/client/README.md) for 
the Getting Started guide.
 
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion 
achieves very high performance
-- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet 
and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design, 
DataFusion can be tailored for your specific usecase
-- _High Quality_: Extensively tested, both by itself and with the rest of the 
Arrow ecosystem, DataFusion can be used as the foundation for production 
systems.
+## Distributed Scheduler Overview
 
-## Known Uses
+Ballista uses the DataFusion query execution framework to create a physical 
plan and then transforms it into a
+distributed physical plan by breaking the query down into stages whenever the 
partitioning scheme changes.
 
-Projects that adapt to or serve as plugins to DataFusion:
+Specifically, any `RepartitionExec` operator is replaced with an 
`UnresolvedShuffleExec` and the child operator
+of the repartition operator is wrapped in a `ShuffleWriterExec` operator and 
scheduled for execution.
 
-- [datafusion-python](https://github.com/datafusion-contrib/datafusion-python)
-- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
-- 
[datafusion-objectstore-s3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
-- 
[datafusion-objectstore-hdfs](https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)
-- 
[datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
-- 
[datafusion-objectstore-azure](https://github.com/datafusion-contrib/datafusion-objectstore-azure)
+Each executor polls the scheduler for the next task to run. Tasks are 
currently always `ShuffleWriterExec` operators
+and each task represents one _input_ partition that will be executed. The 
resulting batches are repartitioned
+according to the shuffle partitioning scheme and each _output_ partition is 
streamed to disk in Arrow IPC format.
 
-Here are some of the projects known to use DataFusion:
+The scheduler will replace `UnresolvedShuffleExec` operators with 
`ShuffleReaderExec` operators once all shuffle
+tasks have completed. The `ShuffleReaderExec` operator connects to other 
executors as required using the Flight
+interface, and streams the shuffle IPC files.
 
-- [Ballista](ballista) Distributed Compute Platform
-- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
-- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-- [delta-rs](https://github.com/delta-io/delta-rs)
-- [Flock](https://github.com/flock-lab/flock)
-- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series 
Database
-- [ROAPI](https://github.com/roapi/roapi)
-- [Tensorbase](https://github.com/tensorbase/tensorbase)
-- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the 
[Vega](https://vega.github.io/) visualization grammar
+# How does this compare to Apache Spark?
 
-(if you know of another project, please submit a PR to add a link!)
+Ballista implements a similar design to Apache Spark, but there are some key 
differences.
 
-## Example Usage
-
-Please see [example 
usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) to 
find how to use DataFusion.
-
-## Roadmap
-
-Please see [Roadmap](docs/source/specification/roadmap.md) for information of 
where the project is headed.
+- The choice of Rust as the main execution language means that memory usage is 
deterministic and avoids the overhead of
+  GC pauses.
+- Ballista is designed from the ground up to use columnar data, enabling a 
number of efficiencies such as vectorized
+  processing (SIMD and GPU) and efficient compression. Although Spark does 
have some columnar support, it is still
+  largely row-based today.
+- The combination of Rust and Arrow provides excellent memory efficiency and 
memory usage can be 5x - 10x lower than
+  Apache Spark in some cases, which means that more processing can fit on a 
single node, reducing the overhead of
+  distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that 
data can be exchanged between executors
+  in any programming language with minimal serialization overhead.
 
 ## Architecture Overview
 
-There is no formal document describing DataFusion's architecture yet, but the 
following presentations offer a good overview of its different components and 
how they interact together.
-
-- (March 2021): The DataFusion architecture is described in _Query Engine 
Design and the Rust-Based DataFusion in Apache Arrow_: 
[recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content 
starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) 
and 
[slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-- (February 2021): How DataFusion is used within the Ballista Project is 
described in \*Ballista: Distributed Compute with Rust and Apache Arrow: 
[recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-
-## User's guide
+There is no formal document describing Ballista's architecture yet, but the 
following presentation offers a good overview of its different components and 
how they interact together.
 
-Please see [User Guide](https://arrow.apache.org/datafusion/) for more 
information about DataFusion.
+- (February 2021): Ballista: Distributed Compute with Rust and Apache Arrow: 
[recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
 
 ## Contribution Guide
 
diff --git a/ballista/README.md b/ballista/README.md
deleted file mode 100644
index 1fc3bdbf..00000000
--- a/ballista/README.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!---
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
--->
-
-# Ballista: Distributed Compute with Apache Arrow and DataFusion
-
-Ballista is a distributed compute platform primarily implemented in Rust, and 
powered by Apache Arrow and
-DataFusion. It is built on an architecture that allows other programming 
languages (such as Python, C++, and
-Java) to be supported as first-class citizens without paying a penalty for 
serialization costs.
-
-The foundational technologies in Ballista are:
-
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels 
for efficient processing of data.
-- [Apache Arrow Flight 
Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) 
for efficient
-  data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) 
for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with 
user-defined code.
-
-Ballista can be deployed as a standalone cluster and also supports 
[Kubernetes](https://kubernetes.io/). In either
-case, the scheduler can be configured to use [etcd](https://etcd.io/) as a 
backing store to (eventually) provide
-redundancy in the case of a scheduler failing.
-
-# Getting Started
-
-Refer to the core [Ballista crate README](rust/client/README.md) for the 
Getting Started guide.
-
-## Distributed Scheduler Overview
-
-Ballista uses the DataFusion query execution framework to create a physical 
plan and then transforms it into a
-distributed physical plan by breaking the query down into stages whenever the 
partitioning scheme changes.
-
-Specifically, any `RepartitionExec` operator is replaced with an 
`UnresolvedShuffleExec` and the child operator
-of the repartition operator is wrapped in a `ShuffleWriterExec` operator and 
scheduled for execution.
-
-Each executor polls the scheduler for the next task to run. Tasks are 
currently always `ShuffleWriterExec` operators
-and each task represents one _input_ partition that will be executed. The 
resulting batches are repartitioned
-according to the shuffle partitioning scheme and each _output_ partition is 
streamed to disk in Arrow IPC format.
-
-The scheduler will replace `UnresolvedShuffleExec` operators with 
`ShuffleReaderExec` operators once all shuffle
-tasks have completed. The `ShuffleReaderExec` operator connects to other 
executors as required using the Flight
-interface, and streams the shuffle IPC files.
-
-# How does this compare to Apache Spark?
-
-Ballista implements a similar design to Apache Spark, but there are some key 
differences.
-
-- The choice of Rust as the main execution language means that memory usage is 
deterministic and avoids the overhead of
-  GC pauses.
-- Ballista is designed from the ground up to use columnar data, enabling a 
number of efficiencies such as vectorized
-  processing (SIMD and GPU) and efficient compression. Although Spark does 
have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and 
memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a 
single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that 
data can be exchanged between executors
-  in any programming language with minimal serialization overhead.
diff --git a/ballista/rust/client/README.md b/ballista/rust/client/README.md
index ecf364f4..f5fe094d 100644
--- a/ballista/rust/client/README.md
+++ b/ballista/rust/client/README.md
@@ -35,7 +35,7 @@ Ballista can be deployed as a standalone cluster and also 
supports [Kubernetes](
 case, the scheduler can be configured to use [etcd](https://etcd.io/) as a 
backing store to (eventually) provide
 redundancy in the case of a scheduler failing.
 
-## Rust Version Compatbility
+## Rust Version Compatibility
 
 This crate is tested with the latest stable version of Rust. We do not 
currrently test against other, older versions of the Rust compiler.
 

Reply via email to