This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new 1a65eb9a MINOR: Improve developer docs (#41)
1a65eb9a is described below
commit 1a65eb9a5b883bfd4ce485da810e601dc218d462
Author: Andy Grove <[email protected]>
AuthorDate: Mon May 30 09:35:22 2022 -0600
MINOR: Improve developer docs (#41)
---
CONTRIBUTING.md | 2 +-
README.md | 6 +-
ballista/CHANGELOG.md | 74 ++++++++++-----------
docs/README.md | 8 ++-
{ballista/docs => docs/developer}/README.md | 4 +-
{ballista/docs => docs/developer}/architecture.md | 45 +++++--------
{ballista/docs => docs/developer}/dev-env.md | 0
.../developer}/images/query-execution.png | Bin
.../docs => docs/developer}/integration-testing.md | 0
examples/README.md | 9 ++-
10 files changed, 71 insertions(+), 77 deletions(-)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 94ad6605..a94c2621 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -263,5 +263,5 @@ $ prettier --version
After you've confirmed your prettier version, you can format all the `.md`
files:
```bash
-prettier -w {ballista,datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md
+prettier -w README.md {ballista,dev,docs,examples}/**/*.md
```
diff --git a/README.md b/README.md
index 88b438b5..f2a73c13 100644
--- a/README.md
+++ b/README.md
@@ -70,9 +70,9 @@ that, refer to the [Getting Started Guide](ballista/rust/client/README.md).
## Architecture Overview
-- [Architecture Overview](ballista/docs/architecture.md)
-- [Ballista: Distributed Compute with Rust and Apache Arrow](https://www.youtube.com/watch?v=ZZHQaOap9pQ) talk at
- the New York Open Statistical Programming Meetup (Feb 2021)
+- Refer to the [developer documentation](docs/developer) for the [Architecture Overview](/docs/developer/architecture.md)
+- Watch the [Ballista: Distributed Compute with Rust and Apache Arrow](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+ talk from the New York Open Statistical Programming Meetup (Feb 2021)
## Contribution Guide
diff --git a/ballista/CHANGELOG.md b/ballista/CHANGELOG.md
index 07ce062a..2bdd1d6a 100644
--- a/ballista/CHANGELOG.md
+++ b/ballista/CHANGELOG.md
@@ -42,13 +42,13 @@
- Add `CREATE VIEW`
[\#2279](https://github.com/apache/arrow-datafusion/pull/2279)
([matthewmturner](https://github.com/matthewmturner))
- \[Ballista\] Support Union in ballista.
[\#2098](https://github.com/apache/arrow-datafusion/pull/2098)
([Ted-Jiang](https://github.com/Ted-Jiang))
-- Add missing aggr\_expr to PhysicalExprNode for Ballista.
[\#1989](https://github.com/apache/arrow-datafusion/pull/1989)
([Ted-Jiang](https://github.com/Ted-Jiang))
+- Add missing aggr_expr to PhysicalExprNode for Ballista.
[\#1989](https://github.com/apache/arrow-datafusion/pull/1989)
([Ted-Jiang](https://github.com/Ted-Jiang))
**Fixed bugs:**
- Ballista integration tests no longer work
[\#2440](https://github.com/apache/arrow-datafusion/issues/2440)
- Ballista crates cannot be released from DafaFusion 7.0.0 source release
[\#1980](https://github.com/apache/arrow-datafusion/issues/1980)
-- protobuf OctetLength should be deserialized as octet\_length, not length
[\#1834](https://github.com/apache/arrow-datafusion/pull/1834)
([carols10cents](https://github.com/carols10cents))
+- protobuf OctetLength should be deserialized as octet_length, not length
[\#1834](https://github.com/apache/arrow-datafusion/pull/1834)
([carols10cents](https://github.com/carols10cents))
**Documentation updates:**
@@ -80,7 +80,7 @@
- Limit cpu cores used when generating changelog
[\#2494](https://github.com/apache/arrow-datafusion/pull/2494)
([andygrove](https://github.com/andygrove))
- MINOR: Parameterize changelog script
[\#2484](https://github.com/apache/arrow-datafusion/pull/2484)
([jychen7](https://github.com/jychen7))
- Fix stage key extraction
[\#2472](https://github.com/apache/arrow-datafusion/pull/2472)
([thinkharderdev](https://github.com/thinkharderdev))
-- Add support for list\_dir\(\) on local fs
[\#2467](https://github.com/apache/arrow-datafusion/pull/2467)
([wjones127](https://github.com/wjones127))
+- Add support for list_dir\(\) on local fs
[\#2467](https://github.com/apache/arrow-datafusion/pull/2467)
([wjones127](https://github.com/wjones127))
- minor: update versions and paths in changelog scripts
[\#2429](https://github.com/apache/arrow-datafusion/pull/2429)
([andygrove](https://github.com/andygrove))
- Fix Ballista executing during plan
[\#2428](https://github.com/apache/arrow-datafusion/pull/2428)
([tustvold](https://github.com/tustvold))
- Re-organize and rename aggregates physical plan
[\#2388](https://github.com/apache/arrow-datafusion/pull/2388)
([yjshen](https://github.com/yjshen))
@@ -88,7 +88,7 @@
- Grouped Aggregate in row format
[\#2375](https://github.com/apache/arrow-datafusion/pull/2375)
([yjshen](https://github.com/yjshen))
- Stop optimizing queries twice
[\#2369](https://github.com/apache/arrow-datafusion/pull/2369)
([andygrove](https://github.com/andygrove))
- Bump follow-redirects from 1.13.2 to 1.14.9 in /ballista/ui/scheduler
[\#2325](https://github.com/apache/arrow-datafusion/pull/2325)
([dependabot[bot]](https://github.com/apps/dependabot))
-- Move FileType enum from sql module to logical\_plan module
[\#2290](https://github.com/apache/arrow-datafusion/pull/2290)
([andygrove](https://github.com/andygrove))
+- Move FileType enum from sql module to logical_plan module
[\#2290](https://github.com/apache/arrow-datafusion/pull/2290)
([andygrove](https://github.com/andygrove))
- Add BatchPartitioner \(\#2285\)
[\#2287](https://github.com/apache/arrow-datafusion/pull/2287)
([tustvold](https://github.com/tustvold))
- Update uuid requirement from 0.8 to 1.0
[\#2280](https://github.com/apache/arrow-datafusion/pull/2280)
([dependabot[bot]](https://github.com/apps/dependabot))
- Bump async from 2.6.3 to 2.6.4 in /ballista/ui/scheduler
[\#2277](https://github.com/apache/arrow-datafusion/pull/2277)
([dependabot[bot]](https://github.com/apps/dependabot))
@@ -98,11 +98,11 @@
- Update to Arrow 12.0.0, update tonic and prost
[\#2253](https://github.com/apache/arrow-datafusion/pull/2253)
([alamb](https://github.com/alamb))
- Add ExecutorMetricsCollector interface
[\#2234](https://github.com/apache/arrow-datafusion/pull/2234)
([thinkharderdev](https://github.com/thinkharderdev))
- minor: add editor config file
[\#2224](https://github.com/apache/arrow-datafusion/pull/2224)
([jackwener](https://github.com/jackwener))
-- \[Ballista\] Enable ApproxPercentileWithWeight in Ballista and fill UT
[\#2192](https://github.com/apache/arrow-datafusion/pull/2192)
([Ted-Jiang](https://github.com/Ted-Jiang))
+- \[Ballista\] Enable ApproxPercentileWithWeight in Ballista and fill UT
[\#2192](https://github.com/apache/arrow-datafusion/pull/2192)
([Ted-Jiang](https://github.com/Ted-Jiang))
- make nightly clippy happy
[\#2186](https://github.com/apache/arrow-datafusion/pull/2186)
([xudong963](https://github.com/xudong963))
- \[Ballista\]Make PhysicalAggregateExprNode has repeated PhysicalExprNode
[\#2184](https://github.com/apache/arrow-datafusion/pull/2184)
([Ted-Jiang](https://github.com/Ted-Jiang))
- Add LogicalPlan::SubqueryAlias
[\#2172](https://github.com/apache/arrow-datafusion/pull/2172)
([andygrove](https://github.com/andygrove))
-- Implement fast path of with\_new\_children\(\) in ExecutionPlan
[\#2168](https://github.com/apache/arrow-datafusion/pull/2168)
([mingmwang](https://github.com/mingmwang))
+- Implement fast path of with_new_children\(\) in ExecutionPlan
[\#2168](https://github.com/apache/arrow-datafusion/pull/2168)
([mingmwang](https://github.com/mingmwang))
- \[MINOR\] ignore suspicious slow test in Ballista
[\#2167](https://github.com/apache/arrow-datafusion/pull/2167)
([Ted-Jiang](https://github.com/Ted-Jiang))
- enable explain for ballista
[\#2163](https://github.com/apache/arrow-datafusion/pull/2163)
([doki23](https://github.com/doki23))
- Add delimiter for create external table
[\#2162](https://github.com/apache/arrow-datafusion/pull/2162)
([matthewmturner](https://github.com/matthewmturner))
@@ -118,7 +118,7 @@
- Refactor SessionContext, BallistaContext to support multi-tenancy
configurations - Part 3
[\#2091](https://github.com/apache/arrow-datafusion/pull/2091)
([mingmwang](https://github.com/mingmwang))
- Remove dependency of common for the storage crate
[\#2076](https://github.com/apache/arrow-datafusion/pull/2076)
([yahoNanJing](https://github.com/yahoNanJing))
- [MINOR] fix doc in `EXTRACT\(field FROM source\)
[\#2074](https://github.com/apache/arrow-datafusion/pull/2074)
([Ted-Jiang](https://github.com/Ted-Jiang))
-- \[Bug\]\[Datafusion\] fix TaskContext session\_config bug
[\#2070](https://github.com/apache/arrow-datafusion/pull/2070)
([gaojun2048](https://github.com/gaojun2048))
+- \[Bug\]\[Datafusion\] fix TaskContext session_config bug
[\#2070](https://github.com/apache/arrow-datafusion/pull/2070)
([gaojun2048](https://github.com/gaojun2048))
- Short-circuit evaluation for `CaseWhen`
[\#2068](https://github.com/apache/arrow-datafusion/pull/2068)
([yjshen](https://github.com/yjshen))
- split datafusion-object-store module
[\#2065](https://github.com/apache/arrow-datafusion/pull/2065)
([yahoNanJing](https://github.com/yahoNanJing))
- Change log level for noisy logs
[\#2060](https://github.com/apache/arrow-datafusion/pull/2060)
([thinkharderdev](https://github.com/thinkharderdev))
@@ -153,17 +153,17 @@
- Remove uneeded Mutex in Ballista Client
[\#1898](https://github.com/apache/arrow-datafusion/pull/1898)
([alamb](https://github.com/alamb))
- Create a `datafusion-proto` crate for datafusion protobuf serialization
[\#1887](https://github.com/apache/arrow-datafusion/pull/1887)
([carols10cents](https://github.com/carols10cents))
- Fix clippy lints
[\#1885](https://github.com/apache/arrow-datafusion/pull/1885)
([HaoYang670](https://github.com/HaoYang670))
-- Separate cpu-bound \(query-execution\) and IO-bound\(heartbeat\) to …
[\#1883](https://github.com/apache/arrow-datafusion/pull/1883)
([Ted-Jiang](https://github.com/Ted-Jiang))
+- Separate cpu-bound \(query-execution\) and IO-bound\(heartbeat\) to …
[\#1883](https://github.com/apache/arrow-datafusion/pull/1883)
([Ted-Jiang](https://github.com/Ted-Jiang))
- \[Minor\] Clean up DecimalArray API Usage
[\#1869](https://github.com/apache/arrow-datafusion/pull/1869)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([alamb](https://github.com/alamb))
- Changes after went through "Datafusion as a library section"
[\#1868](https://github.com/apache/arrow-datafusion/pull/1868)
([nonontb](https://github.com/nonontb))
- Remove allow unused imports from ballista-core, then fix all warnings
[\#1853](https://github.com/apache/arrow-datafusion/pull/1853)
([carols10cents](https://github.com/carols10cents))
- Update to arrow 9.1.0
[\#1851](https://github.com/apache/arrow-datafusion/pull/1851)
([alamb](https://github.com/alamb))
- move some tests out of context and into sql
[\#1846](https://github.com/apache/arrow-datafusion/pull/1846)
([alamb](https://github.com/alamb))
-- Fix compiling ballista in standalone mode, add build to CI
[\#1839](https://github.com/apache/arrow-datafusion/pull/1839)
([alamb](https://github.com/alamb))
+- Fix compiling ballista in standalone mode, add build to CI
[\#1839](https://github.com/apache/arrow-datafusion/pull/1839)
([alamb](https://github.com/alamb))
- Update documentation example for change in API
[\#1812](https://github.com/apache/arrow-datafusion/pull/1812)
([alamb](https://github.com/alamb))
- Refactor scheduler state with different management policy for volatile and
stable states [\#1810](https://github.com/apache/arrow-datafusion/pull/1810)
([yahoNanJing](https://github.com/yahoNanJing))
- DataFusion + Conbench Integration
[\#1791](https://github.com/apache/arrow-datafusion/pull/1791)
([dianaclarke](https://github.com/dianaclarke))
-- Enable periodic cleanup of work\_dir directories in ballista executor
[\#1783](https://github.com/apache/arrow-datafusion/pull/1783)
([Ted-Jiang](https://github.com/Ted-Jiang))
+- Enable periodic cleanup of work_dir directories in ballista executor
[\#1783](https://github.com/apache/arrow-datafusion/pull/1783)
([Ted-Jiang](https://github.com/Ted-Jiang))
- Use`eq_dyn`, `neq_dyn`, `lt_dyn`, `lt_eq_dyn`, `gt_dyn`, `gt_eq_dyn` kernels
from arrow [\#1475](https://github.com/apache/arrow-datafusion/pull/1475)
([alamb](https://github.com/alamb))
## [7.1.0-rc1](https://github.com/apache/arrow-datafusion/tree/7.1.0-rc1) (2022-04-10)
@@ -181,7 +181,7 @@
**Closed issues:**
- Optimize memory usage pattern to avoid "double memory" behavior
[\#2149](https://github.com/apache/arrow-datafusion/issues/2149)
-- Document approx\_percentile\_cont\_with\_weight in users guide
[\#2078](https://github.com/apache/arrow-datafusion/issues/2078)
+- Document approx_percentile_cont_with_weight in users guide
[\#2078](https://github.com/apache/arrow-datafusion/issues/2078)
- \[follow up\]cleaning up statements.remove\(0\)
[\#1986](https://github.com/apache/arrow-datafusion/issues/1986)
- Formatting error on documentation for Python
[\#1873](https://github.com/apache/arrow-datafusion/issues/1873)
- Remove duplicate tests from `test_const_evaluator_scalar_functions`
[\#1727](https://github.com/apache/arrow-datafusion/issues/1727)
@@ -208,17 +208,17 @@
- Add `corr` aggregate function
[\#1561](https://github.com/apache/arrow-datafusion/pull/1561)
([realno](https://github.com/realno))
- Add `covar`, `covar_pop` and `covar_samp` aggregate functions
[\#1551](https://github.com/apache/arrow-datafusion/pull/1551)
([realno](https://github.com/realno))
- Add `approx_quantile()` aggregation function
[\#1539](https://github.com/apache/arrow-datafusion/pull/1539)
([domodwyer](https://github.com/domodwyer))
-- Initial MemoryManager and DiskManager APIs for query execution + External
Sort implementation
[\#1526](https://github.com/apache/arrow-datafusion/pull/1526)
([yjshen](https://github.com/yjshen))
+- Initial MemoryManager and DiskManager APIs for query execution + External
Sort implementation
[\#1526](https://github.com/apache/arrow-datafusion/pull/1526)
([yjshen](https://github.com/yjshen))
- Add `stddev` and `variance`
[\#1525](https://github.com/apache/arrow-datafusion/pull/1525)
([realno](https://github.com/realno))
- Add `rem` operation for Expr
[\#1467](https://github.com/apache/arrow-datafusion/pull/1467)
([liukun4515](https://github.com/liukun4515))
- Implement `array_agg` aggregate function
[\#1300](https://github.com/apache/arrow-datafusion/pull/1300)
([viirya](https://github.com/viirya))
**Fixed bugs:**
-- Ballista context::tests::test\_standalone\_mode test fails
[\#1020](https://github.com/apache/arrow-datafusion/issues/1020)
+- Ballista context::tests::test_standalone_mode test fails
[\#1020](https://github.com/apache/arrow-datafusion/issues/1020)
- \[Ballista\] Fix scheduler state mod bug
[\#1655](https://github.com/apache/arrow-datafusion/pull/1655)
([gaojun2048](https://github.com/gaojun2048))
- Pass local address host so we do not get mismatch between IPv4 and IP…
[\#1466](https://github.com/apache/arrow-datafusion/pull/1466)
([thinkharderdev](https://github.com/thinkharderdev))
-- Add Timezone to Scalar::Time\* types, and better timezone awareness to
Datafusion's time types
[\#1455](https://github.com/apache/arrow-datafusion/pull/1455)
([maxburke](https://github.com/maxburke))
+- Add Timezone to Scalar::Time\* types, and better timezone awareness to
Datafusion's time types
[\#1455](https://github.com/apache/arrow-datafusion/pull/1455)
([maxburke](https://github.com/maxburke))
**Documentation updates:**
@@ -235,7 +235,7 @@
- Track memory usage in Non Limited Operators
[\#1569](https://github.com/apache/arrow-datafusion/issues/1569)
- \[Question\] Why does ballista store tables in the client instead of in the
SchedulerServer [\#1473](https://github.com/apache/arrow-datafusion/issues/1473)
- Why use the expr types before coercion to get the result type?
[\#1358](https://github.com/apache/arrow-datafusion/issues/1358)
-- A problem about the projection\_push\_down optimizer gathers valid columns
[\#1312](https://github.com/apache/arrow-datafusion/issues/1312)
+- A problem about the projection_push_down optimizer gathers valid columns
[\#1312](https://github.com/apache/arrow-datafusion/issues/1312)
- apply constant folding to `LogicalPlan::Values`
[\#1170](https://github.com/apache/arrow-datafusion/issues/1170)
- reduce usage of `IntoIterator<Item = Expr>` in logical plan builder window
fn [\#372](https://github.com/apache/arrow-datafusion/issues/372)
@@ -248,7 +248,7 @@
- Update to sqlparser 0.14
[\#1796](https://github.com/apache/arrow-datafusion/pull/1796)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([alamb](https://github.com/alamb))
- Update datafusion versions
[\#1793](https://github.com/apache/arrow-datafusion/pull/1793)
([matthewmturner](https://github.com/matthewmturner))
- Update datafusion to use arrow 9.0.0
[\#1775](https://github.com/apache/arrow-datafusion/pull/1775)
([alamb](https://github.com/alamb))
-- Update parking\_lot requirement from 0.11 to 0.12
[\#1735](https://github.com/apache/arrow-datafusion/pull/1735)
([dependabot[bot]](https://github.com/apps/dependabot))
+- Update parking_lot requirement from 0.11 to 0.12
[\#1735](https://github.com/apache/arrow-datafusion/pull/1735)
([dependabot[bot]](https://github.com/apps/dependabot))
- substitute `parking_lot::Mutex` for `std::sync::Mutex`
[\#1720](https://github.com/apache/arrow-datafusion/pull/1720)
([xudong963](https://github.com/xudong963))
- Create ListingTableConfig which includes file format and schema inference
[\#1715](https://github.com/apache/arrow-datafusion/pull/1715)
([matthewmturner](https://github.com/matthewmturner))
- Support `create_physical_expr` and `ExecutionContextState` or
`DefaultPhysicalPlanner` for faster speed
[\#1700](https://github.com/apache/arrow-datafusion/pull/1700)
([alamb](https://github.com/alamb))
@@ -274,7 +274,7 @@
- add rfcs for datafusion
[\#1490](https://github.com/apache/arrow-datafusion/pull/1490)
([xudong963](https://github.com/xudong963))
- support comparison for decimal data type and refactor the binary coercion
rule [\#1483](https://github.com/apache/arrow-datafusion/pull/1483)
([liukun4515](https://github.com/liukun4515))
- Update arrow-rs to 6.4.0 and replace boolean comparison in datafusion with
arrow compute kernel
[\#1446](https://github.com/apache/arrow-datafusion/pull/1446)
([xudong963](https://github.com/xudong963))
-- support cast/try\_cast for decimal: signed numeric to decimal
[\#1442](https://github.com/apache/arrow-datafusion/pull/1442)
([liukun4515](https://github.com/liukun4515))
+- support cast/try_cast for decimal: signed numeric to decimal
[\#1442](https://github.com/apache/arrow-datafusion/pull/1442)
([liukun4515](https://github.com/liukun4515))
- use 0.13 sql parser
[\#1435](https://github.com/apache/arrow-datafusion/pull/1435)
([Jimexist](https://github.com/Jimexist))
- Clarify communication on bi-weekly sync
[\#1427](https://github.com/apache/arrow-datafusion/pull/1427)
([alamb](https://github.com/alamb))
- Minimize features
[\#1399](https://github.com/apache/arrow-datafusion/pull/1399)
([carols10cents](https://github.com/carols10cents))
@@ -301,7 +301,6 @@
[Full Changelog](https://github.com/apache/arrow-datafusion/compare/ballista-0.6.0...6.0.0)
-
## [ballista-0.6.0](https://github.com/apache/arrow-datafusion/tree/ballista-0.6.0) (2021-11-13)
[Full Changelog](https://github.com/apache/arrow-datafusion/compare/ballista-0.5.0...ballista-0.6.0)
@@ -310,14 +309,14 @@
- File partitioning for ListingTable
[\#1141](https://github.com/apache/arrow-datafusion/pull/1141)
([rdettai](https://github.com/rdettai))
- Register tables in BallistaContext using TableProviders instead of Dataframe
[\#1028](https://github.com/apache/arrow-datafusion/pull/1028)
([rdettai](https://github.com/rdettai))
-- Make TableProvider.scan\(\) and PhysicalPlanner::create\_physical\_plan\(\)
async [\#1013](https://github.com/apache/arrow-datafusion/pull/1013)
([rdettai](https://github.com/rdettai))
+- Make TableProvider.scan\(\) and PhysicalPlanner::create_physical_plan\(\)
async [\#1013](https://github.com/apache/arrow-datafusion/pull/1013)
([rdettai](https://github.com/rdettai))
- Reorganize table providers by table format
[\#1010](https://github.com/apache/arrow-datafusion/pull/1010)
([rdettai](https://github.com/rdettai))
- Move CBOs and Statistics to physical plan
[\#965](https://github.com/apache/arrow-datafusion/pull/965)
([rdettai](https://github.com/rdettai))
- Update to sqlparser v 0.10.0
[\#934](https://github.com/apache/arrow-datafusion/pull/934)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([alamb](https://github.com/alamb))
- FilePartition and PartitionedFile for scanning flexibility
[\#932](https://github.com/apache/arrow-datafusion/pull/932)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([yjshen](https://github.com/yjshen))
- Improve SQLMetric APIs, port existing metrics
[\#908](https://github.com/apache/arrow-datafusion/pull/908)
([alamb](https://github.com/alamb))
- Add support for EXPLAIN ANALYZE
[\#858](https://github.com/apache/arrow-datafusion/pull/858)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([alamb](https://github.com/alamb))
-- Rename concurrency to target\_partitions
[\#706](https://github.com/apache/arrow-datafusion/pull/706)
([andygrove](https://github.com/andygrove))
+- Rename concurrency to target_partitions
[\#706](https://github.com/apache/arrow-datafusion/pull/706)
([andygrove](https://github.com/andygrove))
**Implemented enhancements:**
@@ -334,16 +333,16 @@
- add digest\(utf8, method\) function and refactor all current hash digest
functions [\#1090](https://github.com/apache/arrow-datafusion/pull/1090)
([Jimexist](https://github.com/Jimexist))
- \[crypto\] add `blake3` algorithm to `digest` function
[\#1086](https://github.com/apache/arrow-datafusion/pull/1086)
([Jimexist](https://github.com/Jimexist))
- \[crypto\] add blake2b and blake2s functions
[\#1081](https://github.com/apache/arrow-datafusion/pull/1081)
([Jimexist](https://github.com/Jimexist))
-- Update sqlparser-rs to 0.11
[\#1052](https://github.com/apache/arrow-datafusion/pull/1052)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([alamb](https://github.com/alamb))
+- Update sqlparser-rs to 0.11
[\#1052](https://github.com/apache/arrow-datafusion/pull/1052)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([alamb](https://github.com/alamb))
- remove hard coded partition count in ballista logicalplan deserialization
[\#1044](https://github.com/apache/arrow-datafusion/pull/1044)
([xudong963](https://github.com/xudong963))
- Indexed field access for List
[\#1006](https://github.com/apache/arrow-datafusion/pull/1006)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([Igosuki](https://github.com/Igosuki))
- Update DataFusion to arrow 6.0
[\#984](https://github.com/apache/arrow-datafusion/pull/984)
([alamb](https://github.com/alamb))
- Implement Display for Expr, improve operator display
[\#971](https://github.com/apache/arrow-datafusion/pull/971)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([matthewmturner](https://github.com/matthewmturner))
- ObjectStore API to read from remote storage systems
[\#950](https://github.com/apache/arrow-datafusion/pull/950)
([yjshen](https://github.com/yjshen))
-- fixes \#933 replace placeholder fmt\_as fr ExecutionPlan impls
[\#939](https://github.com/apache/arrow-datafusion/pull/939)
([tiphaineruy](https://github.com/tiphaineruy))
+- fixes \#933 replace placeholder fmt_as fr ExecutionPlan impls
[\#939](https://github.com/apache/arrow-datafusion/pull/939)
([tiphaineruy](https://github.com/tiphaineruy))
- Support `NotLike` in Ballista
[\#916](https://github.com/apache/arrow-datafusion/pull/916)
([Dandandan](https://github.com/Dandandan))
- Avro Table Provider
[\#910](https://github.com/apache/arrow-datafusion/pull/910)
[[sql](https://github.com/apache/arrow-datafusion/labels/sql)]
([Igosuki](https://github.com/Igosuki))
-- Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`,
rename output\_time -\> elapsed\_compute
[\#909](https://github.com/apache/arrow-datafusion/pull/909)
([alamb](https://github.com/alamb))
+- Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`,
rename output_time -\> elapsed_compute
[\#909](https://github.com/apache/arrow-datafusion/pull/909)
([alamb](https://github.com/alamb))
- \[Ballista\] Add executor last seen info to the ui
[\#895](https://github.com/apache/arrow-datafusion/pull/895)
([msathis](https://github.com/msathis))
- add cross join support to ballista
[\#891](https://github.com/apache/arrow-datafusion/pull/891)
([houqp](https://github.com/houqp))
- Add Ballista support to DataFusion CLI
[\#889](https://github.com/apache/arrow-datafusion/pull/889)
([andygrove](https://github.com/andygrove))
@@ -351,7 +350,7 @@
**Fixed bugs:**
-- Test execution\_plans::shuffle\_writer::tests::test Fail
[\#1040](https://github.com/apache/arrow-datafusion/issues/1040)
+- Test execution_plans::shuffle_writer::tests::test Fail
[\#1040](https://github.com/apache/arrow-datafusion/issues/1040)
- Integration test fails to build docker images
[\#918](https://github.com/apache/arrow-datafusion/issues/918)
- Ballista: Remove hard-coded concurrency from logical plan serde code
[\#708](https://github.com/apache/arrow-datafusion/issues/708)
- How can I make ballista distributed compute work?
[\#327](https://github.com/apache/arrow-datafusion/issues/327)
@@ -364,8 +363,8 @@
- Adds note on compatible rust version
[\#1097](https://github.com/apache/arrow-datafusion/pull/1097)
([1nF0rmed](https://github.com/1nF0rmed))
- implement `approx_distinct` function using HyperLogLog
[\#1087](https://github.com/apache/arrow-datafusion/pull/1087)
([Jimexist](https://github.com/Jimexist))
- Improve User Guide
[\#954](https://github.com/apache/arrow-datafusion/pull/954)
([andygrove](https://github.com/andygrove))
-- Update plan\_query\_stages doc
[\#951](https://github.com/apache/arrow-datafusion/pull/951)
([rdettai](https://github.com/rdettai))
-- \[DataFusion\] - Add show and show\_limit function for DataFrame
[\#923](https://github.com/apache/arrow-datafusion/pull/923)
([francis-du](https://github.com/francis-du))
+- Update plan_query_stages doc
[\#951](https://github.com/apache/arrow-datafusion/pull/951)
([rdettai](https://github.com/rdettai))
+- \[DataFusion\] - Add show and show_limit function for DataFrame
[\#923](https://github.com/apache/arrow-datafusion/pull/923)
([francis-du](https://github.com/francis-du))
- update docs related to protoc and optional syntax
[\#902](https://github.com/apache/arrow-datafusion/pull/902)
([Jimexist](https://github.com/Jimexist))
- Improve Ballista crate README content
[\#878](https://github.com/apache/arrow-datafusion/pull/878)
([andygrove](https://github.com/andygrove))
@@ -377,17 +376,16 @@
- InList expr with NULL literals do not work
[\#1190](https://github.com/apache/arrow-datafusion/issues/1190)
- update the homepage README to include values, `approx_distinct`, etc.
[\#1171](https://github.com/apache/arrow-datafusion/issues/1171)
-- \[Python\]: Inconsistencies with Python package name
[\#1011](https://github.com/apache/arrow-datafusion/issues/1011)
+- \[Python\]: Inconsistencies with Python package name
[\#1011](https://github.com/apache/arrow-datafusion/issues/1011)
- Wanting to contribute to project where to start?
[\#983](https://github.com/apache/arrow-datafusion/issues/983)
- delete redundant code
[\#973](https://github.com/apache/arrow-datafusion/issues/973)
-- How to build DataFusion python wheel
[\#853](https://github.com/apache/arrow-datafusion/issues/853)
+- How to build DataFusion python wheel
[\#853](https://github.com/apache/arrow-datafusion/issues/853)
- Produce a design for a metrics framework
[\#21](https://github.com/apache/arrow-datafusion/issues/21)
**Merged pull requests:**
- \[nit\] simplify ballista executor `CollectExec` impl codes
[\#1140](https://github.com/apache/arrow-datafusion/pull/1140)
([panarch](https://github.com/panarch))
-
For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/arrow/blob/master/CHANGELOG.md)
## [ballista-0.5.0](https://github.com/apache/arrow-datafusion/tree/ballista-0.5.0) (2021-08-10)
@@ -396,7 +394,7 @@ For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/ar
**Breaking changes:**
-- \[ballista\] support date\_part and date\_turnc ser/de, pass tpch 7
[\#840](https://github.com/apache/arrow-datafusion/pull/840)
([houqp](https://github.com/houqp))
+- \[ballista\] support date_part and date_turnc ser/de, pass tpch 7
[\#840](https://github.com/apache/arrow-datafusion/pull/840)
([houqp](https://github.com/houqp))
- Box ScalarValue:Lists, reduce size by half size
[\#788](https://github.com/apache/arrow-datafusion/pull/788)
([alamb](https://github.com/alamb))
- Support DataFrame.collect for Ballista DataFrames
[\#785](https://github.com/apache/arrow-datafusion/pull/785)
([andygrove](https://github.com/andygrove))
- JOIN conditions are order dependent
[\#778](https://github.com/apache/arrow-datafusion/pull/778)
([seddonm1](https://github.com/seddonm1))
@@ -451,7 +449,7 @@ For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/ar
- add `order by` construct in window function and logical plans
[\#463](https://github.com/apache/arrow-datafusion/pull/463)
([Jimexist](https://github.com/Jimexist))
- Refactor Ballista executor so that FlightService delegates to an Executor
struct [\#450](https://github.com/apache/arrow-datafusion/pull/450)
([andygrove](https://github.com/andygrove))
- implement lead and lag built-in window function
[\#429](https://github.com/apache/arrow-datafusion/pull/429)
([Jimexist](https://github.com/Jimexist))
-- Implement fmt\_as for ShuffleReaderExec
[\#400](https://github.com/apache/arrow-datafusion/pull/400)
([andygrove](https://github.com/andygrove))
+- Implement fmt_as for ShuffleReaderExec
[\#400](https://github.com/apache/arrow-datafusion/pull/400)
([andygrove](https://github.com/andygrove))
- Add window expression part 1 - logical and physical planning, structure,
to/from proto, and explain, for empty over clause only
[\#334](https://github.com/apache/arrow-datafusion/pull/334)
([Jimexist](https://github.com/Jimexist))
- \[breaking change\] fix 265, log should be log10, and add ln
[\#271](https://github.com/apache/arrow-datafusion/pull/271)
([Jimexist](https://github.com/Jimexist))
- Allow table providers to indicate their type for catalog metadata
[\#205](https://github.com/apache/arrow-datafusion/pull/205)
([returnString](https://github.com/returnString))
@@ -466,7 +464,7 @@ For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/ar
- Ballista: TPC-H q3 @ SF=1000 never completes
[\#835](https://github.com/apache/arrow-datafusion/issues/835)
- Ballista does not support MIN/MAX aggregate functions
[\#832](https://github.com/apache/arrow-datafusion/issues/832)
- Ballista docker images fail to build
[\#828](https://github.com/apache/arrow-datafusion/issues/828)
-- Ballista: UnresolvedShuffleExec should only have a single stage\_id
[\#726](https://github.com/apache/arrow-datafusion/issues/726)
+- Ballista: UnresolvedShuffleExec should only have a single stage_id
[\#726](https://github.com/apache/arrow-datafusion/issues/726)
- Ballista integration tests are failing
[\#623](https://github.com/apache/arrow-datafusion/issues/623)
- Integration test build failure due to arrow-rs using unstable feature
[\#596](https://github.com/apache/arrow-datafusion/issues/596)
- `cargo build` cannot build the project
[\#531](https://github.com/apache/arrow-datafusion/issues/531)
@@ -496,12 +494,12 @@ For older versions, see
[apache/arrow/CHANGELOG.md](https://github.com/apache/ar
**Closed issues:**
- Confirm git tagging strategy for releases
[\#770](https://github.com/apache/arrow-datafusion/issues/770)
-- arrow::util::pretty::pretty\_format\_batches missing
[\#769](https://github.com/apache/arrow-datafusion/issues/769)
+- arrow::util::pretty::pretty_format_batches missing
[\#769](https://github.com/apache/arrow-datafusion/issues/769)
- move the `assert_batches_eq!` macros to a non part of datafusion
[\#745](https://github.com/apache/arrow-datafusion/issues/745)
- fix an issue where aliases are not respected in generating downstream
schemas in window expr
[\#592](https://github.com/apache/arrow-datafusion/issues/592)
- make the planner to print more succinct and useful information in window
function explain clause
[\#526](https://github.com/apache/arrow-datafusion/issues/526)
- move window frame module to be in `logical_plan`
[\#517](https://github.com/apache/arrow-datafusion/issues/517)
-- use a more rust idiomatic way of handling nth\_value
[\#448](https://github.com/apache/arrow-datafusion/issues/448)
+- use a more rust idiomatic way of handling nth_value
[\#448](https://github.com/apache/arrow-datafusion/issues/448)
- Make Ballista not depend on arrow directly
[\#446](https://github.com/apache/arrow-datafusion/issues/446)
- create a test with more than one partition for window functions
[\#435](https://github.com/apache/arrow-datafusion/issues/435)
- Implement hash-partitioned hash aggregate
[\#27](https://github.com/apache/arrow-datafusion/issues/27)
@@ -517,7 +515,7 @@ For older versions, see
[apache/arrow/CHANGELOG.md](https://github.com/apache/ar
- Change datatype of tpch keys from Int32 to UInt64 to support sf=1000
[\#836](https://github.com/apache/arrow-datafusion/pull/836)
([andygrove](https://github.com/andygrove))
- Add ballista-examples to docker build
[\#829](https://github.com/apache/arrow-datafusion/pull/829)
([andygrove](https://github.com/andygrove))
- Update dependencies: prost to 0.8 and tonic to 0.5
[\#818](https://github.com/apache/arrow-datafusion/pull/818)
([alamb](https://github.com/alamb))
-- Move `hash_array` into hash\_utils.rs
[\#807](https://github.com/apache/arrow-datafusion/pull/807)
([alamb](https://github.com/alamb))
+- Move `hash_array` into hash_utils.rs
[\#807](https://github.com/apache/arrow-datafusion/pull/807)
([alamb](https://github.com/alamb))
- Fix: Update clippy lints for Rust 1.54
[\#794](https://github.com/apache/arrow-datafusion/pull/794)
([alamb](https://github.com/alamb))
- MINOR: Remove unused Ballista query execution code path
[\#732](https://github.com/apache/arrow-datafusion/pull/732)
([andygrove](https://github.com/andygrove))
- \[fix\] benchmark run with compose
[\#666](https://github.com/apache/arrow-datafusion/pull/666)
([rdettai](https://github.com/rdettai))
@@ -540,10 +538,8 @@ For older versions, see
[apache/arrow/CHANGELOG.md](https://github.com/apache/ar
- Remove references to Ballista Docker images published to ballistacompute
Docker Hub repo [\#326](https://github.com/apache/arrow-datafusion/pull/326)
([andygrove](https://github.com/andygrove))
- Update arrow-rs deps
[\#317](https://github.com/apache/arrow-datafusion/pull/317)
([alamb](https://github.com/alamb))
- Update arrow deps
[\#269](https://github.com/apache/arrow-datafusion/pull/269)
([alamb](https://github.com/alamb))
-- Enable redundant\_field\_names clippy lint
[\#261](https://github.com/apache/arrow-datafusion/pull/261)
([Dandandan](https://github.com/Dandandan))
+- Enable redundant_field_names clippy lint
[\#261](https://github.com/apache/arrow-datafusion/pull/261)
([Dandandan](https://github.com/Dandandan))
- Update arrow-rs deps \(to fix build due to flatbuffers update\)
[\#224](https://github.com/apache/arrow-datafusion/pull/224)
([alamb](https://github.com/alamb))
- update arrow-rs deps to latest master
[\#216](https://github.com/apache/arrow-datafusion/pull/216)
([alamb](https://github.com/alamb))
-
-
-\* *This Changelog was automatically generated by
[github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)*
+\* _This Changelog was automatically generated by
[github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)_
diff --git a/docs/README.md b/docs/README.md
index 0cfa559e..bfddcbca 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -17,7 +17,13 @@
under the License.
-->
-# DataFusion docs
+# Developer Documentation
+
+Developer documentation can be found [here](developer/).
+
+# User Documentation
+
+_These instructions were forked from the `arrow-datafusion` repository and are
outdated_
## Dependencies
diff --git a/ballista/docs/README.md b/docs/developer/README.md
similarity index 84%
rename from ballista/docs/README.md
rename to docs/developer/README.md
index 38d3db5d..6c3c3fd0 100644
--- a/ballista/docs/README.md
+++ b/docs/developer/README.md
@@ -21,12 +21,14 @@
This directory contains documentation for developers that are contributing to
Ballista. If you are looking for
end-user documentation for a published release, please start with the
-[DataFusion User Guide](../../docs/user-guide) instead.
+[Ballista User Guide](../source/user-guide) instead.
## Architecture & Design
- Read the [Architecture Overview](architecture.md) to get an understanding of
the scheduler and executor
processes and how distributed query execution works.
+- Watch the [Ballista: Distributed Compute with Rust and Apache
Arrow](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+ talk from the New York Open Statistical Programming Meetup (Feb 2021)
## Build, Test, Release
diff --git a/ballista/docs/architecture.md b/docs/developer/architecture.md
similarity index 59%
rename from ballista/docs/architecture.md
rename to docs/developer/architecture.md
index bdb45cf1..bcd96a67 100644
--- a/ballista/docs/architecture.md
+++ b/docs/developer/architecture.md
@@ -22,38 +22,29 @@
## Overview
Ballista allows queries to be executed in a distributed cluster. A cluster
consists of one or
-more scheduler processes and one or more executor processes. See the following
sections in this document for more
-details about these components.
+more scheduler processes and one or more executor processes.
The scheduler accepts logical query plans and translates them into physical
query plans using DataFusion and then
-runs a secondary planning/optimization process to translate the physical query
plan into a distributed physical
-query plan.
+runs a secondary planning process to translate the physical query plan into a
_distributed_ physical
+query plan by replacing any operator in the DataFusion plan which performs a
repartition with a stage boundary
+(i.e. a shuffle exchange).
-This process breaks a query down into a number of query stages that can be
executed independently. There are
+This results in a plan that contains a number of query stages that can be
executed independently. There are
dependencies between query stages and these dependencies form a
directed acyclic graph (DAG) because a query
stage cannot start until its child query stages have completed.
Each query stage has one or more partitions that can be processed in parallel
by the available
executors in the cluster. This is the basic unit of scalability in Ballista.
-The following diagram shows the flow of requests and responses between the
client, scheduler, and executor
-processes.
-
-## Distributed Scheduler Overview
-
-Ballista uses the DataFusion query execution framework to create a physical
plan and then transforms it into a
-distributed physical plan by breaking the query down into stages whenever the
partitioning scheme changes.
+The output of each query stage is persisted to disk and future query stages
will request this data from the executors
+that produced it. The persisted output will be partitioned according to the
partitioning scheme that was defined for
+the query stage and this typically differs from the partitioning scheme of the
query stage that will consume this
+intermediate output since it is the changes in partitioning in the plan that
define the query stage boundaries.
-Specifically, any `RepartitionExec` operator is replaced with an
`UnresolvedShuffleExec` and the child operator
-of the repartition operator is wrapped in a `ShuffleWriterExec` operator and
scheduled for execution.
+This exchange of data between query stages is called a "shuffle exchange" in
Apache Spark.
-Each executor polls the scheduler for the next task to run. Tasks are
currently always `ShuffleWriterExec` operators
-and each task represents one _input_ partition that will be executed. The
resulting batches are repartitioned
-according to the shuffle partitioning scheme and each _output_ partition is
streamed to disk in Arrow IPC format.
-
-The scheduler will replace `UnresolvedShuffleExec` operators with
`ShuffleReaderExec` operators once all shuffle
-tasks have completed. The `ShuffleReaderExec` operator connects to other
executors as required using the Flight
-interface, and streams the shuffle IPC files.
+The following diagram shows the flow of requests and responses between the
client, scheduler, and executor
+processes.

@@ -76,16 +67,16 @@ The scheduler can run in standalone mode, or can be run in
clustered mode using
The executor process implements the Apache Arrow Flight gRPC interface and is
responsible for:
-- Executing query stages and persisting the results to disk in Apache Arrow
IPC Format
-- Making query stage results available as Flights so that they can be
retrieved by other executors as well as by
- clients
+- Connecting to the scheduler and requesting tasks to execute
+- Executing tasks within a query stage and persisting the results to disk in
Apache Arrow IPC Format
+- Making query stage output partitions available as "Flights" so that they can
be retrieved by other executors as well
+ as by clients
## Rust Client
-The Rust client provides a DataFrame API that is a thin wrapper around the
DataFusion DataFrame and provides
-the means for a client to build a query plan for execution.
+The Rust client provides a `BallistaContext` that allows queries to be built
using DataFrames or SQL (or both).
-The client executes the query plan by submitting an `ExecuteLogicalPlan`
request to the scheduler and then calls
+The client executes the query plan by submitting an `ExecuteQuery` request to
the scheduler and then calls
`GetJobStatus` to check for completion. On completion, the client receives a
list of locations for the Flights
containing the results for the query and will then connect to the appropriate
executor processes to retrieve
those results.
diff --git a/ballista/docs/dev-env.md b/docs/developer/dev-env.md
similarity index 100%
rename from ballista/docs/dev-env.md
rename to docs/developer/dev-env.md
diff --git a/ballista/docs/images/query-execution.png
b/docs/developer/images/query-execution.png
similarity index 100%
rename from ballista/docs/images/query-execution.png
rename to docs/developer/images/query-execution.png
diff --git a/ballista/docs/integration-testing.md
b/docs/developer/integration-testing.md
similarity index 100%
rename from ballista/docs/integration-testing.md
rename to docs/developer/integration-testing.md
diff --git a/examples/README.md b/examples/README.md
index 1fb29bf9..e80bbd29 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -23,7 +23,7 @@ This directory contains examples for executing distributed
queries with Ballista
# Standalone Examples
-The standalone example is the easiest to get started with. Ballista supports a
standalone mode where a scheduler
+The standalone example is the easiest to get started with. Ballista supports a
standalone mode where a scheduler
and executor are started in-process.
```bash
@@ -56,7 +56,6 @@ async fn main() -> Result<()> {
```
-
# Distributed Examples
For background information on the Ballista architecture, refer to
@@ -76,7 +75,7 @@ Start a Ballista scheduler process in a new terminal session.
RUST_LOG=info ./target/release/ballista-scheduler
```
-Start one or more Ballista executor processes in new terminal sessions. When
starting more than one
+Start one or more Ballista executor processes in new terminal sessions. When
starting more than one
executor, a unique port number must be specified for each executor.
```bash
@@ -86,7 +85,7 @@ RUST_LOG=info ./target/release/ballista-executor -c 2 -p 50052
## Running the examples
-The examples can be run using the `cargo run --bin` syntax.
+The examples can be run using the `cargo run --bin` syntax.
## Distributed SQL Example
@@ -150,4 +149,4 @@ async fn main() -> Result<()> {
Ok(())
}
-```
\ No newline at end of file
+```