[GitHub] [arrow-ballista] thinkharderdev commented on a diff in pull request #41: MINOR: Improve developer docs

GitBox Wed, 25 May 2022 03:14:09 -0700


thinkharderdev commented on code in PR #41:
URL: https://github.com/apache/arrow-ballista/pull/41#discussion_r881469734



##########
docs/developer/architecture.md:
##########
@@ -22,11 +22,10 @@
 ## Overview
 
 Ballista allows queries to be executed in a distributed cluster. A cluster 
consists of one or
-more scheduler processes and one or more executor processes. See the following 
sections in this document for more
-details about these components.
+more scheduler processes and one or more executor processes.

Review Comment:
   Strictly speaking we only support a single scheduler at the moment. But 
maybe we keep it like this since I hope we can fix that soon :)



##########
docs/developer/architecture.md:
##########
@@ -22,11 +22,10 @@
 ## Overview
 
 Ballista allows queries to be executed in a distributed cluster. A cluster 
consists of one or
-more scheduler processes and one or more executor processes. See the following 
sections in this document for more
-details about these components.
+more scheduler processes and one or more executor processes.
 
 The scheduler accepts logical query plans and translates them into physical 
query plans using DataFusion and then
-runs a secondary planning/optimization process to translate the physical query 
plan into a distributed physical
+runs a secondary planning process to translate the physical query plan into a 
_distributed_ physical

Review Comment:
   Maybe a word here about how the DataFusion plan gets turned into a 
distributed plan? Something like "We get the distributed physical plan by 
replacing any operator in the DataFusion plan which performs a repartition with 
a stage boundary (i.e. a shuffle exchange)"
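
   A rough sketch of that idea, in case it helps the wording. This uses a toy plan enum rather than DataFusion's real `ExecutionPlan` trait, and every name below (`Plan`, `ShuffleRead`, `split`, `plan_stages`) is made up for illustration. It only shows the shape of the transformation: each repartition node is cut out, its input becomes its own stage, and the parent reads that stage's shuffle output instead.

```rust
// Toy model only: these types are hypothetical and are NOT the Ballista or
// DataFusion API. They just illustrate cutting a plan at repartition nodes.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter(Box<Plan>),
    Aggregate(Box<Plan>),
    Repartition(Box<Plan>),
    // Placeholder leaf that reads the shuffle output of an earlier stage.
    ShuffleRead(usize),
}

// Walks the plan bottom-up. Each Repartition node is removed: its input is
// pushed as a new stage and the node is replaced by a ShuffleRead of that stage.
fn split(plan: Plan, stages: &mut Vec<Plan>) -> Plan {
    match plan {
        Plan::Repartition(input) => {
            let child = split(*input, stages);
            stages.push(child);
            Plan::ShuffleRead(stages.len() - 1)
        }
        Plan::Filter(input) => Plan::Filter(Box::new(split(*input, stages))),
        Plan::Aggregate(input) => Plan::Aggregate(Box::new(split(*input, stages))),
        leaf => leaf,
    }
}

// Returns the stages in dependency order; the last entry is the final stage.
fn plan_stages(plan: Plan) -> Vec<Plan> {
    let mut stages = Vec::new();
    let root = split(plan, &mut stages);
    stages.push(root);
    stages
}

fn main() {
    // Final aggregate <- repartition <- partial aggregate <- filter <- scan
    let plan = Plan::Aggregate(Box::new(Plan::Repartition(Box::new(
        Plan::Aggregate(Box::new(Plan::Filter(Box::new(Plan::Scan(
            "t".to_string(),
        ))))),
    ))));
    for (id, stage) in plan_stages(plan).iter().enumerate() {
        println!("stage {}: {:?}", id, stage);
    }
}
```

   With the example plan, the partial aggregate over the filtered scan ends up in one stage and the final aggregate reads from it in a second stage, which is the "stage boundary at every repartition" picture.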



##########
docs/developer/architecture.md:
##########
@@ -76,14 +66,14 @@ The scheduler can run in standalone mode, or can be run in 
clustered mode using
 
 The executor process implements the Apache Arrow Flight gRPC interface and is 
responsible for:
 
-- Executing query stages and persisting the results to disk in Apache Arrow 
IPC Format
-- Making query stage results available as Flights so that they can be 
retrieved by other executors as well as by
-  clients
+- Connecting to the scheduler and requesting tasks to execute
+- Executing tasks within a query stage and persisting the results to disk in 
Apache Arrow IPC Format
+- Making query stage output partitions available as "Flights" so that they can 
be retrieved by other executors as well
+  as by clients
 
 ## Rust Client
 
-The Rust client provides a DataFrame API that is a thin wrapper around the 
DataFusion DataFrame and provides
-the means for a client to build a query plan for execution.
+The Rust client provides a `BallistaContext` that allows queries to be built 
using DataFrames or SQL (or both).
 
 The client executes the query plan by submitting an `ExecuteLogicalPlan` 
request to the scheduler and then calls

Review Comment:
   `ExecuteLogicalPlan` should be `ExecuteQuery`.



