vibhatha commented on a change in pull request #12033: URL: https://github.com/apache/arrow/pull/12033#discussion_r778233710
########## File path: docs/source/cpp/streaming_execution.rst ########## @@ -175,9 +175,607 @@ their completion:: // alive until this future is marked finished. Future<> complete = plan->finished(); +Constructing ``ExecNode`` using Options +======================================= + +Using the execution plan we can construct varioud execution queries. +To construct such queries, we have provided a set of containers or +referred as :class:`ExecNode` s. These nodes provide the ability to +construct operations like filtering, projection, join, etc. + +This is the list of :class:`ExecutionNode` s exposed; + +1. :class:`SourceNode` +2. :class:`FilterNode` +3. :class:`ProjectNode` +4. :class:`ScalarAggregateNode` +5. :class:`SinkNode` +6. :class:`ConsumingSinkNode` +7. :struct:`OrderBySinkNode` +8. SelectK-SinkNode +9. Scan-Node +10. :class:`HashJoinNode` +11. Write-Node +12. :class:`UnionNode` + +There are a set of :class:`ExecNode` s designed to provide various operations required +in designing a streaming execution plan. + +``SourceNode`` +-------------- + +:struct:`arrow::compute::SourceNode` can be considered as an entry point to create a streaming execution plan. +A source node can be constructed as follows. + +:class:`arrow::compute::SoureNodeOptions` are used to create the :struct:`arrow::compute::SourceNode`. +The :class:`Schema` of the data passing through and a function to generate data +`std::function<arrow::Future<arrow::util::optional<arrow::compute::ExecBatch>>()>` +are required to create this option:: + + // data generator + arrow::AsyncGenerator<arrow::util::optional<cp::ExecBatch>> gen() { ... } + + // data schema + auto schema = arrow::schema({...}) + + // source node options + auto source_node_options = arrow::compute::SourceNodeOptions{schema, gen}; + + // create a source node + ARROW_ASSIGN_OR_RAISE(arrow::compute::ExecNode * source, + arrow::compute::MakeExecNode("source", plan.get(), {}, + source_node_options)); + +``FilterNode`` +-------------- + +:class:`FilterNode`, as the name suggests, provide a container to define a data filtering criteria. +Filter can be written using :class:`arrow::compute::Expression`. For instance if the row values +of a particular column needs to be filtered by a boundary value, ex: all values of column b +greater than 3, can be written using :class:`arrow::compute::FilterNodeOptions` as follows:: + + // a > 3 + arrow::compute::Expression filter_opt = arrow::compute::greater( + arrow::compute::field_ref("a"), + arrow::compute::literal(3)); + +Using this option, the filter node can be constructed as follows:: + + // creating filter node + arrow::compute::ExecNode* filter; + ARROW_ASSIGN_OR_RAISE(filter, arrow::compute::MakeExecNode("filter", + // plan + plan.get(), + // previous input + {scan}, + //filter node options + arrow::compute::FilterNodeOptions{filter_opt})); + +``ProjectNode`` +--------------- + +:class:`ProjectNode` executes expressions on input batches and produces new batches. +Each expression will be evaluated against each batch which is pushed to this +node to produce a corresponding output column. This is exposed via +:class:`arrow::compute::ProjectNodeOptions` component which requires, +a :class:`arrow::compute::Expression`, names of the project columns (names are not provided, +the string representations of exprs will be used) and a boolean flag to determine +synchronous/asynchronous nature (by default asynchronous option is set to `true`). + +Sample Expression for projection:: + + // a * 2 (multiply values in a column by 2) + arrow::compute::Expression a_times_2 = arrow::compute::call("multiply", + {arrow::compute::field_ref("a"), arrow::compute::literal(2)}); + + +Creating a project node:: + + arrow::compute::ExecNode *project; + ARROW_ASSIGN_OR_RAISE(project, + arrow::compute::MakeExecNode("project", + // plan + plan.get(), + // previous node + {scan}, + // project node options + arrow::compute::ProjectNodeOptions{{a_times_2}})); + +``ScalarAggregateNode`` +----------------------- + +:class:`ScalarAggregateNode` is an :class:`ExecNode` which provides various +aggregation options. The :class:`arrow::compute::AggregateNodeOptions` provides the +container to define the aggregation criterion. These options can be +selected from `arrow::compute` options. Review comment: removed and linked to existing docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org