westonpace commented on a change in pull request #12689:
URL: https://github.com/apache/arrow/pull/12689#discussion_r838028830



##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -647,6 +649,25 @@ SelectK example:
 
 .. _stream_execution_scan_docs:
 
+``table_sink``
+----------------
+
+.. _stream_execution_table_sink_docs:
+
+Considering the variety of sink nodes provided in the streaming execution engine, the ``table_sink`` node
+provides the ability to take the output as a table. It is much easier to use 
+:class:`arrow::compute::TableSinkNodeOptions`.

Review comment:
       ```suggestion
    The ``table_sink`` node provides the ability to take the output as an in-memory table.
    This is much simpler to use than the other sink nodes provided by the streaming execution engine
    but it only makes sense when the output fits comfortably in memory.  The node is created using :class:`arrow::compute::TableSinkNodeOptions`.
   ```

##########
File path: cpp/examples/arrow/execution_plan_documentation_examples.cc
##########
@@ -851,6 +851,50 @@ arrow::Status SourceUnionSinkExample(cp::ExecContext& exec_context) {
 
 // (Doc section: Union Example)
 
+// (Doc section: Table Sink Example)
+
+/// \brief An example showing a table sink node
+/// \param exec_context The execution context to run the plan in
+///
+/// TableSink Example
+/// This example shows how a table_sink can be used
+/// in an execution plan. This includes a source node
+/// receiving data as batches and the table sink node
+/// which emits the output as a table.
+arrow::Status TableSinkExample(cp::ExecContext& exec_context) {
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<cp::ExecPlan> plan,
+                        cp::ExecPlan::Make(&exec_context));
+
+  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());
+
+  auto source_node_options = cp::SourceNodeOptions{basic_data.schema, basic_data.gen()};
+
+  ARROW_ASSIGN_OR_RAISE(cp::ExecNode * source,
+                        cp::MakeExecNode("source", plan.get(), {}, source_node_options));
+
+  std::shared_ptr<arrow::Table> output_table;
+  auto table_sink_options = cp::TableSinkNodeOptions{&output_table, basic_data.schema};
+
+  ARROW_RETURN_NOT_OK(
+      cp::MakeExecNode("table_sink", plan.get(), {source}, table_sink_options));
+  // validate the ExecPlan
+  ARROW_RETURN_NOT_OK(plan->Validate());
+  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
+  // start the ExecPlan
+  ARROW_RETURN_NOT_OK(plan->StartProducing());
+
+  auto finish = source->finished();
+
+  RETURN_NOT_OK(finish.status());
+
+  std::cout << "Results : " << output_table->ToString() << std::endl;
+
+  // plan mark finished
+  auto future = plan->finished();
+  return future.status();

Review comment:
       ```suggestion
     // Wait for the plan to finish
     auto finished = plan->finished();
     RETURN_NOT_OK(finished.status());
   
     std::cout << "Results : " << output_table->ToString() << std::endl;
     return Status::OK();
   ```
   I don't think it's a good idea to call `source->finished()`, since waiting on individual nodes is itself a bad idea.  Users should probably only ever wait on the plan.

##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -647,6 +649,25 @@ SelectK example:
 
 .. _stream_execution_scan_docs:
 
+``table_sink``
+----------------
+
+.. _stream_execution_table_sink_docs:
+
+Considering the variety of sink nodes provided in the streaming execution engine, the ``table_sink`` node
+provides the ability to take the output as a table. It is much easier to use 
+:class:`arrow::compute::TableSinkNodeOptions`.
+The output data can be obtained as a ``std::shared_ptr<arrow::Table>`` along with the output ``schema``.

Review comment:
       Technically it was an input schema I think (and it's going away now).  A `Table` already has a schema associated with it so there is no need to specify the output schema separately.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

