alamb commented on code in PR #19319:
URL: https://github.com/apache/datafusion/pull/19319#discussion_r2649929217
##########
datafusion-examples/examples/data_io/parquet_exec_visitor.rs:
##########
@@ -29,23 +31,47 @@ use datafusion::physical_plan::metrics::MetricValue;
use datafusion::physical_plan::{
ExecutionPlan, ExecutionPlanVisitor, execute_stream, visit_execution_plan,
};
+use datafusion::prelude::CsvReadOptions;
use futures::StreamExt;
+use tempfile::TempDir;
+use tokio::fs::create_dir_all;
/// Example of collecting metrics after execution by visiting the `ExecutionPlan`
pub async fn parquet_exec_visitor() -> datafusion::common::Result<()> {
let ctx = SessionContext::new();
- let test_data = datafusion::test_util::parquet_test_data();
+ // Load CSV into an in-memory DataFrame, then materialize it to Parquet.
Review Comment:
I think this repeated code fragment that writes out a Parquet file gets in the
way of the example. It is about 20 lines of setup unrelated to what the example
is trying to show, and I fear it will be confusing for first-time users
(imagine if this is their first exposure to DataFusion).
Could we move this into a function (something like `fn write_csv_to_parquet`,
for example)? I think it is ok to have the code replicated (so that the
examples stay self-contained), but not inline like this.
I am sorry I have been away for a few days and haven't been able to give
you more timely feedback.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]