martin-g commented on code in PR #19316:
URL: https://github.com/apache/datafusion/pull/19316#discussion_r2619201926


##########
datafusion/physical-plan/src/analyze.rs:
##########
@@ -194,26 +228,49 @@ impl ExecutionPlan for AnalyzeExec {
         // JoinSet that computes the overall row count and final
         // record batch
         let mut input_stream = builder.build();
+
+        let inner_schema = Arc::clone(&self.input.schema());
+        let auto_explain = self.auto_explain;
+        let auto_explain_output = self.auto_explain_output.clone();
+        let auto_explain_min_duration = self.auto_explain_min_duration;
+
         let output = async move {
+            let mut batches = vec![];
             let mut total_rows = 0;
             while let Some(batch) = input_stream.next().await.transpose()? {
                 total_rows += batch.num_rows();
+                batches.push(batch);

Review Comment:
   This introduces batching even for the non-auto-explain mode! It will add a 
lot of memory pressure for all users!



##########
datafusion/physical-plan/src/analyze.rs:
##########
@@ -271,19 +328,41 @@ fn create_output_batch(
     .map_err(DataFusionError::from)
 }
 
+fn export_auto_explain(batch: RecordBatch, output: &str) -> Result<()> {
+    let fd: &mut dyn Write = match output {
+        "stdout" => &mut io::stdout(),
+        "stderr" => &mut io::stderr(),
+        _ => &mut OpenOptions::new().create(true).append(true).open(output)?,

Review Comment:
   Does this need any kind of validation of the file location ? 
   Or it is left to the developer/admin to make sure it is a safe place ?



##########
datafusion/physical-plan/src/analyze.rs:
##########


Review Comment:
   This method does not preserve the new auto_explain settings. Is this 
intended ?



##########
datafusion/common/src/config.rs:
##########
@@ -1117,6 +1117,19 @@ config_namespace! {
         /// "summary" shows common metrics for high-level insights.
         /// "dev" provides deep operator-level introspection for developers.
         pub analyze_level: ExplainAnalyzeLevel, default = 
ExplainAnalyzeLevel::Dev
+
+        /// Whether to enable the `auto_explain` mode. In this mode, the 
execution plan is
+        /// automatically written to the location set by 
`auto_explain_output`, if the plan's total
+        /// duration is greater or equal to `auto_explain_min_duration`, in 
milliseconds.
+        pub auto_explain: bool, default = false
+
+        /// Output location used in the `auto_explain` mode. Supports 
`stdout`, `stderr`, or a file
+        /// path (file is created if it does not exist; plans are appended to 
the file).
+        pub auto_explain_output: String, default = "stdout".to_owned()
+
+        /// In the `auto_explain` mode, only output plans if their duration is 
bigger than this

Review Comment:
   ```suggestion
           /// In the `auto_explain` mode, only output plans if their duration 
is bigger than or equal to this
   ```



##########
datafusion/physical-plan/src/analyze.rs:
##########
@@ -271,19 +328,41 @@ fn create_output_batch(
     .map_err(DataFusionError::from)
 }
 
+fn export_auto_explain(batch: RecordBatch, output: &str) -> Result<()> {
+    let fd: &mut dyn Write = match output {
+        "stdout" => &mut io::stdout(),
+        "stderr" => &mut io::stderr(),
+        _ => &mut OpenOptions::new().create(true).append(true).open(output)?,

Review Comment:
   Does this need some kind of synchronisation when a file path is used for the 
output ? Two or more DF sessions using the same config may try to write to the 
same file simultaneously.



##########
datafusion/sqllogictest/test_files/explain_analyze.slt:
##########
@@ -25,3 +25,23 @@ Plan with Metrics LazyMemoryExec: partitions=1, 
batch_generators=[generate_serie
 
 statement ok
 reset datafusion.explain.analyze_level;
+
+
+# test auto_explain
+
+statement ok
+set datafusion.explain.auto_explain_output = 
'test_files/scratch/auto_explain.txt';

Review Comment:
   Does something assert the contents of this output file ?
   Does something remove this file at the end ?



##########
datafusion/physical-plan/src/analyze.rs:
##########
@@ -101,6 +121,20 @@ impl AnalyzeExec {
             input.boundedness(),
         )
     }
+
+    pub fn enable_auto_explain(&mut self) {
+        self.auto_explain = true;
+        self.cache =
+            Self::compute_properties(&self.input, 
Arc::clone(&self.input.schema()));
+    }
+
+    pub fn set_auto_explain_output(&mut self, value: String) {
+        self.auto_explain_output = value
+    }
+
+    pub fn set_auto_explain_min_duration(&mut self, value: usize) {

Review Comment:
   nit: this helps to realize what units are used here. Otherwise the reader 
will need to search around to find this out.
   Or even better - change the type of auto_explain_min_duration at 
https://github.com/apache/datafusion/pull/19316/changes#diff-0cd0842fd045a959ba4ee886c0410353ee6af047478bddb5019b1e45a9fc9774R71
 to Duration
   ```suggestion
       pub fn set_auto_explain_min_duration_ms(&mut self, value: usize) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to