(datafusion) branch main updated: Update documentation for `datafusion.execution.collect_statistics` (#16100)

alamb Tue, 20 May 2025 06:26:16 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git



The following commit(s) were added to refs/heads/main by this push:
     new 8d9c0f6b87 Update documentation for `datafusion.execution.collect_statistics` (#16100)
8d9c0f6b87 is described below

commit 8d9c0f6b87d9f3a52e4d3dc642535d09cf86049f
Author: Andrew Lamb <[email protected]>
AuthorDate: Tue May 20 09:26:00 2025 -0400

    Update documentation for `datafusion.execution.collect_statistics` (#16100)
    
    * Update documentation for `datafusion.execution.collect_statistics` setting
    
    * Update test
    
    * Update datafusion/common/src/config.rs
    
    Co-authored-by: Leonardo Yvens <[email protected]>
    
    * update docs
    
    * Update doc
    
    ---------
    
    Co-authored-by: Leonardo Yvens <[email protected]>
---
 datafusion/common/src/config.rs                           | 4 +++-
 datafusion/sqllogictest/test_files/information_schema.slt | 2 +-
 docs/source/user-guide/configs.md                         | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs
index b701b7130b..59283114e3 100644
--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -292,7 +292,9 @@ config_namespace! {
         /// target batch size is determined by the configuration setting
         pub coalesce_batches: bool, default = true
 
-        /// Should DataFusion collect statistics after listing files
+        /// Should DataFusion collect statistics when first creating a table.
+        /// Has no effect after the table is created. Applies to the default
+        /// `ListingTableProvider` in DataFusion. Defaults to false.
         pub collect_statistics: bool, default = false
 
         /// Number of partitions for query execution. Increasing partitions can increase
diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt
index 3a98a4d185..841b289e75 100644
--- a/datafusion/sqllogictest/test_files/information_schema.slt
+++ b/datafusion/sqllogictest/test_files/information_schema.slt
@@ -326,7 +326,7 @@ datafusion.catalog.location NULL Location scanned to load tables for `default` s
 datafusion.catalog.newlines_in_values false Specifies whether newlines in (quoted) CSV values are supported. This is the default value for `format.newlines_in_values` for `CREATE EXTERNAL TABLE` if not specified explicitly in the statement. Parsing newlines in quoted values may be affected by execution behaviour such as parallel file scanning. Setting this to `true` ensures that newlines in values are parsed successfully, which may reduce performance.
 datafusion.execution.batch_size 8192 Default batch size while creating new batches, it's especially useful for buffer-in-memory batches since creating tiny batches would result in too much metadata memory consumption
 datafusion.execution.coalesce_batches true When set to true, record batches will be examined between each operator and small batches will be coalesced into larger batches. This is helpful when there are highly selective filters or joins that could produce tiny output batches. The target batch size is determined by the configuration setting
-datafusion.execution.collect_statistics false Should DataFusion collect statistics after listing files
+datafusion.execution.collect_statistics false Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default `ListingTableProvider` in DataFusion. Defaults to false.
 datafusion.execution.enable_recursive_ctes true Should DataFusion support recursive CTEs
 datafusion.execution.enforce_batch_size_in_joins false Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower.
 datafusion.execution.keep_partition_by_columns false Should DataFusion keep the columns used for partition_by in the output RecordBatches
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index fe9c57857b..4129ddc392 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -47,7 +47,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
 | datafusion.catalog.newlines_in_values | false | Specifies whether newlines in (quoted) CSV values are supported. This is the default value for `format.newlines_in_values` for `CREATE EXTERNAL TABLE` if not specified explicitly in the statement. Parsing newlines in quoted values may be affected by execution behaviour such as parallel file scanning. Setting this to `true` ensures that newlines in values are parsed successfully, which [...]
 | datafusion.execution.batch_size | 8192 | Default batch size while creating new batches, it's especially useful for buffer-in-memory batches since creating tiny batches would result in too much metadata memory consumption [...]
 | datafusion.execution.coalesce_batches | true | When set to true, record batches will be examined between each operator and small batches will be coalesced into larger batches. This is helpful when there are highly selective filters or joins that could produce tiny output batches. The target batch size is determined by the configuration setting [...]
-| datafusion.execution.collect_statistics | false | Should DataFusion collect statistics after listing files [...]
+| datafusion.execution.collect_statistics | false | Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default `ListingTableProvider` in DataFusion. Defaults to false. [...]
 | datafusion.execution.target_partitions | 0 | Number of partitions for query execution. Increasing partitions can increase concurrency. Defaults to the number of CPU cores on the system [...]
 | datafusion.execution.time_zone | +00:00 | The default time zone Some functions, e.g. `EXTRACT(HOUR from SOME_TIME)`, shift the underlying datetime according to this time zone, and then extract the hour [...]
 | datafusion.execution.parquet.enable_page_index | true | (reading) If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce the I/O and number of rows decoded. [...]
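
As an illustration of the updated description (this sketch is not part of the commit): since the setting is described as applying when a table is first created, it would need to be enabled before the `CREATE EXTERNAL TABLE` statement runs. A minimal sketch in DataFusion SQL, assuming a hypothetical table name `hits` and a hypothetical Parquet directory `data/hits/`:

```sql
-- Enable statistics collection first: per the updated description, the
-- setting has no effect on a table that has already been created.
SET datafusion.execution.collect_statistics = true;

-- Hypothetical table name and location, for illustration only.
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'data/hits/';
```

Going by the description, changing the setting after this point would not alter `hits`; presumably the table would have to be created again for a new value to apply.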


