Re: [PR] Docs: Add Tuning Guide for small data / short queries [datafusion]

via GitHub Tue, 05 Aug 2025 04:28:59 -0700


xudong963 commented on code in PR #17040:
URL: https://github.com/apache/datafusion/pull/17040#discussion_r2254055478



##########
dev/update_config_docs.sh:
##########
@@ -92,6 +92,38 @@ EOF
 echo "Running CLI and inserting runtime config docs table"
 $PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"
 
+cat <<'EOF' >> "$TARGET_FILE"
+
+# Tuning Guide
+
+## Short Queries
+
+By default, DataFusion attempts to maximize parallelism and use all cores.
+For example, if you have 32 cores, each plan will split the data into 32
+partitions. However, if your data is small, the overhead of splitting the data
+to enable parallelization can dominate the actual computation.
+
+You can find out how many cores are being used via the [`EXPLAIN`] command and look
+at the number of partitions in the plan.
+
+[`EXPLAIN`]: sql/explain.md
+
+The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
+[`ListingTable`] provider will attempt to repartition. However, this
+does not apply to user defined data sources and only works when DataFusion has accurate statistics.
+
+If you know your data is small, you can set the `datafusion.execution.target_partitions`
+option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
+than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
+
+```sql
+SET datafusion.execution.target_partitions = '1';
+```

Review Comment:
   We have some cases like this and will give it a try.
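
A minimal sketch of how one might check the effect on such a case, assuming a hypothetical table named `small_table` is already registered. The table name and the size threshold are illustrative assumptions, not values taken from the PR:

```sql
-- Inspect the plan before changing anything and note the partition
-- counts reported in it. `small_table` is a hypothetical table name.
EXPLAIN SELECT count(*) FROM small_table;

-- Option described in the guide: use a single partition for very small data.
SET datafusion.execution.target_partitions = '1';

-- Alternative for ListingTable sources (normally used instead of the
-- setting above): raise the minimum file size, in bytes, below which
-- files are not repartitioned. 104857600 (100 MiB) is an assumed,
-- illustrative value.
SET datafusion.optimizer.repartition_file_min_size = '104857600';

-- Re-run EXPLAIN and compare the partition counts with the first plan.
EXPLAIN SELECT count(*) FROM small_table;
```

Comparing the two plans should show whether the repartitioning step is gone. Whether that shortens the query in practice still has to be measured on the actual workload.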



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


