Dandandan commented on code in PR #19932:
URL: https://github.com/apache/datafusion/pull/19932#discussion_r2759212030
##########
docs/source/user-guide/configs.md:
##########
@@ -161,6 +161,8 @@ The following configuration settings are available:
| datafusion.optimizer.hash_join_single_partition_threshold_rows | 131072 | The maximum estimated size in rows for one input side of a HashJoin will be collected into a single partition |
| datafusion.optimizer.hash_join_inlist_pushdown_max_size | 131072 | Maximum size in bytes for the build side of a hash join to be pushed down as an InList expression for dynamic filtering. Build sides larger than this will use hash table lookups instead. Set to 0 to always use hash table lookups. InList pushdown can be more efficient for small build sides because it can result in better statistics pruning as well as use any bloom filters present on the scan side. InList expressions are also more transparent and easier to serialize over the network in distributed uses of DataFusion. On the other hand InList pushdown requires making a copy of the data and thus adds some overhead to the build side and uses more memory. This setting is per-partition, so we may end up using `hash_join_inlist_pushdown_max_size` \* `target_partitions` memory. The default is 128kB per partition. This should allow point lookup joins (e.g. joining on a unique primary key) to use InList pushdown in most cases but avoids excessive memory usage or overhead for larger joins. |
| datafusion.optimizer.hash_join_inlist_pushdown_max_distinct_values | 150 | Maximum number of distinct values (rows) in the build side of a hash join to be pushed down as an InList expression for dynamic filtering. Build sides with more rows than this will use hash table lookups instead. Set to 0 to always use hash table lookups. This provides an additional limit beyond `hash_join_inlist_pushdown_max_size` to prevent very large IN lists that might not provide much benefit over hash table lookups. This uses the deduplicated row count once the build side has been evaluated. The default is 150 values per partition. This is inspired by Trino's `max-filter-keys-per-column` setting. See: <https://trino.io/docs/current/admin/dynamic-filtering.html#dynamic-filter-collection-thresholds> |
+| datafusion.optimizer.hash_join_map_pushdown | true | When true, pushes down hash table references for membership checks in hash joins when the build side is too large for InList pushdown. When false, no membership filter is created when InList thresholds are exceeded. Default: true |
+| datafusion.optimizer.hash_join_bounds_pushdown | true | When true, pushes down min/max bounds for join key columns. This enables statistics-based pruning (e.g., Parquet row group skipping). When false, only membership filters (InList or Map) are pushed down. Default: true |
Review Comment:
   Could we also add a config to toggle it for pruning only (as that's relatively cheap)?
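   For reference, a minimal sketch (not part of this PR) of how the two new toggles could be set programmatically. The option names are taken from the diff above and may still change before merge; `SessionConfig::set_bool` is just the generic key/value setter, so this only approximates a "pruning only" mode rather than being a dedicated config for it:

   ```rust
   use datafusion::prelude::{SessionConfig, SessionContext};

   fn main() {
       // Sketch only: option names come from this PR's diff and may change.
       // There is no dedicated "pruning only" toggle yet; this approximates one
       // by keeping bounds pushdown while disabling the Map-based membership filter.
       let config = SessionConfig::new()
           // Keep min/max bounds pushdown so statistics-based pruning
           // (e.g. Parquet row group skipping) still applies.
           .set_bool("datafusion.optimizer.hash_join_bounds_pushdown", true)
           // Skip hash-table (Map) membership pushdown when the build side
           // exceeds the InList thresholds.
           .set_bool("datafusion.optimizer.hash_join_map_pushdown", false);

       let _ctx = SessionContext::new_with_config(config);
   }
   ```

   Note that with these settings small build sides would presumably still produce InList membership filters, which is why a dedicated pruning-only toggle would be needed.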
