adriangb commented on code in PR #18393:
URL: https://github.com/apache/datafusion/pull/18393#discussion_r2482567563
##########
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##########
@@ -1322,11 +1324,11 @@ async fn test_hashjoin_dynamic_filter_pushdown_partitioned() {
- CoalesceBatchesExec: target_batch_size=8192
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(a@0, a@0), (b@1, b@1)]
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
- - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb ]
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
+ - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}, {c0:ab,c1:bb}]) ]
Review Comment:
In this case there's only one partition with data (because of hash collisions), so we optimize away the `CASE` expression. This is relevant because the same thing would happen with a point-lookup primary key join.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -143,6 +144,76 @@ where
}
}
+/// Specialized Set implementation for StructArray
+struct StructArraySet {
+ array: Arc<StructArray>,
+ hash_set: ArrayHashSet,
+}
Review Comment:
I think this is a nice improvement / feature that can easily be its own PR.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -474,16 +621,81 @@ pub fn in_list(
)))
}
+/// Create a new static InList expression from an array of values.
+///
+/// The `array` should contain the list of values for the `IN` clause.
+/// The `negated` flag indicates whether this is an `IN` or `NOT IN` expression.
+///
+/// The generated `InListExpr` will use the provided array as a static filter for efficient evaluation
+/// and the list of expressions will be constructed from the unique values in the array (with no guaranteed order).
+///
+/// # Errors
+/// Returns an error if the array type is not supported for `InList` expressions or if we cannot build a hash based lookup.
+pub fn in_list_from_array(
+ expr: Arc<dyn PhysicalExpr>,
+ array: ArrayRef,
+ negated: bool,
+) -> Result<Arc<dyn PhysicalExpr>> {
+ let mut list = Vec::new();
+ let static_filter = make_set(array.as_ref(), |hash_set| {
+ // If we are in tests, sort the keys so we get a deterministic order for the list.
+ // Use cfg(any(test, debug_assertions)) to cover both unit tests and integration tests.
+ #[cfg(any(test, debug_assertions))]
+ {
+ use itertools::Itertools;
+ list = hash_set
+ .map
+ .keys()
+ .sorted()
+ .map(|&i| {
+ let scalar = ScalarValue::try_from_array(&array, i)?;
+ Ok(crate::expressions::lit(scalar))
+ })
+ .collect::<Result<Vec<_>>>()?;
+ }
+ #[cfg(not(any(test, debug_assertions)))]
+ {
+ list = hash_set
+ .map
+ .keys()
+ .map(|&i| {
+ let scalar = ScalarValue::try_from_array(&array, i)?;
+ Ok(crate::expressions::lit(scalar))
+ })
+ .collect::<Result<Vec<_>>>()?;
+ }
+ Ok(hash_set)
+ })?;
+
+ // If make_set failed, build list from full array
+ if list.is_empty() && !array.is_empty() {
Review Comment:
This is just weirdness with the typing. We could also just `.expect()` on `list`.
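
For illustration, a self-contained toy version of the `.expect()` variant (`make_set_like` is a stand-in for `make_set`, which invokes the closure once on success; none of these names are from the PR):

```rust
// Toy illustration: the closure populates caller-captured state, so the
// compiler cannot prove `list` was filled; an Option plus `.expect()` states
// that invariant explicitly instead of the `is_empty()` check.
fn make_set_like<F>(mut f: F) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    f() // the real `make_set` calls `f` once with the freshly built ArrayHashSet
}

fn main() {
    let mut list: Option<Vec<i32>> = None;
    make_set_like(|| {
        list = Some(vec![1, 2, 3]);
        Ok(())
    })
    .unwrap();
    let list = list.expect("make_set_like runs the closure exactly once");
    assert_eq!(list.len(), 3);
}
```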
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -474,16 +621,81 @@ pub fn in_list(
)))
}
+/// Create a new static InList expression from an array of values.
+///
+/// The `array` should contain the list of values for the `IN` clause.
+/// The `negated` flag indicates whether this is an `IN` or `NOT IN` expression.
+///
+/// The generated `InListExpr` will use the provided array as a static filter for efficient evaluation
+/// and the list of expressions will be constructed from the unique values in the array (with no guaranteed order).
+///
+/// # Errors
+/// Returns an error if the array type is not supported for `InList` expressions or if we cannot build a hash based lookup.
Review Comment:
This is an important property: this function will error if we can't build an efficient hash table lookup.
I would have rather kept it `pub(crate)`, but it's being used from a different crate so it has to be just `pub`.
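
A hedged usage sketch (import paths are assumptions; `in_list_from_array` lives alongside `in_list` in this file):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion_common::Result;
use datafusion_physical_expr::expressions::{col, in_list_from_array};
use datafusion_physical_expr::PhysicalExpr;

fn build_probe_filter() -> Result<Arc<dyn PhysicalExpr>> {
    let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]);
    // Build-side key values; duplicates are removed while building the set.
    let keys: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 2, 3]));
    let expr = col("a", &schema)?;
    // Errors (rather than silently degrading to a linear scan) if a
    // hash-based lookup cannot be built for this array type.
    in_list_from_array(expr, keys, /* negated = */ false)
}
```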
##########
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##########
@@ -1305,11 +1307,11 @@ async fn test_hashjoin_dynamic_filter_pushdown_partitioned() {
- CoalesceBatchesExec: target_batch_size=8192
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(a@0, a@0), (b@1, b@1)]
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
- - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb OR a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba ]
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
+ - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ CASE hash_repartition % 4 WHEN 0 THEN a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}]) WHEN 2 THEN a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:ab,c1:bb}]) ELSE false END ]
Review Comment:
Note that this automatically excludes empty partitions and defaults to
`false`, which works with inner joins. I'm not sure how we'd structure this for
other join types.
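
For reference, a sketch of how such a partition-keyed filter could be assembled: one `WHEN` arm per non-empty partition, `ELSE false` as the fall-through. `hash_expr` and `partition_filters` are stand-ins, not the PR's actual names:

```rust
use std::sync::Arc;
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::{lit, CaseExpr};
use datafusion_physical_expr::PhysicalExpr;

fn partitioned_case(
    hash_expr: Arc<dyn PhysicalExpr>, // e.g. hash_repartition % num_partitions
    partition_filters: Vec<Option<Arc<dyn PhysicalExpr>>>, // None = empty partition
) -> Result<Arc<dyn PhysicalExpr>> {
    let when_then = partition_filters
        .into_iter()
        .enumerate()
        // Empty partitions contribute no arm; their rows fall through to
        // ELSE false, which is only safe for inner joins.
        .filter_map(|(i, f)| f.map(|f| (lit(ScalarValue::UInt64(Some(i as u64))), f)))
        .collect::<Vec<_>>();
    Ok(Arc::new(CaseExpr::try_new(
        Some(hash_expr),
        when_then,
        Some(lit(false)),
    )?))
}
```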
##########
datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs:
##########
@@ -101,25 +179,64 @@ impl PartitionBounds {
///
/// All fields use a single mutex to ensure correct coordination between concurrent
/// partition executions.
-pub(crate) struct SharedBoundsAccumulator {
- /// Shared state protected by a single mutex to avoid ordering concerns
- inner: Mutex<SharedBoundsState>,
+pub(crate) struct SharedBuildAccumulator {
+ /// Build-side data protected by a single mutex to avoid ordering concerns
+ inner: Mutex<AccumulatedBuildData>,
barrier: Barrier,
/// Dynamic filter for pushdown to probe side
dynamic_filter: Arc<DynamicFilterPhysicalExpr>,
- /// Right side join expressions needed for creating filter bounds
+ /// Right side join expressions needed for creating filter expressions
on_right: Vec<PhysicalExprRef>,
+ /// Random state for partitioning (RepartitionExec's hash function with 0,0,0,0 seeds)
+ /// Used for PartitionedHashLookupPhysicalExpr
+ repartition_random_state: RandomState,
+ /// Schema of the probe (right) side for evaluating filter expressions
+ probe_schema: Arc<Schema>,
}
-/// State protected by SharedBoundsAccumulator's mutex
-struct SharedBoundsState {
- /// Bounds from completed partitions.
- /// Each element represents the column bounds computed by one partition.
- bounds: Vec<PartitionBounds>,
+/// Strategy for filter pushdown (decided at collection time)
+#[derive(Clone)]
+pub(crate) enum PushdownStrategy {
+ /// Use InList for small build sides (< 128MB)
+ InList(ArrayRef),
+ /// Use hash table lookup for large build sides
+ HashTable(Arc<dyn JoinHashMapType>),
+ /// There was no data in this partition, do not build a dynamic filter for it
+ Empty,
}
-impl SharedBoundsAccumulator {
- /// Creates a new SharedBoundsAccumulator configured for the given partition mode
+/// Build-side data reported by a single partition
+pub(crate) enum PartitionBuildData {
+ Partitioned {
+ partition_id: usize,
+ pushdown: PushdownStrategy,
+ bounds: PartitionBounds,
+ },
+ CollectLeft {
+ pushdown: PushdownStrategy,
+ bounds: PartitionBounds,
+ },
+}
+
+/// Per-partition accumulated data (Partitioned mode)
+#[derive(Clone)]
+struct PartitionData {
+ bounds: PartitionBounds,
+ pushdown: PushdownStrategy,
+}
+
+/// Build-side data organized by partition mode
+enum AccumulatedBuildData {
+ Partitioned {
+ partitions: Vec<Option<PartitionData>>,
+ },
+ CollectLeft {
+ data: Option<PartitionData>,
+ },
+}
Review Comment:
This is the refactoring of the data structures we store to track state. It's now much cleaner; for example, it avoids comparing partitions by id. There's a toy illustration of the shape change below.
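
```rust
// Toy illustration (types reduced to the essentials from the diff above):
// per-partition slots are indexed positionally, so recording a partition's
// data is a direct index rather than a scan comparing ids.
struct PartitionData; // stand-in for the real bounds + pushdown payload

enum AccumulatedBuildData {
    Partitioned { partitions: Vec<Option<PartitionData>> },
    CollectLeft { data: Option<PartitionData> },
}

fn report(acc: &mut AccumulatedBuildData, partition_id: usize, data: PartitionData) {
    match acc {
        // `partitions[i].is_some()` answers "has partition i reported?"
        // without any id comparison.
        AccumulatedBuildData::Partitioned { partitions } => {
            partitions[partition_id] = Some(data);
        }
        AccumulatedBuildData::CollectLeft { data: slot } => *slot = Some(data),
    }
}
```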
##########
datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs:
##########
@@ -0,0 +1,292 @@
+// Licensed to the Apache Software Foundation (ASF) under one
Review Comment:
This is the part used to handle the cases where we push down the entire hash
table.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -143,6 +144,76 @@ where
}
}
+/// Specialized Set implementation for StructArray
+struct StructArraySet {
+ array: Arc<StructArray>,
+ hash_set: ArrayHashSet,
+}
+
+impl Set for StructArraySet {
+ fn contains(&self, v: &dyn Array, negated: bool) -> Result<BooleanArray> {
+ // Handle dictionary arrays
+ downcast_dictionary_array! {
+ v => {
+ let values_contains = self.contains(v.values().as_ref(), negated)?;
+ let result = take(&values_contains, v.keys(), None)?;
+ return Ok(downcast_array(result.as_ref()))
+ }
+ _ => {}
+ }
+
+ let v = as_struct_array(v)?;
+ let has_nulls = self.array.null_count() != 0;
+
+ // Compute hashes for all rows in the input array
+ let mut input_hashes = vec![0u64; v.len()];
Review Comment:
Could we use a thread local to avoid repeated re-allocations here? I imagine
that would work quite well.
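Something like this thread-local scratch buffer could work, sketched under the assumption that the buffer is only needed for the duration of one `contains` call (names are mine, not the PR's):

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread scratch buffer for row hashes, reused across evaluations.
    static HASH_BUF: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

/// Runs `f` over a zeroed hash buffer of length `len`, reusing the allocation.
fn with_hash_buffer<R>(len: usize, f: impl FnOnce(&mut [u64]) -> R) -> R {
    HASH_BUF.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.resize(len, 0); // allocates only when `len` exceeds prior capacity
        f(buf.as_mut_slice())
    })
}
```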
##########
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##########
@@ -104,30 +108,12 @@ pub(super) struct JoinLeftData {
_reservation: MemoryReservation,
/// Bounds computed from the build side for dynamic filter pushdown
pub(super) bounds: Option<Vec<ColumnBounds>>,
+ /// Pushdown strategy for the build side: the unique build-side values as an
+ /// InList expression for small build sides, or a hash table reference otherwise
+ pub(super) membership: PushdownStrategy,
}
impl JoinLeftData {
- /// Create a new `JoinLeftData` from its parts
- pub(super) fn new(
Review Comment:
Clippy was complaining about too many arguments, and a constructor wasn't necessary anyway.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -180,29 +251,103 @@ where
ArrayHashSet { state, map }
}
-/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`
-fn make_set(array: &dyn Array) -> Result<Arc<dyn Set>> {
+/// Computes an [`ArrayHashSet`] for the provided [`StructArray`]
+fn make_struct_hash_set(array: &Arc<StructArray>) -> Result<ArrayHashSet> {
+ let state = RandomState::new();
+ let mut map: HashMap<usize, (), ()> =
+ HashMap::with_capacity_and_hasher(array.len(), ());
+
+ // Compute hashes for all rows
+ let mut row_hashes = vec![0u64; array.len()];
+ create_hashes_from_arrays(&[array.as_ref()], &state, &mut row_hashes)?;
+
+ // Build hash set, deduplicating based on struct equality
+ let insert_value = |idx: usize| {
+ let hash = row_hashes[idx];
+ if let RawEntryMut::Vacant(v) =
+ map.raw_entry_mut().from_hash(hash, |&existing_idx| {
+ // Compare the two struct rows for equality
+ // Use slice to get single-row arrays for comparison
+ let existing_row = array.slice(existing_idx, 1);
+ let current_row = array.slice(idx, 1);
+
+ // Use compare_with_eq for struct equality
+ match compare_with_eq(&existing_row, &current_row, true) {
+ Ok(result) => result.value(0),
+ Err(_) => false,
+ }
+ })
+ {
+ v.insert_with_hasher(hash, idx, (), |&x| row_hashes[x]);
+ }
+ };
+
+ match array.nulls() {
+ Some(nulls) => {
+ BitIndexIterator::new(nulls.validity(), nulls.offset(), nulls.len())
+ .for_each(insert_value)
+ }
+ None => (0..array.len()).for_each(insert_value),
+ }
+
+ Ok(ArrayHashSet { state, map })
+}
+
+/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`.
+/// Accepts a parameter `f` which allows access to the underlying `ArrayHashSet`
+/// and is used by [`in_list_from_array`] to collect unique indices.
+fn make_set<F>(array: &dyn Array, mut f: F) -> Result<Arc<dyn Set>>
+where
+ F: FnMut(ArrayHashSet) -> Result<ArrayHashSet>,
Review Comment:
This change adds a closure so we can cheaply recover the indexes of the unique items in the list, reusing the work we did to build the hash table when constructing the equality-check list.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -529,7 +741,18 @@ mod tests {
})
}
- // applies the in_list expr to an input batch and list
+ /// Test helper macro that evaluates an IN LIST expression with automatic type casting.
Review Comment:
No actual changes, just improved docstrings (AI-generated but still useful for human readers or future AI readers).
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -180,29 +251,103 @@ where
ArrayHashSet { state, map }
}
-/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`
-fn make_set(array: &dyn Array) -> Result<Arc<dyn Set>> {
+/// Computes an [`ArrayHashSet`] for the provided [`StructArray`]
+fn make_struct_hash_set(array: &Arc<StructArray>) -> Result<ArrayHashSet> {
+ let state = RandomState::new();
+ let mut map: HashMap<usize, (), ()> =
+ HashMap::with_capacity_and_hasher(array.len(), ());
+
+ // Compute hashes for all rows
+ let mut row_hashes = vec![0u64; array.len()];
+ create_hashes_from_arrays(&[array.as_ref()], &state, &mut row_hashes)?;
+
+ // Build hash set, deduplicating based on struct equality
+ let insert_value = |idx: usize| {
+ let hash = row_hashes[idx];
+ if let RawEntryMut::Vacant(v) =
+ map.raw_entry_mut().from_hash(hash, |&existing_idx| {
+ // Compare the two struct rows for equality
+ // Use slice to get single-row arrays for comparison
+ let existing_row = array.slice(existing_idx, 1);
+ let current_row = array.slice(idx, 1);
+
+ // Use compare_with_eq for struct equality
+ match compare_with_eq(&existing_row, &current_row, true) {
+ Ok(result) => result.value(0),
+ Err(_) => false,
+ }
+ })
+ {
+ v.insert_with_hasher(hash, idx, (), |&x| row_hashes[x]);
+ }
+ };
+
+ match array.nulls() {
+ Some(nulls) => {
+ BitIndexIterator::new(nulls.validity(), nulls.offset(), nulls.len())
+ .for_each(insert_value)
+ }
+ None => (0..array.len()).for_each(insert_value),
+ }
+
+ Ok(ArrayHashSet { state, map })
+}
+
+/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`.
+/// Accepts a parameter `f` which allows access to the underlying `ArrayHashSet`
+/// and is used by [`in_list_from_array`] to collect unique indices.
+fn make_set<F>(array: &dyn Array, mut f: F) -> Result<Arc<dyn Set>>
+where
+ F: FnMut(ArrayHashSet) -> Result<ArrayHashSet>,
+{
Ok(downcast_primitive_array! {
- array => Arc::new(ArraySet::new(array, make_hash_set(array))),
+ array => {
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ },
DataType::Boolean => {
let array = as_boolean_array(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
},
DataType::Utf8 => {
let array = as_string_array(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
}
DataType::LargeUtf8 => {
let array = as_largestring_array(array);
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ }
+ DataType::Utf8View => {
+ let array = as_string_view_array(array)?;
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
}
DataType::Binary => {
let array = as_generic_binary_array::<i32>(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
}
DataType::LargeBinary => {
let array = as_generic_binary_array::<i64>(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ }
+ DataType::BinaryView => {
+ let array = as_binary_view_array(array)?;
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ }
Review Comment:
I added support for BinaryView and StringView; could be its own PR.
##########
datafusion/physical-plan/src/joins/hash_join/inlist_builder.rs:
##########
@@ -0,0 +1,155 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Utilities for building InList expressions from hash join build side data
+
+use std::sync::Arc;
+
+use arrow::array::{ArrayRef, StructArray};
+use arrow::datatypes::{Field, FieldRef, Fields};
+use arrow_schema::DataType;
+use datafusion_common::Result;
+
+pub(crate) fn build_struct_fields(data_types: &[DataType]) -> Result<Fields> {
+ data_types
+ .iter()
+ .enumerate()
+ .map(|(i, dt)| Ok(Field::new(format!("c{i}"), dt.clone(), false)))
+ .collect()
+}
+
+/// Builds InList values from join key column arrays.
+///
+/// If `join_key_arrays` is:
+/// 1. A single array, let's say Int32, this will produce a flat InList expression
+///    where the lookup is expected to be scalar Int32 values, that is: this will
+///    produce `IN LIST (1, 2, 3)` expected to be used as `2 IN LIST (1, 2, 3)`.
+/// 2. An Int32 array and a Utf8 array, this will produce a Struct InList expression
+///    where the lookup is expected to be Struct values with two fields (Int32, Utf8),
+///    that is: this will produce `IN LIST ((1, "a"), (2, "b"))` expected to be used
+///    as `(2, "b") IN LIST ((1, "a"), (2, "b"))`.
+///
+/// The field names of the struct are auto-generated as "c0", "c1", ... and should match the struct expression used in the join keys.
+///
+/// Note that this will not deduplicate values; deduplication happens later when building an InList expression from this array.
+///
+/// Returns `None` if the estimated size exceeds `max_size_bytes`.
+pub fn build_struct_inlist_values(
+ join_key_arrays: &[ArrayRef],
+ max_size_bytes: usize,
+) -> Result<Option<ArrayRef>> {
+ if join_key_arrays.is_empty() {
+ return Ok(None);
+ }
+
+ let num_rows = join_key_arrays[0].len();
+ if num_rows == 0 {
+ return Ok(None);
+ }
+
+ // Size check using built-in method
+ // This is not 1:1 with the actual size of ScalarValues, but it is a good approximation
+ // and at this point is basically "free" to compute since we have the arrays already.
+ let estimated_size = join_key_arrays
+ .iter()
+ .map(|arr| arr.get_array_memory_size())
+ .sum::<usize>();
+
+ if estimated_size > max_size_bytes {
+ return Ok(None);
+ }
Review Comment:
Note: this is where we check the estimated size to enforce the size limit.
##########
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##########
@@ -1487,15 +1490,34 @@ async fn collect_left_input(
_ => None,
};
- let data = JoinLeftData::new(
- hashmap,
- single_batch,
- left_values.clone(),
- Mutex::new(visited_indices_bitmap),
- AtomicUsize::new(probe_threads_count),
- reservation,
+ // Convert Box to Arc for sharing with SharedBuildAccumulator
+ let hash_map: Arc<dyn JoinHashMapType> = hashmap.into();
+
+ let membership = if num_rows == 0 {
+ PushdownStrategy::Empty
+ } else {
+ // If the build side is small enough we can use IN list pushdown.
+ // If it's too big we fall back to pushing down a reference to the hash table.
+ // See `PushdownStrategy` for more details.
+ if let Some(in_list_values) =
+ build_struct_inlist_values(&left_values, max_inlist_size)?
+ {
+ PushdownStrategy::InList(in_list_values)
+ } else {
+ PushdownStrategy::HashTable(Arc::clone(&hash_map))
+ }
+ };
Review Comment:
I did some remodeling of the data structures we use to track state. I think the new structure is much better, even if we don't move forward with the rest of the PR.
##########
datafusion/common/src/hash_utils.rs:
##########
@@ -366,35 +366,21 @@ fn hash_fixed_list_array(
Ok(())
}
-/// Test version of `create_hashes` that produces the same value for
-/// all hashes (to test collisions)
-///
-/// See comments on `hashes_buffer` for more details
-#[cfg(feature = "force_hash_collisions")]
-pub fn create_hashes<'a>(
- _arrays: &[ArrayRef],
- _random_state: &RandomState,
- hashes_buffer: &'a mut Vec<u64>,
-) -> Result<&'a mut Vec<u64>> {
- for hash in hashes_buffer.iter_mut() {
- *hash = 0
- }
- Ok(hashes_buffer)
-}
-
/// Creates hash values for every row, based on the values in the
/// columns.
///
/// The number of rows to hash is determined by `hashes_buffer.len()`.
/// `hashes_buffer` should be pre-sized appropriately
+///
+/// This function accepts array references directly, avoiding the need to
+/// wrap arrays in `Arc` at the call site.
#[cfg(not(feature = "force_hash_collisions"))]
-pub fn create_hashes<'a>(
- arrays: &[ArrayRef],
+pub fn create_hashes_from_arrays<'a>(
+ arrays: &[&dyn Array],
Review Comment:
Recommend this for its own PR.
I think this is a nice refactor for this function; however, I decided not to deprecate/replace the old version here to avoid churn. If this were its own PR, I think it would be worth making a new version and deprecating the old one, replacing all references in DataFusion (which I imagine covers most users; even though this is `pub`, I think it's mostly `pub` so it can be used by other crates within this repo). I did not investigate whether this can help avoid clones at any other call sites.
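
A before/after sketch of the call-site difference (module path assumed from the diff; the `0,0,0,0` seeds mirror the RepartitionExec convention mentioned earlier):

```rust
use arrow::array::{Array, Int32Array};
use datafusion_common::hash_utils::create_hashes_from_arrays;
use datafusion_common::Result;

fn hash_rows(a: &Int32Array, b: &Int32Array) -> Result<Vec<u64>> {
    let state = ahash::RandomState::with_seeds(0, 0, 0, 0);
    let mut hashes = vec![0u64; a.len()];
    // Plain `&dyn Array` references suffice; with `create_hashes` the caller
    // had to supply `&[ArrayRef]`, often forcing an Arc clone just to build
    // the slice.
    create_hashes_from_arrays(&[a as &dyn Array, b as &dyn Array], &state, &mut hashes)?;
    Ok(hashes)
}
```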