adriangb commented on code in PR #18393:
URL: https://github.com/apache/datafusion/pull/18393#discussion_r2482567563
##########
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##########
@@ -1322,11 +1324,11 @@ async fn test_hashjoin_dynamic_filter_pushdown_partitioned() {
- CoalesceBatchesExec: target_batch_size=8192
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(a@0, a@0), (b@1, b@1)]
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
- - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb ]
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
+ - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}, {c0:ab,c1:bb}]) ]
Review Comment:
In this case there's only one partition with data (because of hash collisions), so we optimize away the `CASE` expression. This is relevant because the same thing would happen with a point-lookup primary key join.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -143,6 +144,76 @@ where
}
}
+/// Specialized Set implementation for StructArray
+struct StructArraySet {
+ array: Arc<StructArray>,
+ hash_set: ArrayHashSet,
+}
Review Comment:
I think this is a nice improvement / feature that can easily be its own PR.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -474,16 +621,81 @@ pub fn in_list(
)))
}
+/// Create a new static InList expression from an array of values.
+///
+/// The `array` should contain the list of values for the `IN` clause.
+/// The `negated` flag indicates whether this is an `IN` or `NOT IN` expression.
+///
+/// The generated `InListExpr` will use the provided array as a static filter for efficient evaluation
+/// and the list of expressions will be constructed from the unique values in the array (with no guaranteed order).
+///
+/// # Errors
+/// Returns an error if the array type is not supported for `InList` expressions or if we cannot build a hash based lookup.
+pub fn in_list_from_array(
+ expr: Arc<dyn PhysicalExpr>,
+ array: ArrayRef,
+ negated: bool,
+) -> Result<Arc<dyn PhysicalExpr>> {
+ let mut list = Vec::new();
+ let static_filter = make_set(array.as_ref(), |hash_set| {
+ // If we are in tests, sort the keys so we get a deterministic order for the list.
+ // Use cfg(any(test, debug_assertions)) to cover both unit tests and integration tests.
+ #[cfg(any(test, debug_assertions))]
+ {
+ use itertools::Itertools;
+ list = hash_set
+ .map
+ .keys()
+ .sorted()
+ .map(|&i| {
+ let scalar = ScalarValue::try_from_array(&array, i)?;
+ Ok(crate::expressions::lit(scalar))
+ })
+ .collect::<Result<Vec<_>>>()?;
+ }
+ #[cfg(not(any(test, debug_assertions)))]
+ {
+ list = hash_set
+ .map
+ .keys()
+ .map(|&i| {
+ let scalar = ScalarValue::try_from_array(&array, i)?;
+ Ok(crate::expressions::lit(scalar))
+ })
+ .collect::<Result<Vec<_>>>()?;
+ }
+ Ok(hash_set)
+ })?;
+
+ // If make_set failed, build list from full array
+ if list.is_empty() && !array.is_empty() {
Review Comment:
This is just weirdness with the typing. We could also just `.expect()` on `list`.
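
For illustration, a self-contained toy version of the `.expect()` variant (`make_set_like` is a stand-in for `make_set`, which invokes the closure once on success; none of these names are from the PR):

```rust
// Toy illustration: the closure populates caller-captured state, so the
// compiler cannot prove `list` was filled; an Option plus `.expect()` states
// that invariant explicitly instead of the `is_empty()` check.
fn make_set_like<F>(mut f: F) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    f() // the real `make_set` calls `f` once with the freshly built ArrayHashSet
}

fn main() {
    let mut list: Option<Vec<i32>> = None;
    make_set_like(|| {
        list = Some(vec![1, 2, 3]);
        Ok(())
    })
    .unwrap();
    let list = list.expect("make_set_like runs the closure exactly once");
    assert_eq!(list.len(), 3);
}
```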
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -474,16 +621,81 @@ pub fn in_list(
)))
}
+/// Create a new static InList expression from an array of values.
+///
+/// The `array` should contain the list of values for the `IN` clause.
+/// The `negated` flag indicates whether this is an `IN` or `NOT IN` expression.
+///
+/// The generated `InListExpr` will use the provided array as a static filter for efficient evaluation
+/// and the list of expressions will be constructed from the unique values in the array (with no guaranteed order).
+///
+/// # Errors
+/// Returns an error if the array type is not supported for `InList` expressions or if we cannot build a hash based lookup.
Review Comment:
This is an important property: this function will error if we can't build an efficient hash table lookup.
I would have rather kept it `pub(crate)`, but it's being used from a different crate so it has to be just `pub`.
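
A hedged usage sketch (import paths are assumptions; `in_list_from_array` lives alongside `in_list` in this file):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion_common::Result;
use datafusion_physical_expr::expressions::{col, in_list_from_array};
use datafusion_physical_expr::PhysicalExpr;

fn build_probe_filter() -> Result<Arc<dyn PhysicalExpr>> {
    let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]);
    // Build-side key values; duplicates are removed while building the set.
    let keys: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 2, 3]));
    let expr = col("a", &schema)?;
    // Errors (rather than silently degrading to a linear scan) if a
    // hash-based lookup cannot be built for this array type.
    in_list_from_array(expr, keys, /* negated = */ false)
}
```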
##########
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##########
@@ -1305,11 +1307,11 @@ async fn test_hashjoin_dynamic_filter_pushdown_partitioned() {
- CoalesceBatchesExec: target_batch_size=8192
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(a@0, a@0), (b@1, b@1)]
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true
- CoalesceBatchesExec: target_batch_size=8192
- - RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
- - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb OR a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba ]
+ - RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
+ - DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ CASE hash_repartition % 4 WHEN 0 THEN a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}]) WHEN 2 THEN a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:ab,c1:bb}]) ELSE false END ]
Review Comment:
Note that this automatically excludes empty partitions and defaults to
`false`, which works with inner joins. I'm not sure how we'd structure this for
other join types.
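
For reference, a sketch of how such a partition-keyed filter could be assembled: one `WHEN` arm per non-empty partition, `ELSE false` as the fall-through. `hash_expr` and `partition_filters` are stand-ins, not the PR's actual names:

```rust
use std::sync::Arc;
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::{lit, CaseExpr};
use datafusion_physical_expr::PhysicalExpr;

fn partitioned_case(
    hash_expr: Arc<dyn PhysicalExpr>, // e.g. hash_repartition % num_partitions
    partition_filters: Vec<Option<Arc<dyn PhysicalExpr>>>, // None = empty partition
) -> Result<Arc<dyn PhysicalExpr>> {
    let when_then = partition_filters
        .into_iter()
        .enumerate()
        // Empty partitions contribute no arm; their rows fall through to
        // ELSE false, which is only safe for inner joins.
        .filter_map(|(i, f)| f.map(|f| (lit(ScalarValue::UInt64(Some(i as u64))), f)))
        .collect::<Vec<_>>();
    Ok(Arc::new(CaseExpr::try_new(
        Some(hash_expr),
        when_then,
        Some(lit(false)),
    )?))
}
```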
##########
datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs:
##########
@@ -101,25 +179,64 @@ impl PartitionBounds {
///
/// All fields use a single mutex to ensure correct coordination between concurrent
/// partition executions.
-pub(crate) struct SharedBoundsAccumulator {
- /// Shared state protected by a single mutex to avoid ordering concerns
- inner: Mutex<SharedBoundsState>,
+pub(crate) struct SharedBuildAccumulator {
+ /// Build-side data protected by a single mutex to avoid ordering concerns
+ inner: Mutex<AccumulatedBuildData>,
barrier: Barrier,
/// Dynamic filter for pushdown to probe side
dynamic_filter: Arc<DynamicFilterPhysicalExpr>,
- /// Right side join expressions needed for creating filter bounds
+ /// Right side join expressions needed for creating filter expressions
on_right: Vec<PhysicalExprRef>,
+ /// Random state for partitioning (RepartitionExec's hash function with 0,0,0,0 seeds)
+ /// Used for PartitionedHashLookupPhysicalExpr
+ repartition_random_state: RandomState,
+ /// Schema of the probe (right) side for evaluating filter expressions
+ probe_schema: Arc<Schema>,
}
-/// State protected by SharedBoundsAccumulator's mutex
-struct SharedBoundsState {
- /// Bounds from completed partitions.
- /// Each element represents the column bounds computed by one partition.
- bounds: Vec<PartitionBounds>,
+/// Strategy for filter pushdown (decided at collection time)
+#[derive(Clone)]
+pub(crate) enum PushdownStrategy {
+ /// Use InList for small build sides (< 128MB)
+ InList(ArrayRef),
+ /// Use hash table lookup for large build sides
+ HashTable(Arc<dyn JoinHashMapType>),
+ /// There was no data in this partition, do not build a dynamic filter for it
+ Empty,
}
-impl SharedBoundsAccumulator {
- /// Creates a new SharedBoundsAccumulator configured for the given partition mode
+/// Build-side data reported by a single partition
+pub(crate) enum PartitionBuildData {
+ Partitioned {
+ partition_id: usize,
+ pushdown: PushdownStrategy,
+ bounds: PartitionBounds,
+ },
+ CollectLeft {
+ pushdown: PushdownStrategy,
+ bounds: PartitionBounds,
+ },
+}
+
+/// Per-partition accumulated data (Partitioned mode)
+#[derive(Clone)]
+struct PartitionData {
+ bounds: PartitionBounds,
+ pushdown: PushdownStrategy,
+}
+
+/// Build-side data organized by partition mode
+enum AccumulatedBuildData {
+ Partitioned {
+ partitions: Vec<Option<PartitionData>>,
+ },
+ CollectLeft {
+ data: Option<PartitionData>,
+ },
+}
Review Comment:
This is the refactoring of the data structures we store to track state. It's now much cleaner; for example, it avoids comparing partitions by id. There's a toy illustration of the shape change below.
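
```rust
// Toy illustration (types reduced to the essentials from the diff above):
// per-partition slots are indexed positionally, so recording a partition's
// data is a direct index rather than a scan comparing ids.
struct PartitionData; // stand-in for the real bounds + pushdown payload

enum AccumulatedBuildData {
    Partitioned { partitions: Vec<Option<PartitionData>> },
    CollectLeft { data: Option<PartitionData> },
}

fn report(acc: &mut AccumulatedBuildData, partition_id: usize, data: PartitionData) {
    match acc {
        // `partitions[i].is_some()` answers "has partition i reported?"
        // without any id comparison.
        AccumulatedBuildData::Partitioned { partitions } => {
            partitions[partition_id] = Some(data);
        }
        AccumulatedBuildData::CollectLeft { data: slot } => *slot = Some(data),
    }
}
```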
##########
datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs:
##########
@@ -0,0 +1,292 @@
+// Licensed to the Apache Software Foundation (ASF) under one
Review Comment:
This is the part used to handle the cases where we push down the entire hash
table.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -143,6 +144,76 @@ where
}
}
+/// Specialized Set implementation for StructArray
+struct StructArraySet {
+ array: Arc<StructArray>,
+ hash_set: ArrayHashSet,
+}
+
+impl Set for StructArraySet {
+ fn contains(&self, v: &dyn Array, negated: bool) -> Result<BooleanArray> {
+ // Handle dictionary arrays
+ downcast_dictionary_array! {
+ v => {
+ let values_contains = self.contains(v.values().as_ref(), negated)?;
+ let result = take(&values_contains, v.keys(), None)?;
+ return Ok(downcast_array(result.as_ref()))
+ }
+ _ => {}
+ }
+
+ let v = as_struct_array(v)?;
+ let has_nulls = self.array.null_count() != 0;
+
+ // Compute hashes for all rows in the input array
+ let mut input_hashes = vec![0u64; v.len()];
Review Comment:
Could we use a thread local to avoid repeated re-allocations here? I imagine
that would work quite well.
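Something like this thread-local scratch buffer could work, sketched under the assumption that the buffer is only needed for the duration of one `contains` call (names are mine, not the PR's):

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread scratch buffer for row hashes, reused across evaluations.
    static HASH_BUF: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

/// Runs `f` over a zeroed hash buffer of length `len`, reusing the allocation.
fn with_hash_buffer<R>(len: usize, f: impl FnOnce(&mut [u64]) -> R) -> R {
    HASH_BUF.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.resize(len, 0); // allocates only when `len` exceeds prior capacity
        f(buf.as_mut_slice())
    })
}
```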
##########
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##########
@@ -104,30 +108,12 @@ pub(super) struct JoinLeftData {
_reservation: MemoryReservation,
/// Bounds computed from the build side for dynamic filter pushdown
pub(super) bounds: Option<Vec<ColumnBounds>>,
+ /// Pushdown strategy for the build side: the unique build-side values as an
+ /// InList expression for small build sides, or a hash table reference otherwise
+ pub(super) membership: PushdownStrategy,
}
impl JoinLeftData {
- /// Create a new `JoinLeftData` from its parts
- pub(super) fn new(
Review Comment:
Clippy was complaining about too many arguments, and a constructor wasn't necessary anyway.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -180,29 +251,103 @@ where
ArrayHashSet { state, map }
}
-/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`
-fn make_set(array: &dyn Array) -> Result<Arc<dyn Set>> {
+/// Computes an [`ArrayHashSet`] for the provided [`StructArray`]
+fn make_struct_hash_set(array: &Arc<StructArray>) -> Result<ArrayHashSet> {
+ let state = RandomState::new();
+ let mut map: HashMap<usize, (), ()> =
+ HashMap::with_capacity_and_hasher(array.len(), ());
+
+ // Compute hashes for all rows
+ let mut row_hashes = vec![0u64; array.len()];
+ create_hashes_from_arrays(&[array.as_ref()], &state, &mut row_hashes)?;
+
+ // Build hash set, deduplicating based on struct equality
+ let insert_value = |idx: usize| {
+ let hash = row_hashes[idx];
+ if let RawEntryMut::Vacant(v) =
+ map.raw_entry_mut().from_hash(hash, |&existing_idx| {
+ // Compare the two struct rows for equality
+ // Use slice to get single-row arrays for comparison
+ let existing_row = array.slice(existing_idx, 1);
+ let current_row = array.slice(idx, 1);
+
+ // Use compare_with_eq for struct equality
+ match compare_with_eq(&existing_row, &current_row, true) {
+ Ok(result) => result.value(0),
+ Err(_) => false,
+ }
+ })
+ {
+ v.insert_with_hasher(hash, idx, (), |&x| row_hashes[x]);
+ }
+ };
+
+ match array.nulls() {
+ Some(nulls) => {
+ BitIndexIterator::new(nulls.validity(), nulls.offset(), nulls.len())
+ .for_each(insert_value)
+ }
+ None => (0..array.len()).for_each(insert_value),
+ }
+
+ Ok(ArrayHashSet { state, map })
+}
+
+/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`.
+/// Accepts a parameter `f` which allows access to the underlying `ArrayHashSet`
+/// and is used by [`in_list_from_array`] to collect unique indices.
+fn make_set<F>(array: &dyn Array, mut f: F) -> Result<Arc<dyn Set>>
+where
+ F: FnMut(ArrayHashSet) -> Result<ArrayHashSet>,
Review Comment:
This change adds a closure so we can cheaply recover the indexes of the unique items in the list, reusing the work we did to build the hash table when constructing the equality-check list.
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -529,7 +741,18 @@ mod tests {
})
}
- // applies the in_list expr to an input batch and list
+ /// Test helper macro that evaluates an IN LIST expression with automatic type casting.
Review Comment:
No actual changes, just improved docstrings (AI-generated but still useful for human readers or future AI readers).
##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -180,29 +251,103 @@ where
ArrayHashSet { state, map }
}
-/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`
-fn make_set(array: &dyn Array) -> Result<Arc<dyn Set>> {
+/// Computes an [`ArrayHashSet`] for the provided [`StructArray`]
+fn make_struct_hash_set(array: &Arc<StructArray>) -> Result<ArrayHashSet> {
+ let state = RandomState::new();
+ let mut map: HashMap<usize, (), ()> =
+ HashMap::with_capacity_and_hasher(array.len(), ());
+
+ // Compute hashes for all rows
+ let mut row_hashes = vec![0u64; array.len()];
+ create_hashes_from_arrays(&[array.as_ref()], &state, &mut row_hashes)?;
+
+ // Build hash set, deduplicating based on struct equality
+ let insert_value = |idx: usize| {
+ let hash = row_hashes[idx];
+ if let RawEntryMut::Vacant(v) =
+ map.raw_entry_mut().from_hash(hash, |&existing_idx| {
+ // Compare the two struct rows for equality
+ // Use slice to get single-row arrays for comparison
+ let existing_row = array.slice(existing_idx, 1);
+ let current_row = array.slice(idx, 1);
+
+ // Use compare_with_eq for struct equality
+ match compare_with_eq(&existing_row, &current_row, true) {
+ Ok(result) => result.value(0),
+ Err(_) => false,
+ }
+ })
+ {
+ v.insert_with_hasher(hash, idx, (), |&x| row_hashes[x]);
+ }
+ };
+
+ match array.nulls() {
+ Some(nulls) => {
+ BitIndexIterator::new(nulls.validity(), nulls.offset(), nulls.len())
+ .for_each(insert_value)
+ }
+ None => (0..array.len()).for_each(insert_value),
+ }
+
+ Ok(ArrayHashSet { state, map })
+}
+
+/// Creates a `Box<dyn Set>` for the given list of `IN` expressions and `batch`.
+/// Accepts a parameter `f` which allows access to the underlying `ArrayHashSet`
+/// and is used by [`in_list_from_array`] to collect unique indices.
+fn make_set<F>(array: &dyn Array, mut f: F) -> Result<Arc<dyn Set>>
+where
+ F: FnMut(ArrayHashSet) -> Result<ArrayHashSet>,
+{
Ok(downcast_primitive_array! {
- array => Arc::new(ArraySet::new(array, make_hash_set(array))),
+ array => {
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ },
DataType::Boolean => {
let array = as_boolean_array(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
},
DataType::Utf8 => {
let array = as_string_array(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
}
DataType::LargeUtf8 => {
let array = as_largestring_array(array);
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ }
+ DataType::Utf8View => {
+ let array = as_string_view_array(array)?;
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
}
DataType::Binary => {
let array = as_generic_binary_array::<i32>(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
}
DataType::LargeBinary => {
let array = as_generic_binary_array::<i64>(array)?;
- Arc::new(ArraySet::new(array, make_hash_set(array)))
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ }
+ DataType::BinaryView => {
+ let array = as_binary_view_array(array)?;
+ let hash_set = f(make_hash_set(array))?;
+ Arc::new(ArraySet::new(array, hash_set))
+ }
Review Comment:
I added support for BinaryView and StringView; could be its own PR.
##########
datafusion/physical-plan/src/joins/hash_join/inlist_builder.rs:
##########
@@ -0,0 +1,155 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Utilities for building InList expressions from hash join build side data
+
+use std::sync::Arc;
+
+use arrow::array::{ArrayRef, StructArray};
+use arrow::datatypes::{Field, FieldRef, Fields};
+use arrow_schema::DataType;
+use datafusion_common::Result;
+
+pub(crate) fn build_struct_fields(data_types: &[DataType]) -> Result<Fields> {
+ data_types
+ .iter()
+ .enumerate()
+ .map(|(i, dt)| Ok(Field::new(format!("c{i}"), dt.clone(), false)))
+ .collect()
+}
+
+/// Builds InList values from join key column arrays.
+///
+/// If `join_key_arrays` is:
+/// 1. A single array, let's say Int32, this will produce a flat InList expression
+///    where the lookup is expected to be scalar Int32 values, that is: this will
+///    produce `IN LIST (1, 2, 3)` expected to be used as `2 IN LIST (1, 2, 3)`.
+/// 2. An Int32 array and a Utf8 array, this will produce a Struct InList expression
+///    where the lookup is expected to be Struct values with two fields (Int32, Utf8),
+///    that is: this will produce `IN LIST ((1, "a"), (2, "b"))` expected to be used
+///    as `(2, "b") IN LIST ((1, "a"), (2, "b"))`.
+///
+/// The field names of the struct are auto-generated as "c0", "c1", ... and should match the struct expression used in the join keys.
+///
+/// Note that this will not deduplicate values; deduplication happens later when building an InList expression from this array.
+///
+/// Returns `None` if the estimated size exceeds `max_size_bytes`.
+pub fn build_struct_inlist_values(
+ join_key_arrays: &[ArrayRef],
+ max_size_bytes: usize,
+) -> Result<Option<ArrayRef>> {
+ if join_key_arrays.is_empty() {
+ return Ok(None);
+ }
+
+ let num_rows = join_key_arrays[0].len();
+ if num_rows == 0 {
+ return Ok(None);
+ }
+
+ // Size check using built-in method
+ // This is not 1:1 with the actual size of ScalarValues, but it is a good approximation
+ // and at this point is basically "free" to compute since we have the arrays already.
+ let estimated_size = join_key_arrays
+ .iter()
+ .map(|arr| arr.get_array_memory_size())
+ .sum::<usize>();
+
+ if estimated_size > max_size_bytes {
+ return Ok(None);
+ }
Review Comment:
Note: this is where we check the estimated size to enforce the size limit.
##########
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##########
@@ -1487,15 +1490,34 @@ async fn collect_left_input(
_ => None,
};
- let data = JoinLeftData::new(
- hashmap,
- single_batch,
- left_values.clone(),
- Mutex::new(visited_indices_bitmap),
- AtomicUsize::new(probe_threads_count),
- reservation,
+ // Convert Box to Arc for sharing with SharedBuildAccumulator
+ let hash_map: Arc<dyn JoinHashMapType> = hashmap.into();
+
+ let membership = if num_rows == 0 {
+ PushdownStrategy::Empty
+ } else {
+ // If the build side is small enough we can use IN list pushdown.
+ // If it's too big we fall back to pushing down a reference to the hash table.
+ // See `PushdownStrategy` for more details.
+ if let Some(in_list_values) =
+ build_struct_inlist_values(&left_values, max_inlist_size)?
+ {
+ PushdownStrategy::InList(in_list_values)
+ } else {
+ PushdownStrategy::HashTable(Arc::clone(&hash_map))
+ }
+ };
Review Comment:
I did some remodeling of the data structures we use to track state. I think the new structure is much better, even if we don't move forward with the rest of the PR.
##########
datafusion/common/src/hash_utils.rs:
##########
@@ -366,35 +366,21 @@ fn hash_fixed_list_array(
Ok(())
}
-/// Test version of `create_hashes` that produces the same value for
-/// all hashes (to test collisions)
-///
-/// See comments on `hashes_buffer` for more details
-#[cfg(feature = "force_hash_collisions")]
-pub fn create_hashes<'a>(
- _arrays: &[ArrayRef],
- _random_state: &RandomState,
- hashes_buffer: &'a mut Vec<u64>,
-) -> Result<&'a mut Vec<u64>> {
- for hash in hashes_buffer.iter_mut() {
- *hash = 0
- }
- Ok(hashes_buffer)
-}
-
/// Creates hash values for every row, based on the values in the
/// columns.
///
/// The number of rows to hash is determined by `hashes_buffer.len()`.
/// `hashes_buffer` should be pre-sized appropriately
+///
+/// This function accepts array references directly, avoiding the need to
+/// wrap arrays in `Arc` at the call site.
#[cfg(not(feature = "force_hash_collisions"))]
-pub fn create_hashes<'a>(
- arrays: &[ArrayRef],
+pub fn create_hashes_from_arrays<'a>(
+ arrays: &[&dyn Array],
Review Comment:
Recommend this for its own PR.
I think this is a nice refactor for this function; however, I decided not to deprecate/replace the old version here to avoid churn. If this were its own PR, I think it would be worth making a new version and deprecating the old one, replacing all references in DataFusion (which I imagine covers most users; even though this is `pub`, I think it's mostly `pub` so it can be used by other crates within this repo). I did not investigate whether this can help avoid clones at any other call sites.
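
A before/after sketch of the call-site difference (module path assumed from the diff; the `0,0,0,0` seeds mirror the RepartitionExec convention mentioned earlier):

```rust
use arrow::array::{Array, Int32Array};
use datafusion_common::hash_utils::create_hashes_from_arrays;
use datafusion_common::Result;

fn hash_rows(a: &Int32Array, b: &Int32Array) -> Result<Vec<u64>> {
    let state = ahash::RandomState::with_seeds(0, 0, 0, 0);
    let mut hashes = vec![0u64; a.len()];
    // Plain `&dyn Array` references suffice; with `create_hashes` the caller
    // had to supply `&[ArrayRef]`, often forcing an Arc clone just to build
    // the slice.
    create_hashes_from_arrays(&[a as &dyn Array, b as &dyn Array], &state, &mut hashes)?;
    Ok(hashes)
}
```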