haohuaijin opened a new issue, #22796:
URL: https://github.com/apache/datafusion/issues/22796
### Describe the bug
`approx_distinct` over a `Utf8View` column can report an inflated distinct
count. The same string value may be hashed in two different ways depending on
which **batch** it arrives in (in the reproduction below, an all-inline batch
versus one that carries a data buffer), so one distinct value reaches the
HyperLogLog sketch under two different hashes and is counted more than once.
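
To illustrate why inconsistent hashing inflates the count, here is a minimal, std-only sketch. It is not DataFusion's code: the two hash paths (`hash_as_bytes`, `hash_as_inline_view`), the per-batch `all_inline` flag, and the `HashSet` standing in for the HyperLogLog sketch are all hypothetical names and simplifications chosen for this example. The only outside fact it relies on is that a short string view packs a 4-byte length and up to 12 inline bytes into 16 bytes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical path A: hash the string's bytes directly.
fn hash_as_bytes(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.as_bytes().hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical path B: hash the packed 16-byte view of a short string
/// (4-byte little-endian length followed by up to 12 inline bytes).
fn hash_as_inline_view(s: &str) -> u64 {
    assert!(s.len() <= 12, "only short strings fit inline");
    let mut view = [0u8; 16];
    view[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    view[4..4 + s.len()].copy_from_slice(s.as_bytes());
    let mut hasher = DefaultHasher::new();
    u128::from_le_bytes(view).hash(&mut hasher);
    hasher.finish()
}

/// Stand-in for the sketch: an exact set of hashes instead of HLL registers.
/// The `bool` marks whether a batch is all-inline (assumed to pick path B).
fn count_distinct_hashes(batches: &[(bool, Vec<&str>)]) -> usize {
    let mut seen = HashSet::new();
    for (all_inline, values) in batches {
        for v in values.iter() {
            let h = if *all_inline {
                hash_as_inline_view(v)
            } else {
                hash_as_bytes(v)
            };
            seen.insert(h);
        }
    }
    seen.len()
}
```

Under these assumptions, a single mixed batch sends every value through one path and yields 3 distinct hashes, while splitting the same values so that `"aaa"` appears in both an all-inline batch and a buffered batch yields 4, because `"aaa"` is hashed once per path.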
### To Reproduce
```rust
fn distinct_count(acc: &mut StringViewHLLAccumulator) -> u64 {
    match acc.evaluate().unwrap() {
        ScalarValue::UInt64(Some(v)) => v,
        other => panic!("unexpected evaluate result: {other:?}"),
    }
}

// A string longer than the 12-byte inline limit
const LONG: &str = "this string is definitely longer than twelve bytes";

#[test]
fn split_batches_match_single_mixed_batch() {
    // Multiset: {"aaa" x2, "bbb", LONG}, so 3 distinct values.
    let mixed: ArrayRef =
        Arc::new(StringViewArray::from(vec!["aaa", "bbb", LONG, "aaa"]));
    let mut acc_single = StringViewHLLAccumulator::new();
    acc_single.update_batch(&[mixed]).unwrap();

    // Same multiset, but split so "aaa" lands in both an all-inline batch
    // and a batch with a data buffer (forced by LONG).
    let inline_only: ArrayRef =
        Arc::new(StringViewArray::from(vec!["aaa", "bbb"]));
    let with_buffer: ArrayRef =
        Arc::new(StringViewArray::from(vec!["aaa", LONG]));
    assert!(inline_only.as_string_view().data_buffers().is_empty());
    assert!(!with_buffer.as_string_view().data_buffers().is_empty());

    let mut acc_split = StringViewHLLAccumulator::new();
    acc_split.update_batch(&[inline_only]).unwrap();
    acc_split.update_batch(&[with_buffer]).unwrap();

    assert_eq!(
        distinct_count(&mut acc_single),
        distinct_count(&mut acc_split)
    );
    assert_eq!(distinct_count(&mut acc_single), 3);
}
```
### Expected behavior
_No response_
### Additional context
Found this while working on https://github.com/apache/datafusion/pull/22768