scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550215532
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+ // Try to return the typed value directly when we have a perfect shredding
match.
+ if !matches!(as_field.data_type(), DataType::Struct(_)) {
+ if let Some(typed_value) = target.typed_value_field() {
+ let types_match = typed_value.data_type() == as_field.data_type();
+ let value_not_present = target.value_field().is_none();
+ // this is a perfect shredding, where the value is entirely
shredded out, so we can just return the typed value
+ // note that we MUST check value_not_present, because some of the
`typed_value` might be null but data is present in the `value` column.
+ // an alternative is to count whether typed_value has any
non-nulls, or check every row in `value` is null,
+ // but this is too complicated and might be slow.
Review Comment:
I'm pretty sure
[NullBuffer::null_count](https://docs.rs/arrow-buffer/57.0.0/src/arrow_buffer/buffer/null.rs.html#144)
is a stored (not computed) value, so:
```rust
let value_not_present = target.value_field().is_none_or(|v| v.null_count()
== v.len());
```
Also: Double check line wrapping here? (seems like a lot more than 100 chars)
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+ array: &ArrayRef,
+ parent_nulls: Option<&NullBuffer>,
+) -> Result<ArrayRef> {
+ let Some(parent_nulls) = parent_nulls else {
+ return Ok(array.clone());
+ };
+ if parent_nulls.null_count() == 0 {
+ return Ok(array.clone());
+ }
+
+ let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
I don't think we need to actually combine nulls?
For all types other than variant object (which we explicitly skip), the
shredding spec requires exactly one of `value` or `typed_value` to be non-NULL
in each row, with SQL NULL values encoded as `Variant::Null` in the `value`
column.
=> The presence of any NULL values in a `typed_value` column breaks perfect
shredding (even if it's a true SQL NULL)
=> The child array must have null_count 0 if we reached this point
=> We can just apply the parent nulls to the child array and be done with it
Also: https://github.com/apache/arrow-rs/issues/6528 seems relevant here?
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+ array: &ArrayRef,
+ parent_nulls: Option<&NullBuffer>,
+) -> Result<ArrayRef> {
+ let Some(parent_nulls) = parent_nulls else {
+ return Ok(array.clone());
+ };
+ if parent_nulls.null_count() == 0 {
+ return Ok(array.clone());
+ }
+
+ let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
Meanwhile: This is a single-caller function with _very_ specific semantics.
Perhaps it's better to fold it in at the call site, where the necessary
semantics are more obvious? Especially if we're able to simplify it following
the above observation?
Something like:
```rust
if let Some(typed_value) = target.typed_value_field() {
let perfectly_shredded = typed_value.null_count() == 0;
let as_type = as_field.data_type();
if perfectly_shredded && can_cast_types(typed_value.data_type(),
as_type) {
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls)).build()?)
}
};
if typed_value.data_type() != as_type {
array = cast(array, as_type)?;
}
Ok(array)
}
}
```
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+ // Try to return the typed value directly when we have a perfect shredding
match.
+ if !matches!(as_field.data_type(), DataType::Struct(_)) {
+ if let Some(typed_value) = target.typed_value_field() {
+ let types_match = typed_value.data_type() == as_field.data_type();
+ let value_not_present = target.value_field().is_none();
+ // this is a perfect shredding, where the value is entirely
shredded out, so we can just return the typed value
+ // note that we MUST check value_not_present, because some of the
`typed_value` might be null but data is present in the `value` column.
+ // an alternative is to count whether typed_value has any
non-nulls, or check every row in `value` is null,
+ // but this is too complicated and might be slow.
Review Comment:
The expensive check to avoid is counting up `Variant::Null` values.
Unfortunately, NULL in a `typed_value` column signals a non-NULL `value`
(even if that value is `Variant::Null`), so a nullable column will usually not
shred perfectly even if all non-NULL values shred perfectly. We must fall back
to a variant builder that manually converts a `value` column full of
`Variant::Null` back to NULL.
##########
parquet-variant-compute/src/variant_get.rs:
##########
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+ // Try to return the typed value directly when we have a perfect shredding
match.
+ if !matches!(as_field.data_type(), DataType::Struct(_)) {
+ if let Some(typed_value) = target.typed_value_field() {
+ let types_match = typed_value.data_type() == as_field.data_type();
Review Comment:
AFAIK, the current code is correct unless we're willing to inject a
typecast.
Otherwise, the caller could end up with a different type than they
requested.
That said, casting does seem reasonable -- it would certainly be faster than
a variant builder, and the caller _did_ request a specific type.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]