Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
alamb commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3602523310 Thanks again everyone! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
alamb merged PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
alamb commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2560905183
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -3799,4 +3826,18 @@ mod test {
assert!(result.is_null(i));
}
}
+
+#[test]
+fn test_perfect_shredding_returns_same_arc_ptr() {
Review Comment:
can you also please add some other tests? Specifically, tests for:
1. The case when the shredded value has all nulls
2. The case when the shredded value has some nulls
3. The case when the shredded value is a Struct
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
alamb commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3576810356 > This test case failed: > > https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet-variant-compute/src/variant_get.rs#L3754-L3781 > > It is because the perfectly shredded array itself is not a valid arrow array. I'm not sure if this is a well-defined behavior, I didn't check carefully, but I feel it is `Time64MicrosecondArray`'s responsibility to make sure the its data is valid. > > What do you think @alamb @klion26 @friendlymatthew ? Yes, I agree with this assessment. We shouldn't be relying on `variant_get` to detect incorrectly shredded values. In my mind, I expect the cast options to apply when the source is not shredded (aka it is a dynamically typed variant) so "casting" is required as part of `variant_get` Therefore I think we should update the test to either use a non -shredded variant. Here is a test to do so: - https://github.com/apache/arrow-rs/pull/8921 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3568444849 > Sorry, what do you mean? If the types exactly match then there's nothing to convert and the options shouldn't matter? I was debating about whether to perform a sanity check before returning, here's the context: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3555090970 I agree that we should not perform any additional conversions, but currently there're two test cases failing it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3564657259 > I checked arrow's `cast_with_options`: > > https://github.com/apache/arrow-rs/blob/779e9bd2ee43d7d882782e6bf6a11ee0944af229/arrow-cast/src/cast/mod.rs#L751-L761 > > It doesn't check if the `from_type` is a valid data type regardless of `cast_options`. I think we should follow the same conventions where, as we take the same `cast_options` from user 🤔 Sorry, what do you mean? If the types exactly match then there's nothing to convert and the options shouldn't matter? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3564480279 I checked arrow's `cast_with_options`: https://github.com/apache/arrow-rs/blob/779e9bd2ee43d7d882782e6bf6a11ee0944af229/arrow-cast/src/cast/mod.rs#L751-L761 It doesn't check if the `from_type` is a valid data type regardless of `cast_options`. I think we should follow the same conventions where, as we take the same `cast_options` from user 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550759267
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
> I don't think we need to actually combine nulls?
For all types other than variant object (which we explicitly skip), the
shredding spec requires exactly one of value or typed_value to be non-NULL in
each row
Yes, nice catch!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550717845
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+// Try to return the typed value directly when we have a perfect shredding
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();
+let value_not_present = target.value_field().is_none();
+// this is a perfect shredding, where the value is entirely
shredded out, so we can just return the typed value
+// note that we MUST check value_not_present, because some of the
`typed_value` might be null but data is present in the `value` column.
+// an alternative is to count whether typed_value has any
non-nulls, or check every row in `value` is null,
+// but this is too complicated and might be slow.
Review Comment:
Thank you @scovich this makes sense
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550351917
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
aside: this would be a fantastic place for if-let chains...
```rust
if let Some(typed_value) = target.typed_value_field()
&& typed_value.null_count() == 0 /* perfectly shredded */
&& can_cast_types(typed_value.data_type(), as_field.data_type())
{
let mut array = match target.nulls() {
None => typed_value.clone(),
nulls => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(nulls.clone()).build()?)
}
};
if typed_value.data_type() != as_field.data_type() {
array = cast(array, as_field.data_type())?;
}
return Ok(array);
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
Meanwhile: This is a single-caller function with _very_ specific semantics.
Perhaps it's better to fold it in at the call site, where the necessary
semantics are more obvious? Especially if we're able to simplify it following
the above observation?
Something like:
```rust
if let Some(typed_value) = target.typed_value_field() {
let perfectly_shredded = typed_value.null_count() == 0;
let as_type = as_field.data_type();
if perfectly_shredded && can_cast_types(typed_value.data_type(),
as_type) {
let mut array = match target.nulls() {
None => typed_value.clone(),
nulls => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(nulls.clone()).build()?)
}
};
if typed_value.data_type() != as_type {
array = cast(array, as_type)?;
}
return Ok(array);
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550351917
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
aside: this would be a fantastic place for if-let chains...
```rust
if let Some(typed_value) = target.typed_value_field()
&& typed_value.null_count() == 0 /* perfectly shredded */
&& can_cast_types(typed_value.data_type(), as_field.data_type())
{
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls.clone())).build()?)
}
};
if typed_value.data_type() != as_field.data_type() {
array = cast(array, as_field.data_type())?;
}
return Ok(array);
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
Meanwhile: This is a single-caller function with _very_ specific semantics.
Perhaps it's better to fold it in at the call site, where the necessary
semantics are more obvious? Especially if we're able to simplify it following
the above observation?
Something like:
```rust
if let Some(typed_value) = target.typed_value_field() {
let perfectly_shredded = typed_value.null_count() == 0;
let as_type = as_field.data_type();
if perfectly_shredded && can_cast_types(typed_value.data_type(),
as_type) {
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls.clone())).build()?)
}
};
if typed_value.data_type() != as_type {
array = cast(array, as_type)?;
}
return Ok(array);
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550351917
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
aside: this would be a fantastic place for if-let chains...
```rust
if let Some(typed_value) = target.typed_value_field()
&& typed_value.null_count() == 0 /* perfectly shredded */
&& can_cast_types(typed_value.data_type(), as_field.data_type())
{
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls)).build()?)
}
};
if typed_value.data_type() != as_field.data_type() {
array = cast(array, as_field.data_type())?;
}
return Ok(array);
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
Meanwhile: This is a single-caller function with _very_ specific semantics.
Perhaps it's better to fold it in at the call site, where the necessary
semantics are more obvious? Especially if we're able to simplify it following
the above observation?
Something like:
```rust
if let Some(typed_value) = target.typed_value_field() {
let perfectly_shredded = typed_value.null_count() == 0;
let as_type = as_field.data_type();
if perfectly_shredded && can_cast_types(typed_value.data_type(),
as_type) {
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls)).build()?)
}
};
if typed_value.data_type() != as_type {
array = cast(array, as_type)?;
}
return Ok(array);
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
Meanwhile: This is a single-caller function with _very_ specific semantics.
Perhaps it's better to fold it in at the call site, where the necessary
semantics are more obvious? Especially if we're able to simplify it following
the above observation?
Something like:
```rust
if let Some(typed_value) = target.typed_value_field() {
let perfectly_shredded = typed_value.null_count() == 0;
let as_type = as_field.data_type();
if perfectly_shredded && can_cast_types(typed_value.data_type(),
as_type) {
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls)).build()?)
}
};
if typed_value.data_type() != as_type {
array = cast(array, as_type)?;
}
return Ok(array);
}
}
```
aside: this would be a fantastic place for if-let chains...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
Meanwhile: This is a single-caller function with _very_ specific semantics.
Perhaps it's better to fold it in at the call site, where the necessary
semantics are more obvious? Especially if we're able to simplify it following
the above observation?
Something like:
```rust
if let Some(typed_value) = target.typed_value_field() {
let perfectly_shredded = typed_value.null_count() == 0;
let as_type = as_field.data_type();
if perfectly_shredded && can_cast_types(typed_value.data_type(),
as_type) {
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls)).build()?)
}
};
if typed_value.data_type() != as_type {
array = cast(array, as_type)?;
}
return Ok(array);
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550215532
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+// Try to return the typed value directly when we have a perfect shredding
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();
+let value_not_present = target.value_field().is_none();
+// this is a perfect shredding, where the value is entirely
shredded out, so we can just return the typed value
+// note that we MUST check value_not_present, because some of the
`typed_value` might be null but data is present in the `value` column.
+// an alternative is to count whether typed_value has any
non-nulls, or check every row in `value` is null,
+// but this is too complicated and might be slow.
Review Comment:
I'm pretty sure
[NullBuffer::null_count](https://docs.rs/arrow-buffer/57.0.0/src/arrow_buffer/buffer/null.rs.html#144)
is a stored (not computed) value, so:
```rust
let value_not_present = target.value_field().is_none_or(|v| v.null_count()
== v.len());
```
Also: Double check line wrapping here? (seems like a lot more than 100 chars)
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
I don't think we need to actually combine nulls?
For all types other than variant object (which we explicitly skip), the
shredding spec requires exactly one of `value` or `typed_value` to be non-NULL
in each row, with SQL NULL values encoded as `Variant::Null` in the `value`
column.
=> The presence of any NULL values in a `typed_value` column breaks perfect
shredding (even if it's a true SQL NULL)
=> The child array must have null_count 0 if we reached this point
=> We can just apply the parent nulls to the child array and be done with it
Also: https://github.com/apache/arrow-rs/issues/6528 seems relevant here?
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
}
}
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(),
Some(parent_nulls));
Review Comment:
Meanwhile: This is a single-caller function with _very_ specific semantics.
Perhaps it's better to fold it in at the call site, where the necessary
semantics are more obvious? Especially if we're able to simplify it following
the above observation?
Something like:
```rust
if let Some(typed_value) = target.typed_value_field() {
let perfectly_shredded = typed_value.null_count() == 0;
let as_type = as_field.data_type();
if perfectly_shredded && can_cast_types(typed_value.data_type(),
as_type) {
let mut array = match target.nulls() {
None => typed_value.clone(),
Some(nulls) => {
let builder = array.to_data().into_builder();
make_array(builder.nulls(Some(nulls)).build()?)
}
};
if typed_value.data_type() != as_type {
array = cast(array, as_type)?;
}
Ok(array)
}
}
```
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+// Try
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
alamb commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3563675422 I have this on my radar to review; I am hoping to ahve some time to devote to variant stuff next week -- this week I have been occupied getting the 57.1.0 release ready -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3561393466 > This test here wants to cover the behavior of CastOptions; I think 1) we can't guarantee that the input is valid in `variant_get` here; 2) we'll do a type cast in `variant_get`(type_conversion.rs), and the cast may or may not succeed, and the `CastOptions::safe` controls how to handle suce cases. > > Returning to this test, we might be able to modify the input for `variant_get` here, but we may still need to ensure that different CastOptions are covered. Makes sense to me, thank you @klion26 , in that case I'll try to make the optimization only enabled for not safe cast case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
klion26 commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3558818347 This test here wants to cover the behavior of CastOptions; I think 1) we can't guarantee that the input is valid in `variant_get` here; 2) we'll do a type cast in `variant_get`(type_conversion.rs), and the cast may or may not succeed, and the `CastOptions::safe` controls how to handle suce cases. Returning to this test, we might be able to modify the input for `variant_get` here, but we may still need to ensure that different CastOptions are covered. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2543853730
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+// Try to return the typed value directly when we have a perfect shredding
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();
Review Comment:
this is probably overly restrictive, `Utf8`, `LargeUtf8`, and `Utf8View`
should be allowed. Maybe check if cast-able.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on PR #8887: URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3555090970 This test case failed: https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet-variant-compute/src/variant_get.rs#L3754-L3781 It is because the perfectly shredded array itself is not a valid arrow array. I'm not sure if this is a well-defined behavior, I didn't check carefully, but I feel it is `Time64MicrosecondArray`'s responsibility to make sure the its data is valid. What do you think @alamb @klion26 @friendlymatthew ? My gut feeling is that this data integrity checking during read time can be very expensive; for example, validating utf8 every time we read a string array can be extremely slow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]
XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2543853730
##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
return Ok(ArrayRef::from(target));
};
+// Try to return the typed value directly when we have a perfect shredding
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();
Review Comment:
Is this probably overly restrictive, `Utf8`, `LargeUtf8`, and `Utf8View`
should be allowed. Maybe check if cast-able.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
