Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-12-02 Thread via GitHub


alamb commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3602523310

   Thanks again everyone!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-12-02 Thread via GitHub


alamb merged PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-25 Thread via GitHub


alamb commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2560905183


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -3799,4 +3826,18 @@ mod test {
 assert!(result.is_null(i));
 }
 }
+
+#[test]
+fn test_perfect_shredding_returns_same_arc_ptr() {

Review Comment:
   can you also please add some other tests? Specifically, tests for:
   1. The case when the shredded value has all nulls
   2. The case when the shredded value has some nulls
   3. The case when the shredded value is a Struct



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-25 Thread via GitHub


alamb commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3576810356

   > This test case failed:
   > 
   > 
https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet-variant-compute/src/variant_get.rs#L3754-L3781
   > 
   > It is because the perfectly shredded array itself is not a valid arrow 
array. I'm not sure if this is a well-defined behavior, I didn't check 
carefully, but I feel it is `Time64MicrosecondArray`'s responsibility to make 
sure the its data is valid.
   > 
   > What do you think @alamb @klion26 @friendlymatthew ?
   
   Yes, I agree with this assessment. We shouldn't be relying on `variant_get` 
to detect incorrectly shredded values.
   
   In my mind, I expect the cast options to apply when the source is not 
shredded (aka it is a dynamically typed variant) so "casting" is required as 
part of `variant_get`
   
   Therefore I think we should update the test to either use a non -shredded 
variant. Here is a test to do so:
   - https://github.com/apache/arrow-rs/pull/8921
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-23 Thread via GitHub


XiangpengHao commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3568444849

   > Sorry, what do you mean? If the types exactly match then there's nothing 
to convert and the options shouldn't matter?
   
   I was debating about whether to perform a sanity check before returning, 
here's the context: 
https://github.com/apache/arrow-rs/pull/8887#issuecomment-3555090970
   
   I agree that we should not perform any additional conversions, but currently 
there're two test cases failing it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3564657259

   > I checked arrow's `cast_with_options`:
   > 
   > 
https://github.com/apache/arrow-rs/blob/779e9bd2ee43d7d882782e6bf6a11ee0944af229/arrow-cast/src/cast/mod.rs#L751-L761
   > 
   > It doesn't check if the `from_type` is a valid data type regardless of 
`cast_options`. I think we should follow the same conventions where, as we take 
the same `cast_options` from user 🤔
   
   Sorry, what do you mean? If the types exactly match then there's nothing to 
convert and the options shouldn't matter?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


XiangpengHao commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3564480279

   I checked arrow's `cast_with_options`:
   
https://github.com/apache/arrow-rs/blob/779e9bd2ee43d7d882782e6bf6a11ee0944af229/arrow-cast/src/cast/mod.rs#L751-L761
   
   It doesn't check if the `from_type` is a valid data type regardless of 
`cast_options`. I think we should follow the same conventions where, as we take 
the same `cast_options` from user 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550759267


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   > I don't think we need to actually combine nulls?
For all types other than variant object (which we explicitly skip), the 
shredding spec requires exactly one of value or typed_value to be non-NULL in 
each row
   
   Yes, nice catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550717845


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
 return Ok(ArrayRef::from(target));
 };
 
+// Try to return the typed value directly when we have a perfect shredding 
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();
+let value_not_present = target.value_field().is_none();
+// this is a perfect shredding, where the value is entirely 
shredded out, so we can just return the typed value
+// note that we MUST check value_not_present, because some of the 
`typed_value` might be null but data is present in the `value` column.
+// an alternative is to count whether typed_value has any 
non-nulls, or check every row in `value` is null,
+// but this is too complicated and might be slow.

Review Comment:
   Thank you @scovich  this makes sense 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550351917


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   aside: this would be a fantastic place for if-let chains...
   
   
   ```rust
   if let Some(typed_value) = target.typed_value_field()
   && typed_value.null_count() == 0 /* perfectly shredded */
   && can_cast_types(typed_value.data_type(), as_field.data_type()) 
   {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   nulls => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(nulls.clone()).build()?)
   }
   };
   if typed_value.data_type() != as_field.data_type() {
   array = cast(array, as_field.data_type())?;
   }
   return Ok(array);
   }
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   Meanwhile: This is a single-caller function with _very_ specific semantics. 
Perhaps it's better to fold it in at the call site, where the necessary 
semantics are more obvious? Especially if we're able to simplify it following 
the above observation?
   
   Something like:
   ```rust
   if let Some(typed_value) = target.typed_value_field() {
   let perfectly_shredded = typed_value.null_count() == 0;
   let as_type = as_field.data_type();
   if perfectly_shredded && can_cast_types(typed_value.data_type(), 
as_type) {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   nulls => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(nulls.clone()).build()?)
   }
   };
   if typed_value.data_type() != as_type {
   array = cast(array, as_type)?;
   }
   return Ok(array);
   }
   }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550351917


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   aside: this would be a fantastic place for if-let chains...
   
   
   ```rust
   if let Some(typed_value) = target.typed_value_field()
   && typed_value.null_count() == 0 /* perfectly shredded */
   && can_cast_types(typed_value.data_type(), as_field.data_type()) 
   {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   Some(nulls) => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(Some(nulls.clone())).build()?)
   }
   };
   if typed_value.data_type() != as_field.data_type() {
   array = cast(array, as_field.data_type())?;
   }
   return Ok(array);
   }
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   Meanwhile: This is a single-caller function with _very_ specific semantics. 
Perhaps it's better to fold it in at the call site, where the necessary 
semantics are more obvious? Especially if we're able to simplify it following 
the above observation?
   
   Something like:
   ```rust
   if let Some(typed_value) = target.typed_value_field() {
   let perfectly_shredded = typed_value.null_count() == 0;
   let as_type = as_field.data_type();
   if perfectly_shredded && can_cast_types(typed_value.data_type(), 
as_type) {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   Some(nulls) => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(Some(nulls.clone())).build()?)
   }
   };
   if typed_value.data_type() != as_type {
   array = cast(array, as_type)?;
   }
   return Ok(array);
   }
   }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550351917


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   aside: this would be a fantastic place for if-let chains...
   
   
   ```rust
   if let Some(typed_value) = target.typed_value_field()
   && typed_value.null_count() == 0 /* perfectly shredded */
   && can_cast_types(typed_value.data_type(), as_field.data_type()) 
   {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   Some(nulls) => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(Some(nulls)).build()?)
   }
   };
   if typed_value.data_type() != as_field.data_type() {
   array = cast(array, as_field.data_type())?;
   }
   return Ok(array);
   }
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   Meanwhile: This is a single-caller function with _very_ specific semantics. 
Perhaps it's better to fold it in at the call site, where the necessary 
semantics are more obvious? Especially if we're able to simplify it following 
the above observation?
   
   Something like:
   ```rust
   if let Some(typed_value) = target.typed_value_field() {
   let perfectly_shredded = typed_value.null_count() == 0;
   let as_type = as_field.data_type();
   if perfectly_shredded && can_cast_types(typed_value.data_type(), 
as_type) {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   Some(nulls) => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(Some(nulls)).build()?)
   }
   };
   if typed_value.data_type() != as_type {
   array = cast(array, as_type)?;
   }
   return Ok(array);
   }
   }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   Meanwhile: This is a single-caller function with _very_ specific semantics. 
Perhaps it's better to fold it in at the call site, where the necessary 
semantics are more obvious? Especially if we're able to simplify it following 
the above observation?
   
   Something like:
   ```rust
   if let Some(typed_value) = target.typed_value_field() {
   let perfectly_shredded = typed_value.null_count() == 0;
   let as_type = as_field.data_type();
   if perfectly_shredded && can_cast_types(typed_value.data_type(), 
as_type) {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   Some(nulls) => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(Some(nulls)).build()?)
   }
   };
   if typed_value.data_type() != as_type {
   array = cast(array, as_type)?;
   }
   return Ok(array);
   }
   }
   ```
   aside: this would be a fantastic place for if-let chains...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550298276


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   Meanwhile: This is a single-caller function with _very_ specific semantics. 
Perhaps it's better to fold it in at the call site, where the necessary 
semantics are more obvious? Especially if we're able to simplify it following 
the above observation?
   
   Something like:
   ```rust
   if let Some(typed_value) = target.typed_value_field() {
   let perfectly_shredded = typed_value.null_count() == 0;
   let as_type = as_field.data_type();
   if perfectly_shredded && can_cast_types(typed_value.data_type(), 
as_type) {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   Some(nulls) => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(Some(nulls)).build()?)
   }
   };
   if typed_value.data_type() != as_type {
   array = cast(array, as_type)?;
   }
   return Ok(array);
   }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


scovich commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2550215532


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
 return Ok(ArrayRef::from(target));
 };
 
+// Try to return the typed value directly when we have a perfect shredding 
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();
+let value_not_present = target.value_field().is_none();
+// this is a perfect shredding, where the value is entirely 
shredded out, so we can just return the typed value
+// note that we MUST check value_not_present, because some of the 
`typed_value` might be null but data is present in the `value` column.
+// an alternative is to count whether typed_value has any 
non-nulls, or check every row in `value` is null,
+// but this is too complicated and might be slow.

Review Comment:
   I'm pretty sure 
[NullBuffer::null_count](https://docs.rs/arrow-buffer/57.0.0/src/arrow_buffer/buffer/null.rs.html#144)
 is a stored (not computed) value, so:
   ```rust
   let value_not_present = target.value_field().is_none_or(|v| v.null_count() 
== v.len());
   ```
   
   Also: Double check line wrapping here? (seems like a lot more than 100 chars)



##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   I don't think we need to actually combine nulls? 
   
   For all types other than variant object (which we explicitly skip), the 
shredding spec requires exactly one of `value` or `typed_value` to be non-NULL 
in each row, with SQL NULL values encoded as `Variant::Null` in the `value` 
column. 
   
   => The presence of any NULL values in a `typed_value` column breaks perfect 
shredding (even if it's a true SQL NULL)
   => The child array must have null_count 0 if we reached this point
   => We can just apply the parent nulls to the child array and be done with it
   
   Also: https://github.com/apache/arrow-rs/issues/6528 seems relevant here?



##
parquet-variant-compute/src/variant_get.rs:
##
@@ -109,6 +110,30 @@ pub(crate) fn follow_shredded_path_element<'a>(
 }
 }
 
+/// Returns a cloned `ArrayRef` whose null mask is the union of the array's 
existing mask and
+/// `parent_nulls`. If `parent_nulls` is `None` or contains no nulls, the 
original array is returned.
+///
+/// This necessary because the null of the shredded value is the union of 
parent null and current nulls.
+fn clone_with_parent_nulls(
+array: &ArrayRef,
+parent_nulls: Option<&NullBuffer>,
+) -> Result {
+let Some(parent_nulls) = parent_nulls else {
+return Ok(array.clone());
+};
+if parent_nulls.null_count() == 0 {
+return Ok(array.clone());
+}
+
+let combined_nulls = NullBuffer::union(array.as_ref().nulls(), 
Some(parent_nulls));

Review Comment:
   Meanwhile: This is a single-caller function with _very_ specific semantics. 
Perhaps it's better to fold it in at the call site, where the necessary 
semantics are more obvious? Especially if we're able to simplify it following 
the above observation?
   
   Something like:
   ```rust
   if let Some(typed_value) = target.typed_value_field() {
   let perfectly_shredded = typed_value.null_count() == 0;
   let as_type = as_field.data_type();
   if perfectly_shredded && can_cast_types(typed_value.data_type(), 
as_type) {
   let mut array = match target.nulls() {
   None => typed_value.clone(),
   Some(nulls) => {
   let builder = array.to_data().into_builder();
   make_array(builder.nulls(Some(nulls)).build()?)
   }
   };
   if typed_value.data_type() != as_type {
   array = cast(array, as_type)?;
   }
   Ok(array)
   }
   }
   ```



##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
 return Ok(ArrayRef::from(target));
 };
 
+// Try

Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-21 Thread via GitHub


alamb commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3563675422

   I have this on my radar to review; I am hoping to ahve some time to devote 
to variant stuff next week -- this week I have been occupied getting the 57.1.0 
release ready


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-20 Thread via GitHub


XiangpengHao commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3561393466

   > This test here wants to cover the behavior of CastOptions; I think 1) we 
can't guarantee that the input is valid in `variant_get` here; 2) we'll do a 
type cast in `variant_get`(type_conversion.rs), and the cast may or may not 
succeed, and the `CastOptions::safe` controls how to handle suce cases.
   > 
   > Returning to this test, we might be able to modify the input for 
`variant_get` here, but we may still need to ensure that different CastOptions 
are covered.
   
   Makes sense to me, thank you @klion26 , in that case I'll try to make the 
optimization only enabled for not safe cast case


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-20 Thread via GitHub


klion26 commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3558818347

   This test here wants to cover the behavior of CastOptions; I think 1) we 
can't guarantee that the input is valid in `variant_get` here; 2) we'll do a 
type cast in `variant_get`(type_conversion.rs), and the cast may or may not 
succeed, and the `CastOptions::safe` controls how to handle suce cases.
   
   Returning to this test, we might be able to modify the input for 
`variant_get` here, but we may still need to ensure that different CastOptions 
are covered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-20 Thread via GitHub


XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2543853730


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
 return Ok(ArrayRef::from(target));
 };
 
+// Try to return the typed value directly when we have a perfect shredding 
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();

Review Comment:
this is probably overly restrictive, `Utf8`, `LargeUtf8`, and `Utf8View` 
should be allowed. Maybe check if cast-able.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-19 Thread via GitHub


XiangpengHao commented on PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#issuecomment-3555090970

   This test case failed:
   
https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet-variant-compute/src/variant_get.rs#L3754-L3781
   
   It is because the perfectly shredded array itself is not a valid arrow 
array. I'm not sure if this is a well-defined behavior, I didn't check 
carefully, but I feel it is `Time64MicrosecondArray`'s responsibility to make 
sure the its data is valid.
   
   What do you think @alamb @klion26 @friendlymatthew ?
   
   My gut feeling is that this data integrity checking during read time can be 
very expensive; for example, validating utf8 every time we read a string array 
can be extremely slow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] [Variant] Improve `variant_get` performance on a perfect shredding [arrow-rs]

2025-11-19 Thread via GitHub


XiangpengHao commented on code in PR #8887:
URL: https://github.com/apache/arrow-rs/pull/8887#discussion_r2543853730


##
parquet-variant-compute/src/variant_get.rs:
##
@@ -208,6 +233,21 @@ fn shredded_get_path(
 return Ok(ArrayRef::from(target));
 };
 
+// Try to return the typed value directly when we have a perfect shredding 
match.
+if !matches!(as_field.data_type(), DataType::Struct(_)) {
+if let Some(typed_value) = target.typed_value_field() {
+let types_match = typed_value.data_type() == as_field.data_type();

Review Comment:
   Is this probably overly restrictive, `Utf8`, `LargeUtf8`, and `Utf8View` 
should be allowed. Maybe check if cast-able.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]