Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2190979500
##
parquet-variant/src/variant.rs:
##
@@ -382,21 +385,23 @@ impl<'m, 'v> Variant<'m, 'v> {
VariantBasicType::ShortString => {
Variant::ShortString(decoder::decode_short_string(value_metadata, value_data)?)
}
-VariantBasicType::Object => {
-Variant::Object(VariantObject::try_new_impl(metadata, value)?)
-}
-VariantBasicType::Array =>
Variant::List(VariantList::try_new_impl(metadata, value)?),
+VariantBasicType::Object => Variant::Object(
+VariantObject::try_new_with_shallow_validation(metadata,
value)?,
+),
+VariantBasicType::Array =>
Variant::List(VariantList::try_new_with_shallow_validation(
+metadata, value,
Review Comment:
I think there must have been a `fmt` rules update, because I'm also seeing
this now locally.
Which IMO is a really nice step forward for balancing readability vs,
newline bloat 🎉
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2190979500
##
parquet-variant/src/variant.rs:
##
@@ -382,21 +385,23 @@ impl<'m, 'v> Variant<'m, 'v> {
VariantBasicType::ShortString => {
Variant::ShortString(decoder::decode_short_string(value_metadata, value_data)?)
}
-VariantBasicType::Object => {
-Variant::Object(VariantObject::try_new_impl(metadata, value)?)
-}
-VariantBasicType::Array =>
Variant::List(VariantList::try_new_impl(metadata, value)?),
+VariantBasicType::Object => Variant::Object(
+VariantObject::try_new_with_shallow_validation(metadata,
value)?,
+),
+VariantBasicType::Array =>
Variant::List(VariantList::try_new_with_shallow_validation(
+metadata, value,
Review Comment:
I think there must have been a `fmt` rules update, because I'm also seeing
this now locally.
Which IMO is a really nice step forward for balancing readability vs,
newline boat.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
alamb commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2190751949
##
parquet-variant/src/variant/list.rs:
##
@@ -232,52 +232,57 @@ impl<'m, 'v> VariantList<'m, 'v> {
/// [invalid]: Self#Validation
pub fn get(&self, index: usize) -> Option> {
(index < self.num_elements).then(|| {
-self.try_get_impl(index)
-.and_then(Variant::validate)
+self.try_get_with_shallow_validation(index)
+.and_then(Variant::with_deep_validation)
.expect("Invalid variant array element")
})
}
/// Fallible version of `get`. Returns element by index, capturing
validation errors
pub fn try_get(&self, index: usize) -> Result, ArrowError>
{
-self.try_get_impl(index)?.validate()
+self.try_get_with_shallow_validation(index)?
+.with_deep_validation()
}
-/// Fallible iteration over the elements of this list.
-pub fn iter_try(&self) -> impl Iterator,
ArrowError>> + '_ {
-self.iter_try_impl().map(|result| result?.validate())
+// Fallible version of `get`, performing only basic (constant-time)
validation.
+fn try_get_with_shallow_validation(&self, index: usize) ->
Result, ArrowError> {
+// Fetch the value bytes between the two offsets for this index, from
the value array region
+// of the byte buffer
+let byte_range = self.get_offset(index)?..self.get_offset(index + 1)?;
+let value_bytes =
+slice_from_slice_at_offset(self.value, self.first_value_byte,
byte_range)?;
+
+Variant::try_new_with_metadata_and_shallow_validation(self.metadata,
value_bytes)
}
-// Fallible iteration that only performs basic (constant-time) validation.
-fn iter_try_impl(&self) -> impl Iterator,
ArrowError>> + '_ {
-(0..self.len()).map(move |i| self.try_get_impl(i))
+/// Fallible iteration over the elements of this list.
+pub fn try_iter(&self) -> impl Iterator,
ArrowError>> + '_ {
+self.try_iter_with_shallow_validation()
+.map(|result| result?.with_deep_validation())
}
/// Iterates over the values of this list. When working with [unvalidated]
input, consider
-/// [`Self::iter_try`] to avoid panics due to invalid data.
+/// [`Self::try_iter`] to avoid panics due to invalid data.
///
/// [unvalidated]: Self#Validation
pub fn iter(&self) -> impl Iterator> + '_ {
-self.iter_try_impl()
+self.try_iter_with_shallow_validation()
.map(|result| result.expect("Invalid variant list entry"))
}
+// Fallible iteration that only performs basic (constant-time) validation.
+fn try_iter_with_shallow_validation(
+&self,
+) -> impl Iterator, ArrowError>> + '_ {
+(0..self.len()).map(move |i| self.try_get_with_shallow_validation(i))
+}
+
// Attempts to retrieve the ith offset from the offset array region of the
byte buffer.
fn get_offset(&self, index: usize) -> Result {
let byte_range =
self.header.first_offset_byte()..self.first_value_byte;
let offset_bytes = slice_from_slice(self.value, byte_range)?;
self.header.offset_size.unpack_usize(offset_bytes, index)
}
-
-// Fallible version of `get`, performing only basic (constant-time)
validation.
-fn try_get_impl(&self, index: usize) -> Result,
ArrowError> {
Review Comment:
I 100% agree with @scovich here
Also, If you have a commit that just moves code around, please feel free to
put it up and tag me aggressively. I'll try and get it merged quicky (as
@scovich says, they are very easy to review)
##
parquet-variant/src/variant.rs:
##
@@ -382,21 +385,23 @@ impl<'m, 'v> Variant<'m, 'v> {
VariantBasicType::ShortString => {
Variant::ShortString(decoder::decode_short_string(value_metadata, value_data)?)
}
-VariantBasicType::Object => {
-Variant::Object(VariantObject::try_new_impl(metadata, value)?)
-}
-VariantBasicType::Array =>
Variant::List(VariantList::try_new_impl(metadata, value)?),
+VariantBasicType::Object => Variant::Object(
+VariantObject::try_new_with_shallow_validation(metadata,
value)?,
+),
+VariantBasicType::Array =>
Variant::List(VariantList::try_new_with_shallow_validation(
+metadata, value,
Review Comment:
yeah I don't know -- `cargo fmt` doesn't reformat this line for me locally
(or on CI) 🤷 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
alamb merged PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
alamb commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2190748273
##
parquet-variant/src/variant.rs:
##
@@ -297,8 +297,10 @@ impl<'m, 'v> Variant<'m, 'v> {
///
/// [unvalidated]: Self#Validation
pub fn new(metadata: &'m [u8], value: &'v [u8]) -> Self {
-let metadata = VariantMetadata::try_new_impl(metadata).expect("Invalid
variant metadata");
-Self::try_new_with_metadata_impl(metadata, value).expect("Invalid
variant data")
+let metadata =
VariantMetadata::try_new_with_shallow_validation(metadata)
Review Comment:
This is a very nice name change
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3040618000 > > > ... at least, confident we caught all the ones in code paths this specific unit test exercises. But it's tricky -- I tried to make a visual sweep of all `XXX_with_shallow_validation` methods and didn't notice anything else. But then I re-ran the unit test "benchmark" and saw that it was still oddly slow. Found the third missed site during a second, more motivated, code inspection pass. > > > > > > Can we make your test into an actual benchmark (aka something like https://github.com/apache/arrow-rs/blob/main/parquet-variant/benches/variant_builder.rs )? > > Hi, I have an ad-hoc benchmark that lives [here](https://github.com/apache/arrow-rs/compare/main...pydantic:arrow-rs:friendlymatthew/bench-validation-large-json). It uses the same logic in `test_json_to_variant_object_very_large` to build up the object. > > The validation logic on main regresses quite significantly when compared to the validation logic in this PR > > https://private-user-images.githubusercontent.com/38759997/462853809-62515dbc-4bc9-4541-9b36-b22f0089fadb.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTE3NzAzNDgsIm5iZiI6MTc1MTc3MDA0OCwicGF0aCI6Ii8zODc1OTk5Ny80NjI4NTM4MDktNjI1MTVkYmMtNGJjOS00NTQxLTliMzYtYjIyZjAwODlmYWRiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA3MDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwNzA2VDAyNDcyOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTcxZTk1MWUxNjQ5NjhkNjY4NzZmMzM2OTUxZjEzMDVhZmVhN2MzN2EzOWVkNjgzODIzNzczZGI1MjdhZWM1ZGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.rUlGeKZriWYw9IcPYnPt6GCvPtrgyTzzORXZlL6H0JE";> We caught another bug where this time it was performing only a shallow validation from a method that wanted full validation. As @scovich points out, it'll slow this PR down because it now performs the linear checks that it guarantees. Still, we see the validation logic in this PR perform faster than the logic on main. A bit unfortunate but still a worthy improvement IMO. https://github.com/user-attachments/assets/c69fb371-63a1-4575-ab66-7afeb775be0b"; /> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187902897
##
parquet-variant/src/variant/object.rs:
##
@@ -197,22 +197,22 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// True if this instance is fully [validated] for panic-free infallible
accesses.
///
/// [validated]: Self#Validation
-pub fn is_validated(&self) -> bool {
+pub fn is_fully_validated(&self) -> bool {
self.validated
}
/// Performs a full [validation] of this variant object.
///
/// [validation]: Self#Validation
-pub fn validate(mut self) -> Result {
+pub fn with_full_validation(mut self) -> Result {
if !self.validated {
// Validate the metadata dictionary first, if not already
validated, because we pass it
// by value to all the children (who would otherwise re-validate
it repeatedly).
-self.metadata = self.metadata.validate()?;
+self.metadata = self.metadata.with_full_validation()?;
// Iterate over all string keys in this dictionary in order to
prove that the offset
// array is valid, all offsets are in bounds, and all string bytes
are valid utf-8.
-validate_fallible_iterator(self.iter_try_impl())?;
+
validate_fallible_iterator(self.iter_try_with_shallow_validation())?;
self.validated = true;
}
Ok(self)
Review Comment:
Yeah, that is unfortunate but the change makes sense to me. I'm going to
spend some time thinking how to make full validation faster
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187902897
##
parquet-variant/src/variant/object.rs:
##
@@ -197,22 +197,22 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// True if this instance is fully [validated] for panic-free infallible
accesses.
///
/// [validated]: Self#Validation
-pub fn is_validated(&self) -> bool {
+pub fn is_fully_validated(&self) -> bool {
self.validated
}
/// Performs a full [validation] of this variant object.
///
/// [validation]: Self#Validation
-pub fn validate(mut self) -> Result {
+pub fn with_full_validation(mut self) -> Result {
if !self.validated {
// Validate the metadata dictionary first, if not already
validated, because we pass it
// by value to all the children (who would otherwise re-validate
it repeatedly).
-self.metadata = self.metadata.validate()?;
+self.metadata = self.metadata.with_full_validation()?;
// Iterate over all string keys in this dictionary in order to
prove that the offset
// array is valid, all offsets are in bounds, and all string bytes
are valid utf-8.
-validate_fallible_iterator(self.iter_try_impl())?;
+
validate_fallible_iterator(self.iter_try_with_shallow_validation())?;
self.validated = true;
}
Ok(self)
Review Comment:
Makes sense to me. I'm going to spend some time thinking how to make full
validation faster
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187902028
##
parquet-variant/src/variant/object.rs:
##
@@ -197,22 +197,22 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// True if this instance is fully [validated] for panic-free infallible
accesses.
///
/// [validated]: Self#Validation
-pub fn is_validated(&self) -> bool {
+pub fn is_fully_validated(&self) -> bool {
self.validated
}
/// Performs a full [validation] of this variant object.
///
/// [validation]: Self#Validation
-pub fn validate(mut self) -> Result {
+pub fn with_full_validation(mut self) -> Result {
if !self.validated {
// Validate the metadata dictionary first, if not already
validated, because we pass it
// by value to all the children (who would otherwise re-validate
it repeatedly).
-self.metadata = self.metadata.validate()?;
+self.metadata = self.metadata.with_full_validation()?;
// Iterate over all string keys in this dictionary in order to
prove that the offset
// array is valid, all offsets are in bounds, and all string bytes
are valid utf-8.
-validate_fallible_iterator(self.iter_try_impl())?;
+
validate_fallible_iterator(self.iter_try_with_shallow_validation())?;
self.validated = true;
}
Ok(self)
Review Comment:
NOTE: That will probably slow things down again (because currently we're
_not_ performing the validation we claimed). But that's fine as long as it's
linear (and reasonable) cost.
##
parquet-variant/src/variant/object.rs:
##
@@ -197,22 +197,22 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// True if this instance is fully [validated] for panic-free infallible
accesses.
///
/// [validated]: Self#Validation
-pub fn is_validated(&self) -> bool {
+pub fn is_fully_validated(&self) -> bool {
self.validated
}
/// Performs a full [validation] of this variant object.
///
/// [validation]: Self#Validation
-pub fn validate(mut self) -> Result {
+pub fn with_full_validation(mut self) -> Result {
if !self.validated {
// Validate the metadata dictionary first, if not already
validated, because we pass it
// by value to all the children (who would otherwise re-validate
it repeatedly).
-self.metadata = self.metadata.validate()?;
+self.metadata = self.metadata.with_full_validation()?;
// Iterate over all string keys in this dictionary in order to
prove that the offset
// array is valid, all offsets are in bounds, and all string bytes
are valid utf-8.
-validate_fallible_iterator(self.iter_try_impl())?;
+
validate_fallible_iterator(self.iter_try_with_shallow_validation())?;
self.validated = true;
}
Ok(self)
Review Comment:
NOTE: Fixing that will probably slow things down again (because currently
we're _not_ performing the validation we claimed). But that's fine as long as
it's linear (and reasonable) cost.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187901628
##
parquet-variant/src/variant/object.rs:
##
@@ -197,22 +197,22 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// True if this instance is fully [validated] for panic-free infallible
accesses.
///
/// [validated]: Self#Validation
-pub fn is_validated(&self) -> bool {
+pub fn is_fully_validated(&self) -> bool {
self.validated
}
/// Performs a full [validation] of this variant object.
///
/// [validation]: Self#Validation
-pub fn validate(mut self) -> Result {
+pub fn with_full_validation(mut self) -> Result {
if !self.validated {
// Validate the metadata dictionary first, if not already
validated, because we pass it
// by value to all the children (who would otherwise re-validate
it repeatedly).
-self.metadata = self.metadata.validate()?;
+self.metadata = self.metadata.with_full_validation()?;
// Iterate over all string keys in this dictionary in order to
prove that the offset
// array is valid, all offsets are in bounds, and all string bytes
are valid utf-8.
-validate_fallible_iterator(self.iter_try_impl())?;
+
validate_fallible_iterator(self.iter_try_with_shallow_validation())?;
self.validated = true;
}
Ok(self)
Review Comment:
Ouch, yes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187885758
##
parquet-variant/src/variant/object.rs:
##
@@ -197,22 +197,22 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// True if this instance is fully [validated] for panic-free infallible
accesses.
///
/// [validated]: Self#Validation
-pub fn is_validated(&self) -> bool {
+pub fn is_fully_validated(&self) -> bool {
self.validated
}
/// Performs a full [validation] of this variant object.
///
/// [validation]: Self#Validation
-pub fn validate(mut self) -> Result {
+pub fn with_full_validation(mut self) -> Result {
if !self.validated {
// Validate the metadata dictionary first, if not already
validated, because we pass it
// by value to all the children (who would otherwise re-validate
it repeatedly).
-self.metadata = self.metadata.validate()?;
+self.metadata = self.metadata.with_full_validation()?;
// Iterate over all string keys in this dictionary in order to
prove that the offset
// array is valid, all offsets are in bounds, and all string bytes
are valid utf-8.
-validate_fallible_iterator(self.iter_try_impl())?;
+
validate_fallible_iterator(self.iter_try_with_shallow_validation())?;
self.validated = true;
}
Ok(self)
Review Comment:
Oof, I spot another bug here. When we go to perform `full_validation` on a
variant object, we later call `self.iter_try_with_shallow_validation())`. I'm
pretty sure we should be calling `iter_try` here.
@scovich
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3040521102 > > ... at least, confident we caught all the ones in code paths this specific unit test exercises. But it's tricky -- I tried to make a visual sweep of all `XXX_with_shallow_validation` methods and didn't notice anything else. But then I re-ran the unit test "benchmark" and saw that it was still oddly slow. Found the third missed site during a second, more motivated, code inspection pass. > > Can we make your test into an actual benchmark (aka something like https://github.com/apache/arrow-rs/blob/main/parquet-variant/benches/variant_builder.rs )? Hi, I have an ad-hoc benchmark that lives [here](https://github.com/apache/arrow-rs/compare/main...pydantic:arrow-rs:friendlymatthew/bench-validation-large-json). It uses the same logic in `test_json_to_variant_object_very_large` to build up the object. The validation logic on main regresses quite significantly when compared to the validation logic in this PR https://github.com/user-attachments/assets/62515dbc-4bc9-4541-9b36-b22f0089fadb"; /> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187811341
##
parquet-variant/src/variant/list.rs:
##
@@ -232,19 +232,20 @@ impl<'m, 'v> VariantList<'m, 'v> {
/// [invalid]: Self#Validation
pub fn get(&self, index: usize) -> Option> {
(index < self.num_elements).then(|| {
-self.try_get_impl(index)
-.and_then(Variant::validate)
+self.try_get_with_shallow_validation(index)
+.and_then(Variant::with_full_validation)
Review Comment:
Good catch 💯
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187685092
##
parquet-variant/src/variant/list.rs:
##
@@ -232,19 +232,20 @@ impl<'m, 'v> VariantList<'m, 'v> {
/// [invalid]: Self#Validation
pub fn get(&self, index: usize) -> Option> {
(index < self.num_elements).then(|| {
-self.try_get_impl(index)
-.and_then(Variant::validate)
+self.try_get_with_shallow_validation(index)
+.and_then(Variant::with_full_validation)
Review Comment:
Do we actually want full validation here? I would expect `get` to perform
shallow validation only, and panic if that failed -- matching the behavior of
`iter` below.
##
parquet-variant/src/variant/object.rs:
##
@@ -281,26 +285,31 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// Returns an iterator of (name, value) pairs over the fields of this
object.
pub fn iter(&self) -> impl Iterator)> +
'_ {
-self.iter_try_impl()
+self.iter_try_with_shallow_validation()
.map(|result| result.expect("Invalid variant object field value"))
}
/// Fallible iteration over the fields of this object.
pub fn iter_try(
&self,
) -> impl Iterator), ArrowError>>
+ '_ {
-self.iter_try_impl().map(|result| {
+self.iter_try_with_shallow_validation().map(|result| {
let (name, value) = result?;
-Ok((name, value.validate()?))
+Ok((name, value.with_full_validation()?))
})
}
// Fallible iteration over the fields of this object that performs only
shallow (constant-cost)
// validation of field values.
-fn iter_try_impl(
+fn iter_try_with_shallow_validation(
&self,
) -> impl Iterator), ArrowError>>
+ '_ {
-(0..self.num_elements).map(move |i| Ok((self.try_field_name(i)?,
self.try_field(i)?)))
+(0..self.num_elements).map(move |i| {
+Ok((
+self.try_field_name(i)?,
+self.try_field_with_shallow_validation(i)?,
+))
Review Comment:
```suggestion
let field = self.try_field_with_shallow_validation(i)?;
Ok((self.try_field_name(i)?, field))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3039112429 > note: moving and changing code in the same commit makes review a lot more difficult, and refactor + bug fix in the same commit suffers a similar issue. I know this code inside out and it's still hard to be sure I'm catching everything -- imagine the burden on a reviewer who isn't as familiar with the code? That makes a lot of sense to me. Fwiw, I was reviewing my diff last night and noticed the code movement obfuscates the diff quite significantly. The PR is now split into three commits, with each phase introducing another refactor. 1. Code movement 2. Rename symbols 3. The actual bug fixes Thank you for the tip @scovich 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187359442
##
parquet-variant/src/variant/list.rs:
##
@@ -232,52 +232,57 @@ impl<'m, 'v> VariantList<'m, 'v> {
/// [invalid]: Self#Validation
pub fn get(&self, index: usize) -> Option> {
(index < self.num_elements).then(|| {
-self.try_get_impl(index)
-.and_then(Variant::validate)
+self.try_get_with_shallow_validation(index)
+.and_then(Variant::with_deep_validation)
.expect("Invalid variant array element")
})
}
/// Fallible version of `get`. Returns element by index, capturing
validation errors
pub fn try_get(&self, index: usize) -> Result, ArrowError>
{
-self.try_get_impl(index)?.validate()
+self.try_get_with_shallow_validation(index)?
+.with_deep_validation()
}
-/// Fallible iteration over the elements of this list.
-pub fn iter_try(&self) -> impl Iterator,
ArrowError>> + '_ {
-self.iter_try_impl().map(|result| result?.validate())
+// Fallible version of `get`, performing only basic (constant-time)
validation.
+fn try_get_with_shallow_validation(&self, index: usize) ->
Result, ArrowError> {
+// Fetch the value bytes between the two offsets for this index, from
the value array region
+// of the byte buffer
+let byte_range = self.get_offset(index)?..self.get_offset(index + 1)?;
+let value_bytes =
+slice_from_slice_at_offset(self.value, self.first_value_byte,
byte_range)?;
+
+Variant::try_new_with_metadata_and_shallow_validation(self.metadata,
value_bytes)
}
-// Fallible iteration that only performs basic (constant-time) validation.
-fn iter_try_impl(&self) -> impl Iterator,
ArrowError>> + '_ {
-(0..self.len()).map(move |i| self.try_get_impl(i))
+/// Fallible iteration over the elements of this list.
+pub fn try_iter(&self) -> impl Iterator,
ArrowError>> + '_ {
Review Comment:
Makes sense to me, I reverted it back to `iter_try`.
##
parquet-variant/src/variant.rs:
##
@@ -407,16 +412,16 @@ impl<'m, 'v> Variant<'m, 'v> {
/// Variant leaf values are always valid by construction, but [objects]
and [arrays] can be
/// constructed in unvalidated (and potentially invalid) state.
///
-/// If [`Self::is_validated`] is true, validation is a no-op. Otherwise,
the cost is `O(m + v)`
+/// If [`Self::is_fully_validated`] is true, validation is a no-op.
Otherwise, the cost is `O(m + v)`
/// where `m` and `v` are the sizes of metadata and value buffers,
respectively.
///
/// [objects]: VariantObject#Validation
/// [arrays]: VariantList#Validation
-pub fn validate(self) -> Result {
+pub fn with_deep_validation(self) -> Result {
Review Comment:
Using `with_full_validation`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
alamb commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3038892297 > ... at least, confident we caught all the ones in code paths this specific unit test exercises. But it's tricky -- I tried to make a visual sweep of all `XXX_with_shallow_validation` methods and didn't notice anything else. But then I re-ran the unit test "benchmark" and saw that it was still oddly slow. Found the third missed site during a second, more motivated, code inspection pass. Can we make your test into an actual benchmark (aka something like https://github.com/apache/arrow-rs/blob/main/parquet-variant/benches/variant_builder.rs )? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3038861103 > (I'm reasonably confident we can stop wondering if we caught them all, when the cost drops to sub-ms times) ... at least, confident we caught all the ones in code paths this specific unit test exercises. But it's tricky -- I tried to make a visual sweep of all `XXX_with_shallow_validation` methods and didn't notice anything else until I re-ran the unit test "benchmark" and saw that it was still oddly slow. Found the third missed site on a second code inspection pass. Other eyes and tests are probably good here! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187147699
##
parquet-variant/src/variant.rs:
##
@@ -382,21 +385,23 @@ impl<'m, 'v> Variant<'m, 'v> {
VariantBasicType::ShortString => {
Variant::ShortString(decoder::decode_short_string(value_metadata, value_data)?)
}
-VariantBasicType::Object => {
-Variant::Object(VariantObject::try_new_impl(metadata, value)?)
-}
-VariantBasicType::Array =>
Variant::List(VariantList::try_new_impl(metadata, value)?),
+VariantBasicType::Object => Variant::Object(
+VariantObject::try_new_with_shallow_validation(metadata,
value)?,
+),
+VariantBasicType::Array =>
Variant::List(VariantList::try_new_with_shallow_validation(
+metadata, value,
Review Comment:
aside: I don't think fmt usually likes multiple args on a single line? (I
sure sometimes _wish_ it did, applying the same rules for splitting that it
does for any other complex "expression", i.e. too much nesting or more than ~80
chars would cause line splits)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2187147699
##
parquet-variant/src/variant.rs:
##
@@ -382,21 +385,23 @@ impl<'m, 'v> Variant<'m, 'v> {
VariantBasicType::ShortString => {
Variant::ShortString(decoder::decode_short_string(value_metadata, value_data)?)
}
-VariantBasicType::Object => {
-Variant::Object(VariantObject::try_new_impl(metadata, value)?)
-}
-VariantBasicType::Array =>
Variant::List(VariantList::try_new_impl(metadata, value)?),
+VariantBasicType::Object => Variant::Object(
+VariantObject::try_new_with_shallow_validation(metadata,
value)?,
+),
+VariantBasicType::Array =>
Variant::List(VariantList::try_new_with_shallow_validation(
+metadata, value,
Review Comment:
aside: I don't think fmt usually likes multiple args on a single line? (I
sure sometimes _wish_ it did, applying the same rules for splitting that it
does for any other complex "expression", i.e. too much nesting or more than ~80
chars will cause line splits)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2186470644
##
parquet-variant/src/variant.rs:
##
@@ -407,16 +412,16 @@ impl<'m, 'v> Variant<'m, 'v> {
/// Variant leaf values are always valid by construction, but [objects]
and [arrays] can be
/// constructed in unvalidated (and potentially invalid) state.
///
-/// If [`Self::is_validated`] is true, validation is a no-op. Otherwise,
the cost is `O(m + v)`
+/// If [`Self::is_fully_validated`] is true, validation is a no-op.
Otherwise, the cost is `O(m + v)`
/// where `m` and `v` are the sizes of metadata and value buffers,
respectively.
///
/// [objects]: VariantObject#Validation
/// [arrays]: VariantList#Validation
-pub fn validate(self) -> Result {
+pub fn with_deep_validation(self) -> Result {
Review Comment:
Right now we have "full" and "deep" as ~synonyms? Should we pick just one?
```suggestion
pub fn with_full_validation(self) -> Result {
```
(IMO `is_deeply_validated` doesn't sound quite as nice as
`is_fully_validated`)
##
parquet-variant/src/variant/object.rs:
##
@@ -281,23 +286,23 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// Returns an iterator of (name, value) pairs over the fields of this
object.
pub fn iter(&self) -> impl Iterator)> +
'_ {
-self.iter_try_impl()
+self.try_iter_with_shallow_validation()
.map(|result| result.expect("Invalid variant object field value"))
}
/// Fallible iteration over the fields of this object.
-pub fn iter_try(
+pub fn try_iter(
&self,
) -> impl Iterator), ArrowError>>
+ '_ {
-self.iter_try_impl().map(|result| {
+self.try_iter_with_shallow_validation().map(|result| {
let (name, value) = result?;
-Ok((name, value.validate()?))
+Ok((name, value.with_deep_validation()?))
})
}
// Fallible iteration over the fields of this object that performs only
shallow (constant-cost)
// validation of field values.
-fn iter_try_impl(
+fn try_iter_with_shallow_validation(
&self,
) -> impl Iterator), ArrowError>>
+ '_ {
(0..self.num_elements).map(move |i| Ok((self.try_field_name(i)?,
self.try_field(i)?)))
Review Comment:
A missed fix:
```suggestion
(0..self.num_elements).map(move |i| {
let field = self.try_field_with_shallow_validation(i)?;
Ok((self.try_field_name(i)?, field))
})
```
##
parquet-variant/src/variant/list.rs:
##
@@ -232,52 +232,57 @@ impl<'m, 'v> VariantList<'m, 'v> {
/// [invalid]: Self#Validation
pub fn get(&self, index: usize) -> Option> {
(index < self.num_elements).then(|| {
-self.try_get_impl(index)
-.and_then(Variant::validate)
+self.try_get_with_shallow_validation(index)
+.and_then(Variant::with_deep_validation)
.expect("Invalid variant array element")
})
}
/// Fallible version of `get`. Returns element by index, capturing
validation errors
pub fn try_get(&self, index: usize) -> Result, ArrowError>
{
-self.try_get_impl(index)?.validate()
+self.try_get_with_shallow_validation(index)?
+.with_deep_validation()
}
-/// Fallible iteration over the elements of this list.
-pub fn iter_try(&self) -> impl Iterator,
ArrowError>> + '_ {
-self.iter_try_impl().map(|result| result?.validate())
+// Fallible version of `get`, performing only basic (constant-time)
validation.
+fn try_get_with_shallow_validation(&self, index: usize) ->
Result, ArrowError> {
+// Fetch the value bytes between the two offsets for this index, from
the value array region
+// of the byte buffer
+let byte_range = self.get_offset(index)?..self.get_offset(index + 1)?;
+let value_bytes =
+slice_from_slice_at_offset(self.value, self.first_value_byte,
byte_range)?;
+
+Variant::try_new_with_metadata_and_shallow_validation(self.metadata,
value_bytes)
}
-// Fallible iteration that only performs basic (constant-time) validation.
-fn iter_try_impl(&self) -> impl Iterator,
ArrowError>> + '_ {
-(0..self.len()).map(move |i| self.try_get_impl(i))
+/// Fallible iteration over the elements of this list.
+pub fn try_iter(&self) -> impl Iterator,
ArrowError>> + '_ {
Review Comment:
I actually had `try_iter` at first, but I changed because the name suggests
something that returns a Result of Iterator, rather than an Iterator of Result.
No strong opinions tho -- if people like `try_iter` better, let's go for it.
##
parquet-variant/src/variant.rs:
##
@@ -323,18 +325,19 @@ impl<'m, 'v> Variant<'m, 'v> {
metadata: VariantMetadata<'m>,
va
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
scovich commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3038845753 Very very nice catch! A couple missed details and some seeming misses by fmt, but the approach LGTM! > I wonder if these optimizations show up in any benchmark results 🤔 Oh my, yes. From https://github.com/apache/arrow-rs/issues/7869: > I added several timestamp markers (`std::time::Instant::elapsed`) to the `test_json_to_variant_object_very_large` test and ran with `cargo test --release`: > |elapsed time (ms)|delta (ms) | finished step| > |-:|-:|-| > 5 | 5 |generate data structures and json string > 250 | 245 | `json_to_variant` > 593 | 343 | `Variant::try_new` > 1758 | 1165 | `variant_to_json_string` > 1839 | 81 | build variant directly > 2171 | 332 | `Variant::try_new` > 2758 | 587 | `JsonToVariantTest::run` The PR currently only fixes two of the three places that wrongly trigger full validation, but with all three fixed locally it becomes: |elapsed time (ms)|delta (ms) | finished step| |-:|-:|-| 5 | 5 |generate data structures and json string 269 | 264 | `json_to_variant` 270 | 0.3 (!!) | `Variant::try_new` 637 | 367 | `variant_to_json_string` 719 | 82 | build variant directly 719 | 0.1 (!!) | `Variant::try_new` 973 | 254 (!) | `JsonToVariantTest::run` (I'm reasonably confident we can stop wondering if we caught them all, when the result drops to sub-ms times) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
alamb commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3038791308 I wonder if these optimizations show up in any benchmark results 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on code in PR #7871:
URL: https://github.com/apache/arrow-rs/pull/7871#discussion_r2186394241
##
parquet-variant/src/variant/object.rs:
##
@@ -236,20 +236,25 @@ impl<'m, 'v> VariantObject<'m, 'v> {
/// field IDs). The latter can only happen when working with an
unvalidated object produced by
/// [`Self::new`].
pub fn field(&self, i: usize) -> Option> {
-(i < self.len()).then(|| self.try_field_impl(i).expect("Invalid object
field value"))
+(i < self.len()).then(|| {
+self.try_field_with_shallow_validation(i)
+.expect("Invalid object field value")
+})
}
/// Fallible version of `field`. Returns field value by index, capturing
validation errors
pub fn try_field(&self, i: usize) -> Result, ArrowError> {
-self.try_field_impl(i)?.validate()
+self.try_field_with_shallow_validation(i)?
+.with_deep_validation()
}
// Attempts to retrieve the ith field value from the value region of the
byte buffer; it
// performs only basic (constant-cost) validation.
-fn try_field_impl(&self, i: usize) -> Result, ArrowError> {
+fn try_field_with_shallow_validation(&self, i: usize) ->
Result, ArrowError> {
let value_bytes = slice_from_slice(self.value,
self.first_value_byte..)?;
let value_bytes = slice_from_slice(value_bytes,
self.get_offset(i)?..)?;
-Variant::try_new_with_metadata(self.metadata, value_bytes)
+
+Variant::try_new_with_metadata_and_shallow_validation(self.metadata,
value_bytes)
Review Comment:
Another place where we want shallow validation but we call a method that
does deep validation as well
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] [Variant] Remove superflous validate call and rename methods [arrow-rs]
friendlymatthew commented on PR #7871: URL: https://github.com/apache/arrow-rs/pull/7871#issuecomment-3037528540 @scovich, I would appreciate your thoughts here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
