scovich commented on code in PR #9663:
URL: https://github.com/apache/arrow-rs/pull/9663#discussion_r3101216129


##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -321,12 +328,19 @@ impl<'a> VariantToShreddedArrayVariantRowBuilder<'a> {
         // If the variant is not an array, typed_value must be null.
         // If the variant is an array, value must be null.
         match variant {
-            Variant::List(list) => {
+            Variant::List(ref list) => {
                 self.nulls.append_non_null();
-                self.value_builder.append_null();
-                self.typed_value_builder
-                    .append_value(&Variant::List(list))?;
-                Ok(true)
+
+                // With `safe` cast option set to false, appending list of 
wrong size to
+                // `typed_value_builder` of type `FixedSizeList` will result 
in an error. In such a
+                // case, the provided list should be appended to the 
`value_builder.

Review Comment:
   I don't know... 
   
   My gut reaction is to forbid it completely because it's preferable to fail 
early and consistently rather than blow up unpredictably partway through 
shredding. The spec forbids us to fallback to `value` when the length is wrong, 
and we can't emit NULL either (regardless of safe casting options) because 
shredding is supposed to be lossless (it isn't a converting cast).
   
   But I also don't have any intuition of actual use cases for this feature, to 
weigh its benefits against the downsides. 
   
   NOTE: The same dilemma applies to fixed length binary and fixed length 
string, _except_ the spec technically doesn't forbid us to fallback to the 
`value` column. But how would we actually _use_ it? 
   
   # Thought experiment
   
   Suppose we _could_ use the `value` column as fallback in all three cases? 
Then the writer produces a shredded variant column where all wrong-length 
values are in `value` column (along with the wrong-type values) while all 
right-type-right-length values are in `typed_value`. But what good does it do? 
A reader who comes along later has no way to know about any of this, because 
there's nothing in the parquet that reliably indicates this was a fixed length 
array/binary/string column. Sure, we might embed arrow metadata, but non-arrow 
readers would ignore it and even arrow readers are on shaky ground trusting it 
because this is something entirely outside the variant spec.
   
   Overall, enforcing list/binary/string length feels like a logical validation 
issue while shredding is a physical operation. And I suspect it's best to leave 
client code to enforce logical validity of all forms, not try to encode it in 
the type system. Additionally, It feels like fixed-len-xxx types fall in the 
same category as unsigned or non-zero types. True, arrow has unsigned int types 
but it lacks non-zero types. And parquet has neither.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to