scovich commented on code in PR #8831:
URL: https://github.com/apache/arrow-rs/pull/8831#discussion_r2547801344
##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -236,6 +254,285 @@ impl<'a> VariantToShreddedPrimitiveVariantRowBuilder<'a> {
}
}
+pub(crate) struct VariantToShreddedArrayVariantRowBuilder<'a> {
+ value_builder: VariantValueArrayBuilder,
+ typed_value_builder: ArrayVariantToArrowRowBuilder<'a>,
+}
+
+impl<'a> VariantToShreddedArrayVariantRowBuilder<'a> {
+ fn try_new(
+ data_type: &'a DataType,
+ cast_options: &'a CastOptions,
+ capacity: usize,
+ ) -> Result<Self> {
+ Ok(Self {
+ value_builder: VariantValueArrayBuilder::new(capacity),
+ typed_value_builder: ArrayVariantToArrowRowBuilder::try_new(
+ data_type,
+ cast_options,
+ capacity,
+ )?,
+ })
+ }
+
+ fn append_null(&mut self) -> Result<()> {
+ self.value_builder.append_value(Variant::Null);
+ self.typed_value_builder.append_null();
+ Ok(())
+ }
+
+ fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> {
+ // If the value is not an array, typed_value must be null.
+ // If the value is an array, value must be null.
+ match value {
+ Variant::List(list) => {
+ self.value_builder.append_null();
+ self.typed_value_builder.append_value(list)?;
+ Ok(true)
+ }
+ other => {
+ self.value_builder.append_value(other);
+ self.typed_value_builder.append_null();
+ Ok(false)
+ }
+ }
+ }
+
+ fn finish(self) -> Result<(BinaryViewArray, ArrayRef, Option<NullBuffer>)> {
+ Ok((
+ self.value_builder.build()?,
+ self.typed_value_builder.finish()?,
+ // All elements of an array must be present (not missing) because
+ // the array Variant encoding does not allow missing elements
+ None,
+ ))
+ }
+}
+
+enum ArrayVariantToArrowRowBuilder<'a> {
+ List(VariantToListArrowRowBuilder<'a, i32>),
+ LargeList(VariantToListArrowRowBuilder<'a, i64>),
+ ListView(VariantToListViewArrowRowBuilder<'a, i32>),
+ LargeListView(VariantToListViewArrowRowBuilder<'a, i64>),
Review Comment:
Can we introduce a `ListLikeArrayBuilder` trait (**) that encapsulates the
(minimal) differences between these four types, so that
`ArrayVariantToArrowRowBuilder` becomes a generic struct instead of an enum?
(**) cf. `StringLikeArrayBuilder`, which serves the same purpose for strings.
##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -236,6 +254,285 @@ impl<'a> VariantToShreddedPrimitiveVariantRowBuilder<'a> {
}
}
+pub(crate) struct VariantToShreddedArrayVariantRowBuilder<'a> {
+ value_builder: VariantValueArrayBuilder,
+ typed_value_builder: ArrayVariantToArrowRowBuilder<'a>,
+}
+
+impl<'a> VariantToShreddedArrayVariantRowBuilder<'a> {
+ fn try_new(
+ data_type: &'a DataType,
+ cast_options: &'a CastOptions,
+ capacity: usize,
+ ) -> Result<Self> {
+ Ok(Self {
+ value_builder: VariantValueArrayBuilder::new(capacity),
+ typed_value_builder: ArrayVariantToArrowRowBuilder::try_new(
+ data_type,
+ cast_options,
+ capacity,
+ )?,
+ })
+ }
+
+ fn append_null(&mut self) -> Result<()> {
+ self.value_builder.append_value(Variant::Null);
+ self.typed_value_builder.append_null();
+ Ok(())
+ }
+
+ fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> {
+ // If the value is not an array, typed_value must be null.
+ // If the value is an array, value must be null.
+ match value {
+ Variant::List(list) => {
+ self.value_builder.append_null();
+ self.typed_value_builder.append_value(list)?;
Review Comment:
Double checking -- if I try to shred as `List<i32>` and I encounter a
variant array `[..., "hi", ...]`, the bad entry will either become NULL or
cause an error, depending on cast options?
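To make the two outcomes in question concrete, here is a minimal, self-contained sketch. It is not the PR's code: `Value`, `CastOptions`, and `shred_as_i32` are hypothetical stand-ins, and the `safe` flag is only modeled on arrow's `CastOptions::safe` (mismatches become NULL when `safe` is true, and produce an error otherwise). Whether the builder in this PR actually behaves this way is exactly what the question is asking.

```rust
/// Hypothetical stand-in for a variant list element (not the real `Variant` type).
#[derive(Debug, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
}

/// Hypothetical stand-in modeled on the `safe` flag of arrow's `CastOptions`.
struct CastOptions {
    safe: bool,
}

/// Try to shred every list element as `i32`.
/// With `safe == true` a mismatched element becomes NULL (`None`);
/// with `safe == false` the first mismatch aborts with an error.
fn shred_as_i32(items: &[Value], options: &CastOptions) -> Result<Vec<Option<i32>>, String> {
    items
        .iter()
        .map(|item| match item {
            Value::Int(v) => Ok(Some(*v)),
            _ if options.safe => Ok(None),
            other => Err(format!("cannot cast {other:?} to Int32")),
        })
        .collect()
}
```

Under these assumptions, shredding `[1, "hi", 3]` yields `[1, NULL, 3]` with `safe: true` and an error with `safe: false`.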
##########
parquet-variant-compute/src/shred_variant.rs:
##########
@@ -236,6 +254,285 @@ impl<'a> VariantToShreddedPrimitiveVariantRowBuilder<'a> {
}
}
+pub(crate) struct VariantToShreddedArrayVariantRowBuilder<'a> {
+ value_builder: VariantValueArrayBuilder,
+ typed_value_builder: ArrayVariantToArrowRowBuilder<'a>,
+}
+
+impl<'a> VariantToShreddedArrayVariantRowBuilder<'a> {
+ fn try_new(
+ data_type: &'a DataType,
+ cast_options: &'a CastOptions,
+ capacity: usize,
+ ) -> Result<Self> {
+ Ok(Self {
+ value_builder: VariantValueArrayBuilder::new(capacity),
+ typed_value_builder: ArrayVariantToArrowRowBuilder::try_new(
+ data_type,
+ cast_options,
+ capacity,
+ )?,
+ })
+ }
+
+ fn append_null(&mut self) -> Result<()> {
+ self.value_builder.append_value(Variant::Null);
+ self.typed_value_builder.append_null();
+ Ok(())
+ }
+
+ fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> {
+ // If the value is not an array, typed_value must be null.
+ // If the value is an array, value must be null.
+ match value {
+ Variant::List(list) => {
+ self.value_builder.append_null();
+ self.typed_value_builder.append_value(list)?;
+ Ok(true)
+ }
+ other => {
+ self.value_builder.append_value(other);
+ self.typed_value_builder.append_null();
+ Ok(false)
+ }
+ }
+ }
+
+ fn finish(self) -> Result<(BinaryViewArray, ArrayRef, Option<NullBuffer>)> {
+ Ok((
+ self.value_builder.build()?,
+ self.typed_value_builder.finish()?,
+ // All elements of an array must be present (not missing) because
+ // the array Variant encoding does not allow missing elements
+ None,
+ ))
+ }
+}
+
+enum ArrayVariantToArrowRowBuilder<'a> {
+ List(VariantToListArrowRowBuilder<'a, i32>),
+ LargeList(VariantToListArrowRowBuilder<'a, i64>),
+ ListView(VariantToListViewArrowRowBuilder<'a, i32>),
+ LargeListView(VariantToListViewArrowRowBuilder<'a, i64>),
Review Comment:
A quick analysis suggests the trait needs:
* An associated type: `type Offset: OffsetSizeTrait`
* A constructor: `fn try_new(...) -> Result<Self>`
* Helper functions to support `append_null` and `append_value` (nulls,
offsets, etc.)
* A finisher: `fn finish(self) -> Result<ArrayRef>`
That would mean two trait implementations (one for lists and one for list
views), both generic over `Offset`.
From there, the outer builder should be able to implement its own logic
just once instead of four times.
Double check, though -- the above is a _very_ rough sketch. The goal is to
minimize boilerplate and duplication, using a careful selection of trait
methods that capture the essential differences between lists and list views.
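To illustrate the shape of that idea, here is a self-contained sketch that compiles without arrow. Everything in it is a hypothetical stand-in: `OffsetSize` plays the role of `OffsetSizeTrait`, `Layout` replaces the finished `ArrayRef`, `push_row` is one possible helper method, and the constructor is simplified to an infallible `new(capacity)`. It only tracks offsets and validity, not child values, and is not meant as the final trait design.

```rust
/// Hypothetical stand-in for arrow's `OffsetSizeTrait` (i32 and i64 only).
trait OffsetSize: Copy + Default + PartialEq + std::fmt::Debug + TryFrom<usize> {}
impl OffsetSize for i32 {}
impl OffsetSize for i64 {}

/// Simplified stand-in for the finished array: only the layout buffers.
#[derive(Debug, PartialEq)]
enum Layout<O> {
    /// List / LargeList: n + 1 monotonically increasing offsets.
    Offsets(Vec<O>),
    /// ListView / LargeListView: n independent (offset, size) pairs.
    Views(Vec<(O, O)>),
}

fn to_offset<O: OffsetSize>(n: usize) -> Result<O, String> {
    O::try_from(n).map_err(|_| format!("offset {n} overflows the offset type"))
}

/// Captures the only part that differs between lists and list views.
trait ListLikeArrayBuilder: Sized {
    type Offset: OffsetSize;
    fn new(capacity: usize) -> Self;
    /// Record one row covering `len` child elements starting at `start`.
    fn push_row(&mut self, start: usize, len: usize) -> Result<(), String>;
    fn finish(self) -> Layout<Self::Offset>;
}

struct ListBuilder<O> { offsets: Vec<O> }
struct ListViewBuilder<O> { views: Vec<(O, O)> }

impl<O: OffsetSize> ListLikeArrayBuilder for ListBuilder<O> {
    type Offset = O;
    fn new(capacity: usize) -> Self {
        let mut offsets = Vec::with_capacity(capacity + 1);
        offsets.push(O::default()); // the leading zero offset
        Self { offsets }
    }
    fn push_row(&mut self, start: usize, len: usize) -> Result<(), String> {
        self.offsets.push(to_offset(start + len)?); // only the end offset is stored
        Ok(())
    }
    fn finish(self) -> Layout<O> { Layout::Offsets(self.offsets) }
}

impl<O: OffsetSize> ListLikeArrayBuilder for ListViewBuilder<O> {
    type Offset = O;
    fn new(capacity: usize) -> Self { Self { views: Vec::with_capacity(capacity) } }
    fn push_row(&mut self, start: usize, len: usize) -> Result<(), String> {
        self.views.push((to_offset(start)?, to_offset(len)?));
        Ok(())
    }
    fn finish(self) -> Layout<O> { Layout::Views(self.views) }
}

/// The outer builder: its append logic is written once for all four layouts.
struct ListShredder<B: ListLikeArrayBuilder> {
    layout: B,
    validity: Vec<bool>,
    child_len: usize, // number of child elements appended so far
}

impl<B: ListLikeArrayBuilder> ListShredder<B> {
    fn new(capacity: usize) -> Self {
        Self { layout: B::new(capacity), validity: Vec::with_capacity(capacity), child_len: 0 }
    }
    fn append_null(&mut self) -> Result<(), String> {
        self.layout.push_row(self.child_len, 0)?; // empty row, marked invalid
        self.validity.push(false);
        Ok(())
    }
    fn append_value(&mut self, len: usize) -> Result<(), String> {
        self.layout.push_row(self.child_len, len)?;
        self.child_len += len;
        self.validity.push(true);
        Ok(())
    }
    fn finish(self) -> (Layout<B::Offset>, Vec<bool>) {
        (self.layout.finish(), self.validity)
    }
}
```

With this shape, `ListShredder::<ListBuilder<i32>>`, `ListShredder::<ListBuilder<i64>>`, `ListShredder::<ListViewBuilder<i32>>`, and `ListShredder::<ListViewBuilder<i64>>` cover the four cases that the enum currently spells out by hand.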