scovich commented on code in PR #8274:
URL: https://github.com/apache/arrow-rs/pull/8274#discussion_r2325128883


##########
arrow-avro/src/writer/encoder.rs:
##########
@@ -275,3 +384,434 @@ impl F64Encoder<'_> {
             .map_err(|e| ArrowError::IoError(format!("write f64: {e}"), e))
     }
 }
+
+struct Utf8GenericEncoder<'a, O: OffsetSizeTrait>(&'a GenericStringArray<O>);
+
+impl<'a, O: OffsetSizeTrait> Utf8GenericEncoder<'a, O> {
+    #[inline]
+    fn encode<W: Write + ?Sized>(&mut self, idx: usize, out: &mut W) -> 
Result<(), ArrowError> {
+        write_len_prefixed(out, self.0.value(idx).as_bytes())
+    }
+}
+
+type Utf8Encoder<'a> = Utf8GenericEncoder<'a, i32>;
+type Utf8LargeEncoder<'a> = Utf8GenericEncoder<'a, i64>;
+
+/// Unified field encoder:
+/// - Holds the inner `Encoder` (by value)
+/// - Tracks the column/site null buffer and whether any nulls exist
+/// - Carries per-site Avro `Nullability` and precomputed union branch (fast 
path)
+pub struct FieldEncoder<'a> {
+    encoder: Encoder<'a>,
+    nulls: Option<NullBuffer>,
+    has_nulls: bool,
+    nullability: Option<Nullability>,
+    /// Precomputed constant branch byte if the site is nullable but contains 
no nulls
+    pre: Option<u8>,
+}
+
+impl<'a> FieldEncoder<'a> {
+    fn make_encoder(
+        array: &'a dyn Array,
+        field: &Field,
+        plan: PlanRef<'_>,
+    ) -> Result<Self, ArrowError> {
+        let nulls = array.nulls().cloned();
+        let has_nulls = array.null_count() > 0;
+        let encoder = match plan {
+            FieldPlan::Struct { encoders } => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<StructArray>()
+                    .ok_or_else(|| ArrowError::SchemaError("Expected 
StructArray".into()))?;
+                Encoder::Struct(Box::new(StructEncoder::try_new(arr, 
encoders)?))
+            }
+            FieldPlan::List {
+                items_nullability,
+                item_plan,
+            } => match array.data_type() {
+                DataType::List(_) => {
+                    let arr = array
+                        .as_any()
+                        .downcast_ref::<ListArray>()
+                        .ok_or_else(|| ArrowError::SchemaError("Expected 
ListArray".into()))?;
+                    Encoder::List(Box::new(ListEncoder32::try_new(
+                        arr,
+                        *items_nullability,
+                        item_plan.as_ref(),
+                    )?))
+                }
+                DataType::LargeList(_) => {
+                    let arr = array
+                        .as_any()
+                        .downcast_ref::<LargeListArray>()
+                        .ok_or_else(|| ArrowError::SchemaError("Expected 
LargeListArray".into()))?;
+                    Encoder::LargeList(Box::new(ListEncoder64::try_new(
+                        arr,
+                        *items_nullability,
+                        item_plan.as_ref(),
+                    )?))
+                }
+                other => {
+                    return Err(ArrowError::SchemaError(format!(
+                        "Avro array site requires Arrow List/LargeList, found: 
{other:?}"
+                    )))
+                }
+            },
+            FieldPlan::Scalar => match array.data_type() {
+                DataType::Boolean => 
Encoder::Boolean(BooleanEncoder(array.as_boolean())),
+                DataType::Utf8 => {
+                    
Encoder::Utf8(Utf8GenericEncoder::<i32>(array.as_string::<i32>()))
+                }
+                DataType::LargeUtf8 => {
+                    
Encoder::Utf8Large(Utf8GenericEncoder::<i64>(array.as_string::<i64>()))
+                }
+                DataType::Int32 => 
Encoder::Int(IntEncoder(array.as_primitive::<Int32Type>())),
+                DataType::Int64 => 
Encoder::Long(LongEncoder(array.as_primitive::<Int64Type>())),
+                DataType::Float32 => {
+                    
Encoder::Float32(F32Encoder(array.as_primitive::<Float32Type>()))
+                }
+                DataType::Float64 => {
+                    
Encoder::Float64(F64Encoder(array.as_primitive::<Float64Type>()))
+                }
+                DataType::Binary => 
Encoder::Binary(BinaryEncoder(array.as_binary::<i32>())),
+                DataType::LargeBinary => {
+                    
Encoder::LargeBinary(BinaryEncoder(array.as_binary::<i64>()))
+                }
+                DataType::Timestamp(TimeUnit::Microsecond, _) => 
Encoder::Timestamp(LongEncoder(
+                    array.as_primitive::<TimestampMicrosecondType>(),
+                )),
+                other => {
+                    return Err(ArrowError::NotYetImplemented(format!(
+                        "Avro scalar type not yet supported: {other:?}"
+                    )));
+                }
+            },
+            other => {
+                return Err(ArrowError::NotYetImplemented(
+                    "Avro writer: {other:?} not yet supported".into(),
+                ));
+            }
+        };
+        Ok(Self {
+            encoder,
+            nulls,
+            has_nulls,
+            nullability: None,
+            pre: None,
+        })
+    }
+
+    #[inline]

Review Comment:
   I bet profile-guided optimizations would give back far more than 4%, since 
it covers far more than just inlining decisions. If we care about optimizing 
that much, we should probably go that route? But that requires benchmarking and 
profiling infra that arrow-rs probably lacks. A hybrid (and probably almost 
optimal) approach would be to manually gather .profdata, and then see what 
`llvm-profdata show` has to say about these functions. 
   
   Humans are provably terrible at guessing where to sprinkle hints like 
inlining or branch prediction, and so I would expect that some of the inline 
markers are badly-placed and actually _hurting_ performance -- especially for 
real-world (non-microbenchmark) scenarios where CPU instruction cache capacity 
matters -- which eats into the performance gains of the well-placed markers. 
Automatic or manual profile-guided optimization at least gives some concrete 
data to replace those guesses.
   
   Overall, my recommended approach would be to remove all the inline markers 
for now (focus on getting correct code in), and if we want to chase the 4% that 
could be a follow-up item to do it right. Preferably after the code has 
stabilized, since it's still evolving rapidly right now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to