jecsand838 commented on code in PR #712:
URL: https://github.com/apache/arrow-site/pull/712#discussion_r2434705309
##########
_posts/2025-10-17-introducing-arrow-avro.md:
##########
@@ -0,0 +1,246 @@
+---
+layout: post
+title: "Announcing arrow-avro in Arrow Rust"
+description: "A new vectorized reader/writer for Avro native to Arrow, with
OCF, Single‑Object, and Confluent wire format support."
+date: "2025-10-17 00:00:00"
+author: jecsand838
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+`arrow-avro` is a Rust crate that reads and writes Avro data directly as Arrow
`RecordBatch`es. It supports Avro Object Container Files (OCF), Single‑Object
Encoding, and the Confluent Schema Registry wire format, with
projection/evolution, tunable batch sizing, and an optional `StringViewArray`
for faster strings. Its vectorized design reduces copies and cache misses,
making both batch (files) and streaming (Kafka) pipelines simpler and faster.
+
+## Motivation
Review Comment:
@alamb Also a really good idea.
I added this section under `Motivation`
```markdown
### Why not use the existing `apache-avro` crate?

Rust already has a mature, general‑purpose Avro SDK,
[`apache-avro`](https://crates.io/crates/apache-avro). It reads and writes Avro
records as Avro `Value`/Serde types and provides Object Container File (OCF)
readers and writers. What it does not do is decode directly into Arrow arrays,
so any Arrow integration must materialize rows and then build columns.

What’s needed is a complementary approach that decodes column‑by‑column
straight into Arrow builders and emits `RecordBatch`es. This would enable
projection pushdown and keep execution vectorized end to end. For projects like
DataFusion, access to a mature, upstream Arrow‑native reader would help
simplify the code path and reduce duplication.

Modern pipelines heighten this need because Avro is also used on the wire,
not just in files. Kafka ecosystems commonly use Confluent’s Schema Registry
framing, and many services adopt Avro Single‑Object Encoding. Decoding straight
into Arrow batches (rather than per‑row values) is what lets downstream compute
remain vectorized at streaming rates.
```
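
As a side note, the "decode straight into builders" point can be shown with a toy,
self-contained sketch. Per the Avro spec, `long` values are binary-encoded as
zigzag varints, and a columnar reader can drain a buffer of them directly into one
contiguous vector rather than building per‑row `Value`s. This is illustrative only,
not `arrow-avro`'s actual decoder:

```rust
/// Decode a buffer of Avro-encoded `long`s (zigzag + varint, per the Avro
/// spec) directly into a single contiguous "builder" vector, column at a time.
fn decode_zigzag_longs(mut buf: &[u8]) -> Result<Vec<i64>, String> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        // Varint decode: 7-bit groups, little-endian, high bit = continuation.
        let mut value: u64 = 0;
        let mut shift = 0u32;
        loop {
            let (byte, rest) = buf.split_first().ok_or("truncated varint")?;
            buf = rest;
            let b = *byte;
            value |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                break;
            }
            shift += 7;
            if shift > 63 {
                return Err("varint too long".into());
            }
        }
        // Zigzag decode: interleaved sign back to two's complement.
        out.push((value >> 1) as i64 ^ -((value & 1) as i64));
    }
    Ok(out)
}

fn main() {
    // 1, -1, 2, 64 in Avro long encoding.
    let bytes = [0x02, 0x01, 0x04, 0x80, 0x01];
    println!("{:?}", decode_zigzag_longs(&bytes).unwrap()); // [1, -1, 2, 64]
}
```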
and this section under `Introducing arrow-avro`
```markdown
### How this mirrors Parquet in Arrow‑rs

If you have used Parquet with Arrow‑rs, you already know the pattern. The
`parquet` crate exposes a `parquet::arrow` module that reads and writes Arrow
`RecordBatch`es directly. Most users reach for
`ParquetRecordBatchReaderBuilder` when reading and `ArrowWriter` when writing.
You choose columns up front, set a batch size, and the reader gives you Arrow
batches that flow straight into vectorized operators. This is the widely
adopted "format crate + Arrow‑native bridge" approach in Rust.

`arrow‑avro` brings that same bridge to Avro. You get a single
`ReaderBuilder` that can produce a file reader for OCF, or a streaming
`Decoder` for on‑the‑wire frames. Both return Arrow `RecordBatch`es, which
means engines can keep projection and filtering close to the reader and avoid
building rows only to reassemble them into columns later. For evolving
streams, a small `SchemaStore` resolves fingerprints or IDs before decoding, so
the batches that come out are already shaped for vectorized execution.

The reason this pattern matters is straightforward. Arrow’s columnar format
is designed for vectorized work and good cache locality. When your format
reader produces Arrow batches directly, you minimize copies and branchy per‑row
work, keeping downstream operators fast. That is the same story that made
`parquet::arrow` popular in Rust, and it is what `arrow‑avro` now enables for
Avro.
```
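
One more aside on the streaming side: the frame headers the `Decoder` has to
recognize are fixed by their specs. Confluent framing is a `0x00` magic byte plus
a 4‑byte big‑endian schema ID; Avro Single‑Object Encoding is the `0xC3 0x01`
marker plus an 8‑byte little‑endian CRC‑64‑AVRO schema fingerprint, each followed
by the Avro body. A minimal self‑contained sketch of that dispatch (illustrative
only, not `arrow-avro`'s internals):

```rust
/// The two wire framings described above, split into the key a `SchemaStore`
/// would look up (registry ID or schema fingerprint) plus the Avro body.
#[derive(Debug, PartialEq)]
enum Frame<'a> {
    /// Confluent wire format: 0x00 magic + 4-byte big-endian schema ID.
    Confluent { schema_id: u32, body: &'a [u8] },
    /// Single-object encoding: 0xC3 0x01 + 8-byte little-endian fingerprint.
    SingleObject { fingerprint: u64, body: &'a [u8] },
}

fn parse_frame(buf: &[u8]) -> Result<Frame<'_>, String> {
    if buf.len() >= 5 && buf[0] == 0x00 {
        let schema_id = u32::from_be_bytes(buf[1..5].try_into().unwrap());
        Ok(Frame::Confluent { schema_id, body: &buf[5..] })
    } else if buf.len() >= 10 && buf[0] == 0xC3 && buf[1] == 0x01 {
        let fingerprint = u64::from_le_bytes(buf[2..10].try_into().unwrap());
        Ok(Frame::SingleObject { fingerprint, body: &buf[10..] })
    } else {
        Err("unrecognized frame header".into())
    }
}

fn main() {
    // A Confluent-framed message: magic 0x00, schema ID 7, then the Avro body.
    let msg = [0x00, 0, 0, 0, 7, 0xAA];
    println!("{:?}", parse_frame(&msg).unwrap());
}
```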
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]