jecsand838 commented on code in PR #712:
URL: https://github.com/apache/arrow-site/pull/712#discussion_r2434705309
##########
_posts/2025-10-17-introducing-arrow-avro.md:
##########
@@ -0,0 +1,246 @@
+---
+layout: post
+title: "Announcing arrow-avro in Arrow Rust"
+description: "A new vectorized reader/writer for Avro native to Arrow, with
OCF, Single‑Object, and Confluent wire format support."
+date: "2025-10-17 00:00:00"
+author: jecsand838
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+`arrow-avro` is a Rust crate that reads and writes Avro data directly as Arrow
`RecordBatch`es. It supports Avro Object Container Files (OCF), Single‑Object
Encoding, and the Confluent Schema Registry wire format, with
projection/evolution, tunable batch sizing, and an optional `StringViewArray`
for faster strings. Its vectorized design reduces copies and cache misses,
making both batch (files) and streaming (Kafka) pipelines simpler and faster.
+
+## Motivation
Review Comment:
@alamb Also a really good idea.
I added this section under `Motivation`
```markdown
### Why not use the existing `apache-avro` crate?

Rust already has a mature, general‑purpose Avro SDK,
[`apache-avro`](https://crates.io/crates/apache-avro). It reads and writes Avro
records as Avro `Value`/Serde types and provides Object Container File (OCF)
readers and writers. What it does not do is decode directly into Arrow arrays,
so any Arrow integration must materialize rows and then build columns.

What’s needed is a complementary approach that decodes column‑by‑column
straight into Arrow builders and emits `RecordBatch`es. This would enable
projection pushdown and keep execution vectorized end to end. For projects like
DataFusion, access to a mature, upstream Arrow‑native reader would help
simplify the code path and reduce duplication.

Modern pipelines heighten this need because Avro is also used on the wire,
not just in files. Kafka ecosystems commonly use Confluent’s Schema Registry
framing, and many services adopt Avro Single‑Object Encoding. Decoding straight
into Arrow batches (rather than per‑row values) is what lets downstream compute
remain vectorized at streaming rates.
```
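
As a side note, the "decode straight into builders" point can be shown with a toy,
self-contained sketch. Per the Avro spec, `long` values are binary-encoded as
zigzag varints, and a columnar reader can drain a buffer of them directly into one
contiguous vector rather than building per‑row `Value`s. This is illustrative only,
not `arrow-avro`'s actual decoder:

```rust
/// Decode a buffer of Avro-encoded `long`s (zigzag + varint, per the Avro
/// spec) directly into a single contiguous "builder" vector, column at a time.
fn decode_zigzag_longs(mut buf: &[u8]) -> Result<Vec<i64>, String> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        // Varint decode: 7-bit groups, little-endian, high bit = continuation.
        let mut value: u64 = 0;
        let mut shift = 0u32;
        loop {
            let (byte, rest) = buf.split_first().ok_or("truncated varint")?;
            buf = rest;
            let b = *byte;
            value |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                break;
            }
            shift += 7;
            if shift > 63 {
                return Err("varint too long".into());
            }
        }
        // Zigzag decode: interleaved sign back to two's complement.
        out.push((value >> 1) as i64 ^ -((value & 1) as i64));
    }
    Ok(out)
}

fn main() {
    // 1, -1, 2, 64 in Avro long encoding.
    let bytes = [0x02, 0x01, 0x04, 0x80, 0x01];
    println!("{:?}", decode_zigzag_longs(&bytes).unwrap()); // [1, -1, 2, 64]
}
```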
and this section under `Introducing arrow-avro`
```markdown
### How this mirrors Parquet in Arrow‑rs

If you have used Parquet with Arrow‑rs, you already know the pattern. The
`parquet` crate exposes a `parquet::arrow` module that reads and writes Arrow
`RecordBatch`es directly. Most users reach for
`ParquetRecordBatchReaderBuilder` when reading and `ArrowWriter` when writing.
You choose columns up front, set a batch size, and the reader gives you Arrow
batches that flow straight into vectorized operators. This is the widely
adopted "format crate + Arrow‑native bridge" approach in Rust.

`arrow‑avro` brings that same bridge to Avro. You get a single
`ReaderBuilder` that can produce a file reader for OCF, or a streaming
`Decoder` for on‑the‑wire frames. Both return Arrow `RecordBatch`es, which
means engines can keep projection and filtering close to the reader and avoid
building rows only to reassemble them into columns later. For evolving
streams, a small `SchemaStore` resolves fingerprints or IDs before decoding, so
the batches that come out are already shaped for vectorized execution.

The reason this pattern matters is straightforward. Arrow’s columnar format
is designed for vectorized work and good cache locality. When your format
reader produces Arrow batches directly, you minimize copies and branchy per‑row
work, keeping downstream operators fast. That is the same story that made
`parquet::arrow` popular in Rust, and it is what `arrow‑avro` now enables for
Avro.
```
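
One more aside on the streaming side: the frame headers the `Decoder` has to
recognize are fixed by their specs. Confluent framing is a `0x00` magic byte plus
a 4‑byte big‑endian schema ID; Avro Single‑Object Encoding is the `0xC3 0x01`
marker plus an 8‑byte little‑endian CRC‑64‑AVRO schema fingerprint, each followed
by the Avro body. A minimal self‑contained sketch of that dispatch (illustrative
only, not `arrow-avro`'s internals):

```rust
/// The two wire framings described above, split into the key a `SchemaStore`
/// would look up (registry ID or schema fingerprint) plus the Avro body.
#[derive(Debug, PartialEq)]
enum Frame<'a> {
    /// Confluent wire format: 0x00 magic + 4-byte big-endian schema ID.
    Confluent { schema_id: u32, body: &'a [u8] },
    /// Single-object encoding: 0xC3 0x01 + 8-byte little-endian fingerprint.
    SingleObject { fingerprint: u64, body: &'a [u8] },
}

fn parse_frame(buf: &[u8]) -> Result<Frame<'_>, String> {
    if buf.len() >= 5 && buf[0] == 0x00 {
        let schema_id = u32::from_be_bytes(buf[1..5].try_into().unwrap());
        Ok(Frame::Confluent { schema_id, body: &buf[5..] })
    } else if buf.len() >= 10 && buf[0] == 0xC3 && buf[1] == 0x01 {
        let fingerprint = u64::from_le_bytes(buf[2..10].try_into().unwrap());
        Ok(Frame::SingleObject { fingerprint, body: &buf[10..] })
    } else {
        Err("unrecognized frame header".into())
    }
}

fn main() {
    // A Confluent-framed message: magic 0x00, schema ID 7, then the Avro body.
    let msg = [0x00, 0, 0, 0, 7, 0xAA];
    println!("{:?}", parse_frame(&msg).unwrap());
}
```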
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]