jecsand838 commented on code in PR #9171:
URL: https://github.com/apache/arrow-rs/pull/9171#discussion_r2723863586
##########
arrow-avro/src/writer/mod.rs:
##########
@@ -172,6 +480,74 @@ impl WriterBuilder {
}
}
+/// A row-by-row streaming encoder for Avro **Single Object Encoding** (SOE) streams.
Review Comment:
Great question! At the byte level `Writer<_, AvroSoeFormat>` writing into a
`Vec<u8>` does produce the same concatenated output stream.
The reason for `Encoder`, however, is that neither SOE nor the
Confluent/Apicurio wire formats include a length field (SOE is just 0xC3 0x01 +
an 8-byte hashed fingerprint + the body, while Confluent is a magic byte + a
4-byte schema id + the body). So once multiple rows are written into a single
`Vec`, there is no cheap, fully reliable way (especially for the wire formats)
to split it back into per-row payloads without either decoding the rows or
resorting to hacks. Support for the plain binary format was effectively blocked,
since those payloads aren't framed at all and therefore have no makeshift
delimiter to scan for and split on.
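To make the framing concrete, here is a minimal sketch of the two layouts described above. The function names are illustrative, not part of `arrow-avro`; the byte values follow the Avro spec (SOE marker 0xC3 0x01, little-endian fingerprint) and the Confluent wire format (0x00 magic byte, big-endian schema id). Note that neither frame records the body length, which is exactly why concatenated frames can't be cheaply split apart again:

```rust
/// Avro Single Object Encoding: 2-byte marker + 8-byte little-endian
/// CRC-64-AVRO schema fingerprint + Avro-encoded body. No length field.
fn soe_frame(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = vec![0xC3, 0x01];
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Confluent wire format: 0x00 magic byte + 4-byte big-endian schema
/// registry id + Avro-encoded body. Also no length field.
fn confluent_frame(schema_id: u32, body: &[u8]) -> Vec<u8> {
    let mut out = vec![0x00];
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(body);
    out
}
```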
Additionally, I hit performance bottlenecks when developing message-oriented
sinks (Kafka/Pulsar/etc.) downstream of `arrow-avro`. They came from having to
use the `Writer` to encode one-row batches while tracking `Vec` lengths, which
is much less efficient due to repeated per-call setup and per-row allocations
and copies.
The new `Encoder` solves this, and enables the plain binary format, by
recording row-end offsets during encoding and returning zero-copy `Bytes`
slices per row (via `EncodedRows`).
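A toy model of that offset-recording idea, just to illustrate the mechanism (this is not the `Encoder`/`EncodedRows` API from the PR): rows are encoded into one contiguous buffer, the end offset of each row is recorded as it completes, and per-row views are produced afterwards without copying the payload bytes. The real implementation returns `bytes::Bytes` slices so the rows stay zero-copy even when they outlive the encoder:

```rust
/// Toy offset-tracking buffer: one contiguous byte buffer plus the end
/// offset of each completed row.
struct RowBuffer {
    buf: Vec<u8>,
    offsets: Vec<usize>, // end offset of each row in `buf`
}

impl RowBuffer {
    fn new() -> Self {
        Self { buf: Vec::new(), offsets: Vec::new() }
    }

    /// Append one encoded row and record where it ends.
    fn push_row(&mut self, encoded: &[u8]) {
        self.buf.extend_from_slice(encoded);
        self.offsets.push(self.buf.len());
    }

    /// Per-row views into the shared buffer; no payload bytes are copied.
    fn rows(&self) -> Vec<&[u8]> {
        let mut out = Vec::with_capacity(self.offsets.len());
        let mut start = 0;
        for &end in &self.offsets {
            out.push(&self.buf[start..end]);
            start = end;
        }
        out
    }
}
```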
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]