jecsand838 commented on code in PR #9171:
URL: https://github.com/apache/arrow-rs/pull/9171#discussion_r2723863586


##########
arrow-avro/src/writer/mod.rs:
##########
@@ -172,6 +480,74 @@ impl WriterBuilder {
     }
 }
 
+/// A row-by-row streaming encoder for Avro **Single Object Encoding** (SOE) streams.

Review Comment:
   Great question! At the byte level `Writer<_, AvroSoeFormat>` writing into a 
`Vec<u8>` does produce the same concatenated output stream.
   
   The reason for `Encoder`, however, is that neither SOE nor the Confluent/Apicurio wire formats include a length field (SOE is just `0xC3 0x01` + an 8-byte hashed schema fingerprint + body, while Confluent is a magic byte + a 4-byte schema id + body). So once multiple rows are written into a single `Vec`, there is no cheap and fully reliable way (especially for the wire formats) to split it back into per-row payloads without decoding them. Support for the plain binary format was essentially blocked, since those payloads aren't framed at all and therefore have no makeshift delimiter to scan for and split by.
   
   Additionally, I hit performance bottlenecks when developing message-oriented sinks (Kafka/Pulsar/etc.) downstream of `arrow-avro`. Those came from having to use the `Writer` to encode one-row batches while tracking `Vec` lengths, which is much less efficient due to repeated per-call setup and per-row allocations and copies.
   
   The new `Encoder` solves this, and enables the binary format, by recording row-end offsets during encoding and returning zero-copy `Bytes` slices per row (via `EncodedRows`).
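   A minimal sketch of that offset-recording idea (hypothetical `RowEncoder` type, not the actual `Encoder`/`EncodedRows` API; plain slices stand in for `bytes::Bytes`): rows are appended to one buffer, each row's end offset is recorded, and per-row payloads are recovered afterwards as cheap slices of the shared buffer.

```rust
/// Accumulates encoded rows in one buffer and remembers row boundaries,
/// so per-row payloads can be handed out without re-parsing the stream.
struct RowEncoder {
    buf: Vec<u8>,
    row_ends: Vec<usize>,
}

impl RowEncoder {
    fn new() -> Self {
        Self { buf: Vec::new(), row_ends: Vec::new() }
    }

    /// Append one already-encoded row and record where it ends.
    fn push_row(&mut self, payload: &[u8]) {
        self.buf.extend_from_slice(payload);
        self.row_ends.push(self.buf.len());
    }

    /// Split the buffer back into per-row slices using the recorded
    /// offsets; with `bytes::Bytes` these would be zero-copy handles.
    fn rows(&self) -> Vec<&[u8]> {
        let mut start = 0;
        self.row_ends
            .iter()
            .map(|&end| {
                let row = &self.buf[start..end];
                start = end;
                row
            })
            .collect()
    }
}

fn main() {
    let mut enc = RowEncoder::new();
    enc.push_row(b"\xC3\x01abc");
    enc.push_row(b"\xC3\x01defgh");
    let rows = enc.rows();
    assert_eq!(rows.len(), 2);
    assert_eq!(rows[1], &b"\xC3\x01defgh"[..]);
}
```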



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to