jorgecarleitao commented on issue #1061: URL: https://github.com/apache/arrow-datafusion/issues/1061#issuecomment-935443808
The approach I took in `arrow2` was to not use `AvroValue` but to work directly with the byte stream. The reason is that `AvroValue` takes bytes by value, not by reference, which means that when we perform the conversion `File -> AvroValue -> Arrow` for `Utf8` or `Binary`, we end up with the transformation `bytes -> AvroValue::String(String) -> Utf8Array`, incurring an extra allocation _per item_. Since the Avro format is relatively simple to read, I just implemented a reader from bytes directly to Arrow. I [did not implement it for all types](https://github.com/jorgecarleitao/arrow2/blob/main/src/io/avro/read/deserialize.rs#L131), only the basic ones, but the idea stands (a rough sketch of the idea is included below).

So, if I understood correctly, the goal is to generalize the parser to more types. Which ones are needed here?
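For concreteness, here is a minimal sketch of the direct-bytes-to-Arrow idea for strings (this is not arrow2's actual implementation, see the linked `deserialize.rs` for that). Avro encodes a `string` as a zig-zag varint length followed by UTF-8 bytes, so those bytes can be copied straight into a `Utf8`-style offsets/values pair without materializing a `String` per row. The function names (`read_zigzag_long`, `extend_utf8`) and the plain `Vec` buffers are illustrative stand-ins for the real builders.

```rust
/// Decode an Avro zig-zag varint (`long`), as defined by the Avro spec.
fn read_zigzag_long(data: &[u8], pos: &mut usize) -> i64 {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let byte = data[*pos];
        *pos += 1;
        value |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // zig-zag -> signed
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}

/// Append `num_rows` Avro-encoded strings from `data` into Utf8-like buffers:
/// a growing `values` byte buffer plus an `offsets` vector. No per-item String.
fn extend_utf8(data: &[u8], num_rows: usize, offsets: &mut Vec<i32>, values: &mut Vec<u8>) {
    let mut pos = 0usize;
    for _ in 0..num_rows {
        let len = read_zigzag_long(data, &mut pos) as usize;
        // copy the UTF-8 bytes straight into the values buffer
        values.extend_from_slice(&data[pos..pos + len]);
        pos += len;
        offsets.push(values.len() as i32);
    }
}

fn main() {
    // two Avro-encoded strings: "ab" (len 2 -> zig-zag 4) and "xyz" (len 3 -> zig-zag 6)
    let data = [4u8, b'a', b'b', 6, b'x', b'y', b'z'];
    let mut offsets = vec![0i32];
    let mut values = Vec::new();
    extend_utf8(&data, 2, &mut offsets, &mut values);
    assert_eq!(offsets, vec![0, 2, 5]);
    assert_eq!(std::str::from_utf8(&values).unwrap(), "abxyz");
}
```

Going through `AvroValue` would instead allocate a `String` for each row and then copy it again into the array, which is the extra per-item allocation described above.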
