If you only intend to examine a subset of the fields, you can pass a version of your schema with all but those fields removed as the 'reader' schema. Fields not present in this minimized schema will be skipped during decoding without allocating any in-memory structures.
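As a rough illustration of that idea, the sketch below prunes a record schema (Avro schemas are JSON) down to a chosen set of fields; the result is what you would pass as the reader schema. The `prune_schema` helper and the `User` schema are invented for this example, not part of any Avro API:

```python
import json

def prune_schema(writer_schema_json, keep_fields):
    # Hypothetical helper: return a minimized copy of an Avro record
    # schema (as JSON text) containing only the named fields. Avro's
    # schema resolution then skips every writer field absent from it.
    schema = json.loads(writer_schema_json)
    assert schema["type"] == "record"
    pruned = dict(schema)
    pruned["fields"] = [f for f in schema["fields"] if f["name"] in keep_fields]
    return json.dumps(pruned)

writer = json.dumps({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id",   "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "blob", "type": "bytes"},  # large field we want skipped
    ],
})
reader = prune_schema(writer, {"id", "name"})
print(reader)
```

Note that pruning is the safe direction: fields missing from the reader schema are skipped, whereas fields added to the reader schema would need defaults.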
Alternatively, you can walk a schema calling the decoder API directly and process the data without ever constructing a complete in-memory representation of it. As an example, see GenericDatumReader#skip():

https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java#L564

You can write a method with a similar structure, a big switch statement over Avro types with recursive calls, except yours might selectively process some fields. This permits SAX-like, event-based processing, if you remember that from XML parsing.

Doug

On Fri, Oct 9, 2020 at 3:51 PM Richard Ney <[email protected]> wrote:

> I need to read Avro messages from files that inflate to sizes
> that cause OOM errors, because the in-memory representation of the
> inflated document exceeds 1.5 GB of heap. Is there a way to stream the
> file into the application, inflate it, and marshal the contents without
> pulling the entire message into memory, or am I restricted to chunking only
> at the message level?
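To make the shape of such a walker concrete, here is a minimal sketch of the pattern (one switch over types with recursive calls, emitting events rather than building a tree). The wire format is invented for this example (big-endian fixed-width longs, length-prefixed UTF-8 strings) purely to keep it self-contained; real Avro uses zig-zag varint encoding, so treat this as the pattern, not the codec:

```python
import io
import struct

def walk(schema, stream, wanted, emit, path=""):
    # One switch over types, recursing into records; unwanted fields are
    # consumed but never decoded or stored, mirroring skip().
    kind = schema["type"]
    if kind == "long":
        raw = stream.read(8)
        if wanted(path):
            emit(path, struct.unpack(">q", raw)[0])
    elif kind == "string":
        (length,) = struct.unpack(">i", stream.read(4))
        if wanted(path):
            emit(path, stream.read(length).decode("utf-8"))
        else:
            stream.seek(length, io.SEEK_CUR)  # skip without decoding
    elif kind == "record":
        for field in schema["fields"]:
            child = field["type"]
            if not isinstance(child, dict):
                child = {"type": child}
            walk(child, stream, wanted, emit, path + "/" + field["name"])
    else:
        raise ValueError("unhandled type: %r" % kind)

# Usage: encode one record by hand, then pull out only /name.
schema = {"type": "record", "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"},
]}
buf = io.BytesIO(struct.pack(">q", 42) + struct.pack(">i", 3) + b"bob")
events = []
walk(schema, buf, wanted=lambda p: p == "/name",
     emit=lambda p, v: events.append((p, v)))
print(events)  # → [('/name', 'bob')]
```

Because the handler sees one event at a time, peak memory stays bounded by the largest single value you choose to decode, not by the size of the whole record.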
