If you only intend to examine a subset of the fields, you can pass in a
version of your schema with all but those fields removed as the 'reader'
schema.  Fields not in this minimized schema will be skipped without
creating any structures.
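
For instance, here is a rough sketch of that approach (the file name and
field names below are placeholders; the pruned reader schema must keep the
original record's name so it still resolves against the writer schema
stored in the file):

  import java.io.File;
  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;

  public class PrunedReaderExample {
    public static void main(String[] args) throws Exception {
      // Reader schema keeps only the fields we care about.
      Schema readerSchema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");

      // The DataFileReader supplies the writer schema from the file itself.
      GenericDatumReader<GenericRecord> datumReader =
          new GenericDatumReader<>(readerSchema);
      try (DataFileReader<GenericRecord> fileReader =
               new DataFileReader<>(new File("events.avro"), datumReader)) {
        for (GenericRecord record : fileReader) {
          // Only "id" and "timestamp" are materialized; other fields are skipped.
          System.out.println(record.get("id") + " " + record.get("timestamp"));
        }
      }
    }
  }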

Alternatively, you can walk a schema with the decoder API directly and
process the data as you go, without ever constructing a complete in-memory
representation of it.

As an example, see GenericDatumReader#skip().

https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java#L564

You can write a method with a similar structure: a big switch statement
over Avro types with recursive calls, except that yours would selectively
process some fields rather than skipping them all.  This permits SAX-like,
event-based processing, if you remember that from XML parsing.
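
A rough sketch of such a walker (the Handler interface and its callbacks
are invented for illustration; you would obtain the Decoder from
DecoderFactory and walk with the schema the data was actually written
with):

  import java.io.IOException;
  import org.apache.avro.Schema;
  import org.apache.avro.Schema.Field;
  import org.apache.avro.io.Decoder;

  public class SchemaWalker {

    // Hypothetical event callbacks; add whichever your processing needs.
    public interface Handler {
      void onString(String field, CharSequence value);
      void onLong(String field, long value);
    }

    private final Handler handler;

    public SchemaWalker(Handler handler) { this.handler = handler; }

    // Walk one datum described by 'schema' from 'in' without building a
    // full in-memory representation of it.
    public void walk(String field, Schema schema, Decoder in) throws IOException {
      switch (schema.getType()) {
        case RECORD:
          for (Field f : schema.getFields())
            walk(f.name(), f.schema(), in);
          break;
        case ARRAY:
          for (long n = in.readArrayStart(); n != 0; n = in.arrayNext())
            for (long i = 0; i < n; i++)
              walk(field, schema.getElementType(), in);
          break;
        case MAP:
          for (long n = in.readMapStart(); n != 0; n = in.mapNext())
            for (long i = 0; i < n; i++) {
              in.skipString();                        // map key
              walk(field, schema.getValueType(), in);
            }
          break;
        case UNION:
          walk(field, schema.getTypes().get(in.readIndex()), in);
          break;
        // Fields we care about: emit events instead of building structures.
        case STRING:  handler.onString(field, in.readString()); break;
        case LONG:    handler.onLong(field, in.readLong());     break;
        // Everything else is read and discarded; add callbacks as needed.
        case BYTES:   in.skipBytes();                      break;
        case FIXED:   in.skipFixed(schema.getFixedSize()); break;
        case ENUM:    in.readEnum();                       break;
        case INT:     in.readInt();                        break;
        case FLOAT:   in.readFloat();                      break;
        case DOUBLE:  in.readDouble();                     break;
        case BOOLEAN: in.readBoolean();                    break;
        case NULL:    in.readNull();                       break;
      }
    }
  }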

Doug


On Fri, Oct 9, 2020 at 3:51 PM Richard Ney <[email protected]> wrote:

> I have the need to read in Avro messages from files that inflate to sizes
> that are causing OOM errors due to the in-memory representation of the
> inflated document exceeding 1.5 GB of heap. Is there a way to stream the
> file into the application, inflate it, and marshal the contents without
> pulling the entire message into memory, or am I restricted to chunking
> only at the message level?
>
