[ https://issues.apache.org/jira/browse/AVRO-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844610#action_12844610 ]
Doug Cutting commented on AVRO-459:
-----------------------------------

One approach is to use a record that contains an array of byte[], chunking things into, e.g., 64 KB chunks. Then one can use ValidatingEncoder and ValidatingDecoder to write and read arbitrarily large records in a type-safe manner. Generic tools that process data files by reading each record into a single object would still fail, but it would be possible to use the Python, Ruby, C, and C++ equivalents of the encoder/decoder APIs to process such data without otherwise modifying those implementations.

In Java we might also implement datum reader/writer subclasses that use a representation for bytes as a file handle plus a start position and length. We'd probably need to downcast the input to be able to extract these, and such values could only be used while their input file remains open. But one could then process such files using normal file iterators, etc. This could be implemented in other languages too, as needed.

(A variation might be to implement a BinaryDecoder whose readBytes() method returns a slice of a MappedByteBuffer. This might only be practical for really big files in 64-bit JVMs. But one could, when the file is opened, memory-map the entire file into a MappedByteBuffer, then, each time readBytes() is called, call slice() on it.)

These two approaches could be combined or used separately. My suggestion would be to do both: use a schema that chunks, to give other languages the possibility of processing such data without extending their implementations, and add a lazy representation in Java, to make it easier to process big binary values.

> Allow lazy reading of large fields from data files
> --------------------------------------------------
>
>                 Key: AVRO-459
>                 URL: https://issues.apache.org/jira/browse/AVRO-459
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Aaron Kimball
>
> The current file reader will attempt to materialize individual fields entirely in RAM. If a record is too big to fit in RAM, it would be good to get a stream-based API to very large fields.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
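
As an illustration of the chunked-record approach described in the comment above, here is a minimal sketch that streams an InputStream into a record whose only field is an array of byte chunks, validating the low-level encoder calls against the schema. The names "LargeBytes" and "chunks", the 64 KB chunk size, and the use of the current EncoderFactory API are assumptions for the sketch, not part of the proposal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class ChunkedBytesWriter {

  // Record with a single field holding an array of byte chunks.  The names
  // "LargeBytes"/"chunks" and the 64 KB chunk size are illustrative choices.
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"LargeBytes\",\"fields\":["
      + "{\"name\":\"chunks\",\"type\":{\"type\":\"array\",\"items\":\"bytes\"}}]}");

  static final int CHUNK_SIZE = 64 * 1024;

  /** Streams 'in' into a single LargeBytes record without buffering it all in RAM. */
  public static void write(InputStream in, OutputStream out) throws IOException {
    BinaryEncoder binary = EncoderFactory.get().binaryEncoder(out, null);
    // ValidatingEncoder checks each low-level call against SCHEMA, giving the
    // type safety mentioned above while still streaming chunk by chunk.
    Encoder enc = EncoderFactory.get().validatingEncoder(SCHEMA, binary);

    enc.writeArrayStart();
    byte[] buf = new byte[CHUNK_SIZE];
    int n;
    while ((n = in.read(buf)) > 0) {
      enc.setItemCount(1);   // one array block per chunk read
      enc.startItem();
      enc.writeBytes(buf, 0, n);
    }
    enc.writeArrayEnd();
    enc.flush();
  }
}
```

A ValidatingDecoder wrapped around a BinaryDecoder can read the same data back one chunk at a time with the matching readArrayStart()/readBytes()/arrayNext() calls.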
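
For the lazy Java representation, a hypothetical value type along these lines could stand in for byte[]: it holds an open file handle plus a start position and length, and is only valid while that file stays open. The class name LazyBytes and its methods are invented for illustration; only Avro's SeekableInput interface is taken from the existing API.

```java
import java.io.IOException;

import org.apache.avro.file.SeekableInput;

/**
 * Hypothetical lazy representation for a "bytes" value: a file handle plus a
 * start position and length, instead of a materialized byte[].  Values like
 * this are only usable while the underlying data file remains open.
 */
public class LazyBytes {
  private final SeekableInput in;  // open handle to the data file
  private final long start;        // offset of the value's first byte in the file
  private final long length;       // length of the value in bytes

  public LazyBytes(SeekableInput in, long start, long length) {
    this.in = in;
    this.start = start;
    this.length = length;
  }

  public long length() { return length; }

  /** Copies up to len bytes of the value, beginning at offset, into dest. */
  public int read(long offset, byte[] dest, int destOff, int len) throws IOException {
    if (offset >= length) {
      return -1;                                   // past the end of the value
    }
    in.seek(start + offset);
    int toRead = (int) Math.min(len, length - offset);
    return in.read(dest, destOff, toRead);
  }
}
```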
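
And a rough sketch of the MappedByteBuffer variation: map the file once when it is opened, then hand out zero-copy slices instead of copying bytes. A real implementation would do the slicing inside a BinaryDecoder subclass's readBytes(); this standalone class, whose name and API are assumptions, only shows the mapping and slicing mechanics.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Maps a whole file once and returns zero-copy slices of the mapping. */
public class MappedSliceReader implements AutoCloseable {
  private final RandomAccessFile file;
  private final MappedByteBuffer map;

  public MappedSliceReader(String path) throws IOException {
    file = new RandomAccessFile(path, "r");
    // A single mapping is limited to 2 GB; truly huge files would need
    // several mappings, and memory-mapping is most practical on 64-bit JVMs.
    map = file.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, file.length());
  }

  /** Returns a zero-copy view of [position, position + length). */
  public ByteBuffer slice(long position, int length) {
    ByteBuffer view = map.duplicate();   // independent position/limit cursor
    view.position((int) position);
    view.limit((int) position + length);
    return view.slice();                 // valid only while the file stays open
  }

  @Override
  public void close() throws IOException {
    file.close();
  }
}
```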