[ https://issues.apache.org/jira/browse/AVRO-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844610#action_12844610 ]
Doug Cutting commented on AVRO-459:
-----------------------------------

One approach is to use a record that contains an array of byte[], chunking things into, e.g., 64 KB chunks. Then one can use ValidatingEncoder and ValidatingDecoder to write and read arbitrarily large records in a type-safe manner. Generic tools that process data files by reading each record into a single object would still fail, but it would be possible to use the Python, Ruby, C, and C++ equivalents of the encoder/decoder APIs to process such data without otherwise modifying those implementations.

In Java we might also implement datum reader/writer subclasses that use a representation for bytes as a file handle plus a start position and length. We'd probably need to downcast the input to be able to extract these, and such values could only be used while their input file remains open. But one could then process such files using normal file iterators, etc. This could be implemented in other languages too, as needed.

(A variation might be to implement a BinaryDecoder whose readBytes() method returns a slice of a MappedByteBuffer. This might only be practical for really big files in 64-bit JVMs. But one could, when the file is opened, memory-map the entire file into a MappedByteBuffer, then, each time readBytes() is called, call slice() on it.)

These two approaches could be combined or used separately. My suggestion would be to do both: use a schema that chunks, to give other languages the possibility of processing such data without extending their implementations, and add a lazy representation in Java, to make it easier to process big binary values.

> Allow lazy reading of large fields from data files
> --------------------------------------------------
>
>                 Key: AVRO-459
>                 URL: https://issues.apache.org/jira/browse/AVRO-459
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Aaron Kimball
>
> The current file reader will attempt to materialize individual fields entirely in RAM. If a record is too big to fit in RAM, it would be good to get a stream-based API to very large fields.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
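
As an illustration of the chunked-record approach described in the comment above, here is a minimal sketch that streams an InputStream into a record whose only field is an array of byte chunks, validating the low-level encoder calls against the schema. The names "LargeBytes" and "chunks", the 64 KB chunk size, and the use of the current EncoderFactory API are assumptions for the sketch, not part of the proposal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class ChunkedBytesWriter {

  // Record with a single field holding an array of byte chunks.  The names
  // "LargeBytes"/"chunks" and the 64 KB chunk size are illustrative choices.
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"LargeBytes\",\"fields\":["
      + "{\"name\":\"chunks\",\"type\":{\"type\":\"array\",\"items\":\"bytes\"}}]}");

  static final int CHUNK_SIZE = 64 * 1024;

  /** Streams 'in' into a single LargeBytes record without buffering it all in RAM. */
  public static void write(InputStream in, OutputStream out) throws IOException {
    BinaryEncoder binary = EncoderFactory.get().binaryEncoder(out, null);
    // ValidatingEncoder checks each low-level call against SCHEMA, giving the
    // type safety mentioned above while still streaming chunk by chunk.
    Encoder enc = EncoderFactory.get().validatingEncoder(SCHEMA, binary);

    enc.writeArrayStart();
    byte[] buf = new byte[CHUNK_SIZE];
    int n;
    while ((n = in.read(buf)) > 0) {
      enc.setItemCount(1);   // one array block per chunk read
      enc.startItem();
      enc.writeBytes(buf, 0, n);
    }
    enc.writeArrayEnd();
    enc.flush();
  }
}
```

A ValidatingDecoder wrapped around a BinaryDecoder can read the same data back one chunk at a time with the matching readArrayStart()/readBytes()/arrayNext() calls.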
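
For the lazy Java representation, a hypothetical value type along these lines could stand in for byte[]: it holds an open file handle plus a start position and length, and is only valid while that file stays open. The class name LazyBytes and its methods are invented for illustration; only Avro's SeekableInput interface is taken from the existing API.

```java
import java.io.IOException;

import org.apache.avro.file.SeekableInput;

/**
 * Hypothetical lazy representation for a "bytes" value: a file handle plus a
 * start position and length, instead of a materialized byte[].  Values like
 * this are only usable while the underlying data file remains open.
 */
public class LazyBytes {
  private final SeekableInput in;  // open handle to the data file
  private final long start;        // offset of the value's first byte in the file
  private final long length;       // length of the value in bytes

  public LazyBytes(SeekableInput in, long start, long length) {
    this.in = in;
    this.start = start;
    this.length = length;
  }

  public long length() { return length; }

  /** Copies up to len bytes of the value, beginning at offset, into dest. */
  public int read(long offset, byte[] dest, int destOff, int len) throws IOException {
    if (offset >= length) {
      return -1;                                   // past the end of the value
    }
    in.seek(start + offset);
    int toRead = (int) Math.min(len, length - offset);
    return in.read(dest, destOff, toRead);
  }
}
```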
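
And a rough sketch of the MappedByteBuffer variation: map the file once when it is opened, then hand out zero-copy slices instead of copying bytes. A real implementation would do the slicing inside a BinaryDecoder subclass's readBytes(); this standalone class, whose name and API are assumptions, only shows the mapping and slicing mechanics.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Maps a whole file once and returns zero-copy slices of the mapping. */
public class MappedSliceReader implements AutoCloseable {
  private final RandomAccessFile file;
  private final MappedByteBuffer map;

  public MappedSliceReader(String path) throws IOException {
    file = new RandomAccessFile(path, "r");
    // A single mapping is limited to 2 GB; truly huge files would need
    // several mappings, and memory-mapping is most practical on 64-bit JVMs.
    map = file.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, file.length());
  }

  /** Returns a zero-copy view of [position, position + length). */
  public ByteBuffer slice(long position, int length) {
    ByteBuffer view = map.duplicate();   // independent position/limit cursor
    view.position((int) position);
    view.limit((int) position + length);
    return view.slice();                 // valid only while the file stays open
  }

  @Override
  public void close() throws IOException {
    file.close();
  }
}
```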