Re: Reading huge records
Thank you very much!

On Tue, Dec 16, 2014 at 12:34 PM, Doug Cutting cutt...@apache.org wrote:

Avro does permit partial reading of arrays. Arrays are written as a series of length-prefixed blocks:

http://avro.apache.org/docs/current/spec.html#binary_encode_complex

The standard encoders do not write arrays as multiple blocks, but BlockingBinaryEncoder does. It can be used with any DatumWriter implementation. If, for example, you have an array whose implementation is backed by a database and contains billions of elements, it can be written as a single Avro value with a BlockingBinaryEncoder.

http://avro.apache.org/docs/current/api/java/org/apache/avro/io/BlockingBinaryEncoder.html

All Decoder implementations read array blocks correctly, but none of the standard DatumReader implementations support reading partial arrays. So you could use the Decoder API directly to read your data, or you might extend an existing DatumReader to read partial arrays. For example, you might override GenericDatumReader#readArray() to read only the first N elements and then skip the rest, or to store the array elements externally as they are read.

Doug

On Mon, Dec 15, 2014 at 2:54 PM, yael aharon yael.aharo...@gmail.com wrote:

Hello,
I need to read very large Avro records, where each record is an array which can have hundreds of millions of members. I am concerned about reading a whole record into memory at once. Is there a way to read only a part of the record instead?
thanks, Yael
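For concreteness, here is a rough, hypothetical Java sketch of both sides of what Doug describes; it is not from the original thread. It assumes the value is a bare array of longs encoded without the file container, and names like out, hugeList, and HugeArraySketch are made up for illustration. The Avro calls themselves (EncoderFactory#blockingBinaryEncoder, Decoder#readArrayStart, Decoder#arrayNext) are the ones referenced above.

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class HugeArraySketch {
  static final Schema ARRAY_OF_LONG =
      new Schema.Parser().parse("{\"type\":\"array\",\"items\":\"long\"}");

  // Write side: the blocking encoder flushes the array in blocks as its
  // internal buffer fills, so the whole array is never held in memory.
  static void write(OutputStream out, List<Long> hugeList) throws Exception {
    BinaryEncoder enc = EncoderFactory.get().blockingBinaryEncoder(out, null);
    new GenericDatumWriter<List<Long>>(ARRAY_OF_LONG).write(hugeList, enc);
    enc.flush();
  }

  // Read side: walk the array block by block with the raw Decoder API,
  // keeping only the first 'limit' elements and discarding the rest.
  static long[] readFirstN(InputStream in, int limit) throws Exception {
    BinaryDecoder d = DecoderFactory.get().binaryDecoder(in, null);
    long[] kept = new long[limit];
    int n = 0;
    // readArrayStart() gives the item count of the first block;
    // arrayNext() gives each following block's count (0 means done).
    for (long i = d.readArrayStart(); i != 0; i = d.arrayNext()) {
      for (long j = 0; j < i; j++) {
        long v = d.readLong();
        if (n < limit) kept[n++] = v;
      }
    }
    return Arrays.copyOf(kept, n);
  }
}

Note that readFirstN still decodes and discards the tail of the array; truly skipping it would require knowing how to skip elements of the array's item type, which is what an overridden GenericDatumReader#readArray() could do.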
Avro schema and data read with it.
I have data that is persisted in Avro format. Each record has a certain schema and contains 10 fields when it is persisted. When I read the same record(s) from another process, I also specify a schema, with a subset of the fields (5). Will only 5 columns be read from disk? Or will all the columns be read but 5 later discarded? Or are all the columns read, but only five accessible, since the schema used to read contains only five columns?
Please suggest.
Regards,
Deepak
Re: Avro schema and data read with it.
Avro skips over fields that were present in the writer's schema but are no longer present in the reader's schema. Skipping is substantially faster than reading for most types. For known-size types like string, bytes, fixed, double and float, the file pointer can simply be incremented past skipped values. For skipped structures like records, maps and arrays, no memory is allocated and no stores are made. Avro data files are not in a columnar format, however, so the I/O and decompression of skipped fields is not generally avoided.

Doug
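To make the reader-schema projection concrete, here is a small, hypothetical Java sketch, not from this thread. The record name Rec, the fields id and name, and the file name data.avro are all made up; in real use the reader schema's record and field names must match the writer's (or be connected via aliases) for schema resolution to apply.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ProjectedReadSketch {
  public static void main(String[] args) throws Exception {
    // Reader schema declaring only the two fields we want; Avro skips
    // the writer-only fields while decoding (but still performs the I/O
    // and decompression, since the file is row-oriented, as noted above).
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    // Passing null for the writer's schema: DataFileReader supplies it
    // from the file header.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(null, readerSchema);
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(new File("data.avro"), datumReader)) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("id") + "\t" + rec.get("name"));
      }
    }
  }
}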