Re: Reading huge records
Thank you very much!

On Tue, Dec 16, 2014 at 12:34 PM, Doug Cutting cutt...@apache.org wrote:

Avro does permit partial reading of arrays. Arrays are written as a series of length-prefixed blocks:

http://avro.apache.org/docs/current/spec.html#binary_encode_complex

The standard encoders do not write arrays as multiple blocks, but BlockingBinaryEncoder does. It can be used with any DatumWriter implementation. If, for example, you have an array whose implementation is backed by a database and contains billions of elements, it can be written as a single Avro value with a BlockingBinaryEncoder.

http://avro.apache.org/docs/current/api/java/org/apache/avro/io/BlockingBinaryEncoder.html

All Decoder implementations read array blocks correctly, but none of the standard DatumReader implementations support reading partial arrays. So you could use the Decoder API directly to read your data, or you might extend an existing DatumReader to read partial arrays. For example, you might override GenericDatumReader#readArray() to read only the first N elements and then skip the rest, or to store the array elements externally as they are read.

Doug

On Mon, Dec 15, 2014 at 2:54 PM, yael aharon yael.aharo...@gmail.com wrote:

Hello,
I need to read very large Avro records, where each record is an array which can have hundreds of millions of members. I am concerned about reading a whole record into memory at once. Is there a way to read only a part of the record instead?
thanks, Yael
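For concreteness, here is a rough, hypothetical Java sketch of both sides of what Doug describes; it is not from the original thread. It assumes the value is a bare array of longs encoded without the file container, and names like out, hugeList, and HugeArraySketch are made up for illustration. The Avro calls themselves (EncoderFactory#blockingBinaryEncoder, Decoder#readArrayStart, Decoder#arrayNext) are the ones referenced above.

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class HugeArraySketch {
  static final Schema ARRAY_OF_LONG =
      new Schema.Parser().parse("{\"type\":\"array\",\"items\":\"long\"}");

  // Write side: the blocking encoder flushes the array in blocks as its
  // internal buffer fills, so the whole array is never held in memory.
  static void write(OutputStream out, List<Long> hugeList) throws Exception {
    BinaryEncoder enc = EncoderFactory.get().blockingBinaryEncoder(out, null);
    new GenericDatumWriter<List<Long>>(ARRAY_OF_LONG).write(hugeList, enc);
    enc.flush();
  }

  // Read side: walk the array block by block with the raw Decoder API,
  // keeping only the first 'limit' elements and discarding the rest.
  static long[] readFirstN(InputStream in, int limit) throws Exception {
    BinaryDecoder d = DecoderFactory.get().binaryDecoder(in, null);
    long[] kept = new long[limit];
    int n = 0;
    // readArrayStart() gives the item count of the first block;
    // arrayNext() gives each following block's count (0 means done).
    for (long i = d.readArrayStart(); i != 0; i = d.arrayNext()) {
      for (long j = 0; j < i; j++) {
        long v = d.readLong();
        if (n < limit) kept[n++] = v;
      }
    }
    return Arrays.copyOf(kept, n);
  }
}

Note that readFirstN still decodes and discards the tail of the array; truly skipping it would require knowing how to skip elements of the array's item type, which is what an overridden GenericDatumReader#readArray() could do.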
Avro schema and data read with it.
I have data that is persisted in Avro format. Each record has a certain schema and contains 10 fields when it is persisted. When I read the same record(s) from another process, I also specify a schema, with a subset of the fields (5). Will only 5 columns be read from disk? Or will all the columns be read but 5 later discarded? Or are all the columns read, but only five accessible, since the schema used to read contains only five columns?
Please suggest.
Regards,
Deepak
Re: Avro schema and data read with it.
Avro skips over fields that were present in the writer's schema but are no longer present in the reader's schema. Skipping is substantially faster than reading for most types. For known-size types like string, bytes, fixed, double and float, the file pointer can simply be incremented past skipped values. For skipped structures like records, maps and arrays, no memory is allocated and no stores are made. Avro data files are not in a columnar format, however, so the I/O and decompression of skipped fields is not generally avoided.

Doug
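To make the reader-schema projection concrete, here is a small, hypothetical Java sketch, not from this thread. The record name Rec, the fields id and name, and the file name data.avro are all made up; in real use the reader schema's record and field names must match the writer's (or be connected via aliases) for schema resolution to apply.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ProjectedReadSketch {
  public static void main(String[] args) throws Exception {
    // Reader schema declaring only the two fields we want; Avro skips
    // the writer-only fields while decoding (but still performs the I/O
    // and decompression, since the file is row-oriented, as noted above).
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    // Passing null for the writer's schema: DataFileReader supplies it
    // from the file header.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(null, readerSchema);
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(new File("data.avro"), datumReader)) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("id") + "\t" + rec.get("name"));
      }
    }
  }
}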