Re: Reading huge records

2014-12-17 Thread yael aharon
Thank you very much!


On Tue, Dec 16, 2014 at 12:34 PM, Doug Cutting cutt...@apache.org wrote:

 Avro does permit partial reading of arrays.

 Arrays are written as a series of length-prefixed blocks:

 http://avro.apache.org/docs/current/spec.html#binary_encode_complex

 The standard encoders do not write arrays as multiple blocks, but
 BlockingBinaryEncoder does.  It can be used with any DatumWriter
 implementation.  If you, for example, have an array whose
 implementation is backed by a database and contains billions of
 elements, it can be written as a single Avro value with a
 BlockingBinaryEncoder.


 http://avro.apache.org/docs/current/api/java/org/apache/avro/io/BlockingBinaryEncoder.html
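
 For example, a rough sketch (untested; the file name and the small
 in-memory list are placeholders, since a real caller would stream
 elements from its backing store rather than hold them in a List):

 import java.io.FileOutputStream;
 import java.io.OutputStream;
 import java.util.Arrays;
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericDatumWriter;
 import org.apache.avro.io.BinaryEncoder;
 import org.apache.avro.io.EncoderFactory;

 public class BlockingWriteSketch {
   public static void main(String[] args) throws Exception {
     Schema schema = new Schema.Parser().parse(
         "{\"type\": \"array\", \"items\": \"long\"}");
     try (OutputStream out = new FileOutputStream("huge-array.bin")) {
       // blockingBinaryEncoder writes arrays as a series of
       // length-prefixed blocks, so the whole array never has to
       // fit in memory at once.
       BinaryEncoder enc =
           EncoderFactory.get().blockingBinaryEncoder(out, null);
       GenericDatumWriter<Object> writer =
           new GenericDatumWriter<>(schema);
       writer.write(Arrays.asList(1L, 2L, 3L), enc);
       enc.flush();
     }
   }
 }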

 All Decoder implementations read array blocks correctly, but none of
 the standard DatumReader implementations support reading of partial
 arrays.  So you could use the Decoder API directly to read your data,
 or you might extend an existing DatumReader to read partial arrays.
 For example, you might override GenericDatumReader#readArray() to only
 read the first N elements, then skip the rest.  Or all the array
 elements might be stored externally as they are read.
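
 Reading with the Decoder API directly might look roughly like this
 (untested sketch; the file name and element type are assumptions, and
 the stream is simply abandoned once enough elements have been read):

 import java.io.FileInputStream;
 import java.io.InputStream;
 import org.apache.avro.io.BinaryDecoder;
 import org.apache.avro.io.DecoderFactory;

 public class PartialArrayReadSketch {
   public static void main(String[] args) throws Exception {
     final long wanted = 1000;  // read only the first N elements
     long read = 0;
     try (InputStream in = new FileInputStream("huge-array.bin")) {
       BinaryDecoder dec = DecoderFactory.get().binaryDecoder(in, null);
       // Arrays arrive as a series of blocks; readArrayStart() and
       // arrayNext() each return the item count of the next block.
       outer:
       for (long n = dec.readArrayStart(); n > 0; n = dec.arrayNext()) {
         for (long i = 0; i < n; i++) {
           process(dec.readLong());
           if (++read >= wanted) {
             break outer;  // ignore the remaining blocks
           }
         }
       }
     }
   }

   static void process(long value) {
     // placeholder for real per-element handling
   }
 }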

 Doug


 On Mon, Dec 15, 2014 at 2:54 PM, yael aharon yael.aharo...@gmail.com
 wrote:
  Hello,
  I need to read very large Avro records, where each record is an array
  that can have hundreds of millions of members. I am concerned about
  reading a whole record into memory at once.
  Is there a way to read only part of the record instead?
  thanks, Yael



Avro schema and data read with it.

2014-12-17 Thread ๏̯͡๏
I have data that is persisted in Avro format. Each record conforms to a
certain schema and contains 10 fields when it is persisted.

When I read the same record(s) from another process, I also specify a
schema, with a subset of the fields (5).

Will only the 5 columns be read from disk?
or
Will all the columns be read but 5 later discarded?
or
Are all the columns read but only five accessible, since the schema used
to read contains only five columns?

Please suggest.

Regards,
Deepak


Re: Avro schema and data read with it.

2014-12-17 Thread Doug Cutting
Avro skips over fields that were present in the writer's schema but
are no longer present in the reader's schema.  Skipping is
substantially faster than reading for most types.  For known-size
types like string, bytes, fixed, double, and float, the file pointer
can simply be advanced past skipped values.  For skipped structures
like records, maps, and arrays, no memory is allocated and no stores
are made.  Avro data files are not in a columnar format, however, so
the I/O and decompression of skipped fields is not generally avoided.
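
A reader schema that keeps only a subset of the writer's fields can be
passed to GenericDatumReader to get this projection behavior.  A rough
sketch (the record and field names, and the file, are made up; note the
reader schema's record name must match the writer's for resolution):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ProjectedReadSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical 2-field projection of a 10-field writer schema.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
        + "{\"name\": \"id\", \"type\": \"long\"},"
        + "{\"name\": \"name\", \"type\": \"string\"}]}");
    // The writer's schema is taken from the file's header; the reader
    // schema determines which fields are materialized vs. skipped.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(null, readerSchema);
    try (DataFileReader<GenericRecord> fileReader =
             new DataFileReader<>(new File("events.avro"), datumReader)) {
      for (GenericRecord rec : fileReader) {
        System.out.println(rec.get("id") + "\t" + rec.get("name"));
      }
    }
  }
}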

Doug

On Wed, Dec 17, 2014 at 7:53 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 I have data that is persisted in Avro format. Each record conforms to a
 certain schema and contains 10 fields when it is persisted.

 When I read the same record(s) from another process, I also specify a
 schema, with a subset of the fields (5).

 Will only the 5 columns be read from disk?
 or
 Will all the columns be read but 5 later discarded?
 or
 Are all the columns read but only five accessible, since the schema used
 to read contains only five columns?

 Please suggest.

 Regards,
 Deepak