MapReduce and Avro split by number of records

Pasquale Salza Thu, 06 Feb 2014 13:54:37 -0800

Hi everybody,
I'm looking for a way to split a group of Avro files by number of records
rather than by block size, which is the default.


For the moment, my strategy is:
- Iterate over the records of the input files;
- Create a new InputSplit whenever the record limit is reached, and store
in it: the file paths, the last sync point met in the first file, and an
offset, i.e. the number of records to skip after that sync point before
reading starts;
- The record reader opens the first path and seeks to the stored sync
point. It then skips the offset number of records by iterating over them
and starts reading the records of the split.
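
To make the planning step above concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions: Block, FileBlocks, Split and plan are hypothetical names, and each file is modelled as an in-memory list of (sync position, record count) pairs so that the example runs without Avro or Hadoop on the classpath. In a real job that list would have to be obtained from Avro itself, for instance by walking each file with DataFileReader and noting previousSync() and getBlockCount() for each block.

```java
import java.util.ArrayList;
import java.util.List;

class RecordSplitPlanner {

    /** One Avro block: the sync position preceding it and its record count. */
    static final class Block {
        final long syncPosition;
        final long recordCount;
        Block(long syncPosition, long recordCount) {
            this.syncPosition = syncPosition;
            this.recordCount = recordCount;
        }
    }

    /** Hypothetical summary of one input file as an ordered list of blocks. */
    static final class FileBlocks {
        final String path;
        final List<Block> blocks;
        FileBlocks(String path, List<Block> blocks) {
            this.path = path;
            this.blocks = blocks;
        }
    }

    /** What a split stores: paths, sync point in the first file, records to skip, records to read. */
    static final class Split {
        final List<String> paths;
        final long syncPosition;
        final long skip;
        final long length;
        Split(List<String> paths, long syncPosition, long skip, long length) {
            this.paths = paths;
            this.syncPosition = syncPosition;
            this.skip = skip;
            this.length = length;
        }
        @Override public String toString() {
            return paths + " sync=" + syncPosition + " skip=" + skip + " length=" + length;
        }
    }

    /** Walks all records in order and closes a split every recordsPerSplit records. */
    static List<Split> plan(List<FileBlocks> files, long recordsPerSplit) {
        if (recordsPerSplit <= 0) {
            throw new IllegalArgumentException("recordsPerSplit must be positive");
        }
        List<Split> splits = new ArrayList<>();
        List<String> paths = new ArrayList<>();
        long startSync = 0, startSkip = 0, count = 0;
        for (FileBlocks file : files) {
            for (Block block : file.blocks) {
                for (long i = 0; i < block.recordCount; i++) {
                    if (count == 0) {
                        // First record of a new split: remember where it starts.
                        paths = new ArrayList<>();
                        startSync = block.syncPosition;
                        startSkip = i;
                    }
                    if (paths.isEmpty() || !paths.get(paths.size() - 1).equals(file.path)) {
                        paths.add(file.path);
                    }
                    count++;
                    if (count == recordsPerSplit) {
                        splits.add(new Split(paths, startSync, startSkip, count));
                        count = 0;
                    }
                }
            }
        }
        if (count > 0) {
            // Leftover records form a final, shorter split.
            splits.add(new Split(paths, startSync, startSkip, count));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<FileBlocks> files = List.of(
            new FileBlocks("a.avro", List.of(new Block(16, 3), new Block(100, 3))),
            new FileBlocks("b.avro", List.of(new Block(16, 2))));
        for (Split s : plan(files, 4)) {
            System.out.println(s);
        }
    }
}
```

On the reading side, each split would then be consumed as described: seek to syncPosition in the first path, discard skip records, and read length records, moving on to the next path when a file is exhausted. Note that this sketch visits every record just to count it; since each block knows its own record count, the same boundaries could be computed block by block without touching individual records.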

I am obliged to split by number of records because the MapReduce work, in
my case, is computation-centric rather than data-centric.

Do you have any better solution?

Pasquale Salza
