Hi everybody, I'm looking for a solution to my problem: splitting a group of Avro files by number of records rather than by block size, which is the default.
For the moment, my strategy is:
- Iterate over the records of the input files;
- Create a new InputSplit each time a record limit is reached, and store in it: the file paths, the last sync point met in the first file, and an offset, i.e. the number of records to skip after that sync point before the split starts;
- The record reader opens the first path and seeks to the stored sync point. It then skips the offset number of records by iterating over them, and starts reading the records of the split.

I am obliged to split by records because the MapReduce work, in my case, is computation-centric rather than data-centric. Do you have any better solution?

Pasquale Salza
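To make the split planning described above concrete, here is a minimal sketch in plain Java. It does not use the Avro or Hadoop APIs: each input file is modelled as a list of blocks, where a block is just a sync position plus a record count. The names `RecordSplitPlanner`, `Block`, `Split` and `plan` are hypothetical and only serve the illustration; in a real job the sync positions and block counts would have to come from the Avro file reader, and `Split` would be a Hadoop InputSplit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: plans record-based splits over files modelled as lists of blocks.
class RecordSplitPlanner {

    // One data block: the sync position that precedes it and the number of records it holds.
    static class Block {
        final long syncPosition;
        final int recordCount;
        Block(long syncPosition, int recordCount) {
            this.syncPosition = syncPosition;
            this.recordCount = recordCount;
        }
    }

    // What a split stores: where to seek in the first file, how many records to skip, how many to read.
    static class Split {
        final int firstFile;       // index of the file in which the split starts
        final long syncPosition;   // last sync point met before the first record of the split
        final int skipRecords;     // records to skip after the sync point
        int lastFile;              // index of the last file the split touches
        int recordCount;           // number of records belonging to the split
        Split(int firstFile, long syncPosition, int skipRecords) {
            this.firstFile = firstFile;
            this.lastFile = firstFile;
            this.syncPosition = syncPosition;
            this.skipRecords = skipRecords;
        }
        @Override
        public String toString() {
            return "files " + firstFile + ".." + lastFile + ", sync=" + syncPosition
                    + ", skip=" + skipRecords + ", records=" + recordCount;
        }
    }

    // Walks every record of every file and closes a split each time 'limit' records are collected.
    static List<Split> plan(List<List<Block>> files, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        List<Split> splits = new ArrayList<>();
        Split current = null;
        for (int f = 0; f < files.size(); f++) {
            for (Block block : files.get(f)) {
                for (int r = 0; r < block.recordCount; r++) {
                    if (current == null) {
                        // A new split starts at record r of this block.
                        current = new Split(f, block.syncPosition, r);
                    }
                    current.recordCount++;
                    current.lastFile = f;
                    if (current.recordCount == limit) {
                        splits.add(current);
                        current = null;
                    }
                }
            }
        }
        if (current != null) {
            splits.add(current); // last split may hold fewer than 'limit' records
        }
        return splits;
    }

    public static void main(String[] args) {
        // Two made-up files: the first has blocks of 3 and 2 records, the second one block of 4.
        List<List<Block>> files = Arrays.asList(
                Arrays.asList(new Block(16L, 3), new Block(120L, 2)),
                Arrays.asList(new Block(16L, 4)));
        for (Split s : plan(files, 4)) {
            System.out.println(s);
        }
    }
}
```

With the made-up input in `main` and a limit of 4 records, the 9 records end up in three splits: the second one starts in the first file at sync position 120 with one record to skip and continues into the second file, and the third one holds only the single remaining record. The per-record loop mirrors the "iterate over the records" step; if the record count of each block is known up front, the same result could be computed block by block without touching every record.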