Hi all,

In my MapReduce job, I would like to process only whole input files in which
every row is valid. If a map task processing an input split of a file detects
an invalid row, the whole file should be "marked" as invalid and not processed
at all. Such an input file would then be cleansed by another process and taken
as input again in the next run of my MapReduce job.

My first idea was this: whenever the mapper detects an invalid line, it sets a
counter whose name is the name of the input file (derived from the input
split). Additionally, the mapper adds the input filename to the map output
value (which is already a MapWritable, so adding the filename is no problem).
In the reducer I could then filter out all rows belonging to files for which
such a counter was written in the map phase.
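
For illustration, here is a rough (untested) sketch of the mapper side; the
class name, the output key, the "srcFile" entry and the isValid() check are
just placeholders for my real logic:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ValidatingMapper
        extends Mapper<LongWritable, Text, Text, MapWritable> {

    private static final Text SRC_FILE = new Text("srcFile");

    private String fileName;

    @Override
    protected void setup(Context context) {
        // Derive the input file name from the split this task is processing.
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (!isValid(line)) {
            // One counter per invalid file, named after the file itself.
            context.getCounter("InvalidFiles", fileName).increment(1);
            return;
        }
        MapWritable value = new MapWritable();
        value.put(new Text("row"), new Text(line));
        // Tag every output row with its source file so the reducer can
        // drop rows belonging to files that were marked invalid.
        value.put(SRC_FILE, new Text(fileName));
        context.write(new Text(fileName), value);
    }

    // Placeholder check; the real job applies its own row validation here.
    private boolean isValid(Text line) {
        return line.getLength() > 0;
    }
}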

Each job has a few thousand input files, so in the worst case there could be
that many counters written to mark invalid input files. Is this a feasible
approach? Does the framework guarantee that all counters written in the
mappers are synchronized with (i.e. visible to) the reducers? And could this
number of counters lead to an OOME in the jobtracker?

Are there better approaches? I could also process the files using a
non-splittable input format (sketched below), so that each file is read by
exactly one map task. Is there a way to discard the rows a map task has
already emitted once it hits an invalid line in its input split?
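
The non-splittable variant would be simple, something like this (sketch, the
class name is made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Never split files: each input file is processed by exactly one map task.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}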

Thanks,
Matthias
