How about a different approach: if you use the multiple-output option (e.g. Hadoop's MultipleOutputs), you can process the valid lines the normal way and write the invalid lines to a special separate output file. On Apr 18, 2013 9:36 PM, "Matthias Scherer" <matthias.sche...@1und1.de> wrote:
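The routing logic behind that suggestion is simple: each line either passes a validity check and goes to the normal output, or fails and is written to a side output tagged with its source file, so a later cleansing step knows which file to repair. Here is a minimal plain-Java sketch of that idea; the `isValid` rule and the file name are hypothetical placeholders, and in a real job the two lists would be replaced by MultipleOutputs streams:

```java
import java.util.ArrayList;
import java.util.List;

public class LineRouter {
    // Hypothetical validity rule: a valid row has exactly three
    // tab-separated fields. Replace with the job's real check.
    static boolean isValid(String line) {
        return line.split("\t", -1).length == 3;
    }

    final List<String> validOut = new ArrayList<>();   // stand-in for the normal output
    final List<String> invalidOut = new ArrayList<>(); // stand-in for the side output

    // Route each input line; invalid lines are prefixed with their
    // source file name so the cleansing process can find the file.
    void route(String sourceFile, Iterable<String> lines) {
        for (String line : lines) {
            if (isValid(line)) {
                validOut.add(line);
            } else {
                invalidOut.add(sourceFile + "\t" + line);
            }
        }
    }

    public static void main(String[] args) {
        LineRouter r = new LineRouter();
        r.route("part-0001", List.of("a\tb\tc", "broken-row", "d\te\tf"));
        System.out.println(r.validOut.size() + " valid, " + r.invalidOut.size() + " invalid");
    }
}
```

With this split, the invalid-lines file can be fed to the cleansing process and re-submitted on the next run, without any counter bookkeeping in the job.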
> Hi all,
>
> In my mapreduce job, I would like to process only whole input files
> containing only valid rows. If one map task processing an input split of a
> file detects an invalid row, the whole file should be “marked” as invalid
> and not processed at all. This input file will then be cleansed by another
> process and taken again as input to the next run of my mapreduce job.
>
> My first idea was to set a counter in the mapper after detecting an
> invalid line, with the name of the file as the counter name (derived from
> the input split). Then additionally put the input filename into the map
> output value (which is already a MapWritable, so adding the filename is no
> problem). In the reducer I could then filter out any rows belonging to the
> counters written in the mappers.
>
> Each job has some thousand input files, so in the worst case there could
> be as many counters written to mark invalid input files. Is this a feasible
> approach? Does the framework guarantee that all counters written in the
> mappers are synchronized (visible) in the reducers? And could this number
> of counters lead to an OOME in the jobtracker?
>
> Are there better approaches? I could also process the files using a
> non-splittable input format. Is there a way to reject the already-outputted
> rows of the map task processing an input split?
>
> Thanks,
> Matthias