If you can make the decision locally, then it should just be performed in the reducer itself:

    if (guard) { output.collect(k, v); }

If you need to know what results will be generated by other calls to reduce() on that same machine, then you'll need to be a bit more clever. If you know that for all the jobs you'll run, your results will always fit in a buffer in RAM, then you can accumulate your values in an ArrayList (or similar) and override Reducer.close() to dump them into the output collector; then call super.close().

If you may need to generate more data than will fit in RAM, or you need the results from multiple nodes considered together, then you almost certainly want a second MapReduce pass. Your first pass should collect() all the results it generates. Then in the second pass, use an identity mapper so that the shuffle sorts the data along some axis and the most desirable data comes first. Then output.collect() this data a second time in the second reducer, discarding the data that doesn't meet your criterion. The input path to your second MR job is the output path from the first one.

- Aaron

On Sun, Jun 14, 2009 at 4:02 PM, Kunsheng Chen <ke...@yahoo.com> wrote:
>
> Hi everyone,
>
> I am doing a map-reduce program, and it is working well.
>
> Now I am thinking of inserting my own algorithm to pick the output results
> after 'Reduce', rather than simply using 'output.collect()' in Reduce to
> output all results.
>
> The only thing I can think of is to read the output file after the JobClient
> finishes and run some Java program on it, but I am not sure whether there is
> a more efficient method provided by Hadoop to handle that.
>
> Any ideas are well appreciated,
>
> -Kun
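The buffer-in-RAM strategy above can be sketched in plain Java. This is not real Hadoop code: `OutputCollector` here is a minimal local stand-in for Hadoop's interface of the same name, and `TopNReducer` and its top-N filter are hypothetical names chosen for illustration. The point is only the pattern: reduce() stashes results instead of collecting them, and close() applies a global decision before emitting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for Hadoop's OutputCollector so the sketch is self-contained.
interface OutputCollector<K, V> {
    void collect(K key, V value);
}

// Buffers one (key, sum) pair per reduce() call, then emits only the N pairs
// with the largest sums when close() is called -- a filter that cannot be
// decided locally inside any single reduce() call.
class TopNReducer {
    private final int n;
    private final List<int[]> buffer = new ArrayList<>(); // {key, sum} pairs
    private OutputCollector<Integer, Integer> output;

    TopNReducer(int n) { this.n = n; }

    // Analogous to Reducer.reduce(): stash the result instead of collecting it.
    void reduce(int key, Iterator<Integer> values,
                OutputCollector<Integer, Integer> out) {
        this.output = out; // remember the collector for use in close()
        int sum = 0;
        while (values.hasNext()) sum += values.next();
        buffer.add(new int[] {key, sum});
    }

    // Analogous to Reducer.close(): now that all keys have been seen,
    // apply the global criterion and emit the survivors.
    void close() {
        buffer.sort(Comparator.comparingInt((int[] p) -> p[1]).reversed());
        for (int i = 0; i < Math.min(n, buffer.size()); i++) {
            output.collect(buffer.get(i)[0], buffer.get(i)[1]);
        }
        // In a real Hadoop Reducer subclass you would call super.close() here.
    }
}
```

Note that this only works when the buffered results fit in RAM; past that point, the second-MapReduce-pass approach described above is the right tool.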
if (guard) { output.collect(k, v); } If you need to know what results will be generated by other calls to reduce() on that same machine, then you'll need to be a bit more clever. If you know that for all jobs you'll run, your results will always fit in a buffer in RAM, then you can put your values in an ArrayList or something and then override Reducer.close() to dump your values into the output collector. Then call super.close(). If you may need to generate more data than will fit in RAM, or you need the results of multiple nodes to conference together, then this means you almost certainly want a second MapReduce pass. Your first pass should collect() all the results it generates. Then in a second pass, use an identity mapper that causes the shuffler to sort the data along some axis so that the most desirable data comes first. Then output.collect() this data a second time in the second reducer, discarding the data that doesn't meet your criterion. The input path to your second MR is the output path from the first one. - Aaron On Sun, Jun 14, 2009 at 4:02 PM, Kunsheng Chen <ke...@yahoo.com> wrote: > > Hi everyone, > > I am doing a map-reduce program, it is working good. > > Now I am thinking of inserting my own algorithm to pick the output results > after 'Reduce' other than simply use 'output.colllect()' in Reduce to output > all results. > > The only thing I could think is read the output file after JobClient > finishing and does some Java program for that, but I am not sure whether > there are efficient method provided by hadoop to handle that. > > > Any idea is well appreciated, > > -Kun > > > >