Yes, I considered Shevek's tactic as well, but as Jason pointed out, emitting the entire data set just to find the maximum value would be wasteful. You do not want to sort the dataset; you just want to break it into parts, find the max value of each part, then bring those results together and perform the same operation again.

The way I look at it, the 'best' Hadoop algorithms are the ones that emit the fewest key pairs. What Jason suggested and the MapRunner concept I was looking at would both emit about the same number of key pairs. I am curious to see whether the MapRunner implementation would run faster due to fewer calls to the map function. After all, MapRunner only iterates over the data set.
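For reference, here is roughly what I have in mind for the MapRunner version, written against the 0.19-era mapred API. This is only a sketch under my own assumptions: a one-column text input of long values, and the class name and the constant output key "max" are placeholders.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Walks the whole split itself and emits exactly one (key, max)
// pair per map task once the input is exhausted.
public class MaxMapRunner
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    LongWritable offset = input.createKey();
    Text line = input.createValue();
    long max = Long.MIN_VALUE;
    boolean seen = false;

    while (input.next(offset, line)) {
      long current = Long.parseLong(line.toString().trim());
      if (!seen || current > max) {
        max = current;
        seen = true;
      }
      reporter.progress(); // keep the task alive on big splits
    }

    if (seen) {
      // One constant key routes every per-split max to the one reducer.
      output.collect(new Text("max"), new LongWritable(max));
    }
  }
}

In the driver it would be wired up with conf.setMapRunnerClass(MaxMapRunner.class) and conf.setNumReduceTasks(1).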
On Mon, Apr 20, 2009 at 8:25 AM, jason hadoop <jason.had...@gmail.com> wrote:
> The Hadoop Framework requires that a Map Phase be run before the Reduce
> Phase.
> By doing the initial 'reduce' in the map, a much smaller volume of data
> has to flow across the network to the reduce tasks.
> But yes, this could simply be done by using an IdentityMapper and then
> having all of the work done in the reduce.
>
> On Mon, Apr 20, 2009 at 4:26 AM, Shevek <had...@anarres.org> wrote:
>> On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
>> > The traditional approach would be a Mapper class that maintained a
>> > member variable in which you kept the max value record, and in the
>> > close method of your mapper you output a single record containing
>> > that value.
>>
>> Perhaps you can forgive the question from a heathen, but why is this
>> first mapper not also a reducer? It seems to me that it is performing a
>> reduce operation, and that maps should (philosophically speaking) not
>> maintain data from one input to the next, since the order (and
>> location) of inputs is not well defined. The program to compute a
>> maximum should then be a tree of reduction operations, with no maps at
>> all.
>>
>> Of course in this instance, what you propose works, but it does seem
>> puzzling. Perhaps the answer is a simple architectural limitation?
>>
>> S.
>>
>> > The map method of course compares the current record against the max
>> > and stores current in max when current is larger than max.
>> >
>> > Then each map output is a single record and the reduce behaves very
>> > similarly, in that the close method outputs the final max record. A
>> > single reduce would be the simplest.
>> >
>> > On your question: a Mapper and a Reducer define three entry points.
>> > configure is called once on task start; map/reduce is called once
>> > for each record; and close is called once after the last call to
>> > map/reduce.
>> > At least through 0.19, the close is not provided with the output
>> > collector or the reporter, so you need to save them in the
>> > map/reduce method.
>> >
>> > On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com> wrote:
>> > > How do you identify that the map task is ending within the map
>> > > method? Is it possible to know which is the last call to the map
>> > > method?
>> > >
>> > > On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> > > > I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and
>> > > > HBase support max(). I am writing my own max() over a simple
>> > > > one-column dataset.
>> > > >
>> > > > The best solution I came up with was using MapRunner. With
>> > > > MapRunner I can store the highest value in a private member
>> > > > variable. I can read through the entire data set and only have
>> > > > to emit one value per mapper upon completion of the map data.
>> > > > Then I can specify one reducer and carry out the same operation.
>> > > >
>> > > > Does anyone have a better tactic? I thought a counter could do
>> > > > this, but are they atomic?
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
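P.S. To make the comparison concrete, here is roughly what Jason's member-variable approach looks like against the same API. Again, only a sketch under the same assumptions (one-column text input of longs; the constant key "max" and the class names are placeholders):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxValue {

  // Keeps the running max in a member variable and emits a single
  // record from close(). Through 0.19, close() gets no collector or
  // reporter, so the collector handed to map() is saved for later.
  public static class MaxMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private long max = Long.MIN_VALUE;
    private boolean seen = false;
    private OutputCollector<Text, LongWritable> savedOutput;

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      savedOutput = output; // saved for use in close()
      long current = Long.parseLong(line.toString().trim());
      if (!seen || current > max) {
        max = current;
        seen = true;
      }
    }

    @Override
    public void close() throws IOException {
      if (seen) {
        // one record per map task
        savedOutput.collect(new Text("max"), new LongWritable(max));
      }
    }
  }

  // With a constant key and a single reduce task, one reduce() call
  // sees every per-mapper max, so no close() trick is needed here.
  public static class MaxReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
      long max = Long.MIN_VALUE;
      while (values.hasNext()) {
        max = Math.max(max, values.next().get());
      }
      output.collect(key, new LongWritable(max));
    }
  }
}

Because every map task emits under the same constant key and the job runs a single reduce task, all of the per-mapper maxima arrive in one reduce() call, so the close() trick is only needed on the map side.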