I worked out how to use the aggregation service with streaming to do
this; it is entertainingly simple once you have figured it out.

Full details will be in ch08 of my book - buy a copy so I can afford to
write another :)

/tmp/numbers contains a file full of whitespace-separated whole numbers.
/tmp/LongMax.pl is the attached Perl script. The output will be a single
file, part-00000, in /tmp/numbers_max_output. Note: this job is run
using the local runner (-jt local), so only one reduce is allowed.

hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
  -jt local -fs file:/// \
  -input /tmp/numbers -output /tmp/numbers_max_output \
  -reducer aggregate -mapper LongMax.pl -file /tmp/LongMax.pl
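The attached script is not reproduced here, but a minimal mapper along
these lines satisfies the aggregate reducer's contract (this is a
sketch, not the actual LongMax.pl; it relies on the aggregate package's
LongValueMax function, and the "max" key name is arbitrary):

    #!/usr/bin/perl -w
    # Sketch of a streaming mapper for '-reducer aggregate'.
    # Tracks the largest whole number in this map task's input, then
    # emits one LongValueMax record; the aggregate reducer keeps the
    # maximum of the per-task maxima.
    use strict;

    my $max;
    while (my $line = <STDIN>) {
        for my $n (split ' ', $line) {
            next unless $n =~ /^-?\d+$/;   # skip anything that is not a whole number
            $max = $n if !defined($max) || $n > $max;
        }
    }
    print "LongValueMax:max\t$max\n" if defined $max;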
On Tue, Apr 21, 2009 at 7:42 PM, jason hadoop <jason.had...@gmail.com> wrote:

> There is no reason to use a combiner in this case, as there is only a
> single output record from the map.
>
> Combiners buy you data reduction when you have output values in your
> map that share keys, and your application allows you to do something
> with the values that results in smaller/fewer records being passed to
> the reduce.
>
> On Mon, Apr 20, 2009 at 4:24 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
>> Hey Jason,
>>
>> Wouldn't this be avoided if you used a combiner to also perform the
>> max() operation? A minimal amount of data would be written over the
>> network.
>>
>> I can't remember if the map output gets written to disk first and the
>> combine applied afterward, or if the combine is applied and then the
>> data is written to disk. I suspect the latter, but it'd be a big
>> difference.
>>
>> However, the original poster mentioned he was using hbase/pig --
>> certainly, there's some better way to perform max() in hbase/pig?
>> This list probably isn't the right place to ask if you are using
>> those technologies; I'd suspect they do something more clever
>> (certainly, you're performing a SQL-like operation in MapReduce; not
>> always the best way to approach this type of problem).
>>
>> Brian
>>
>> On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:
>>
>>> The Hadoop framework requires that a map phase be run before the
>>> reduce phase.
>>> By doing the initial 'reduce' in the map, a much smaller volume of
>>> data has to flow across the network to the reduce tasks.
>>> But yes, this could simply be done by using an IdentityMapper and
>>> then having all of the work done in the reduce.
>>>
>>> On Mon, Apr 20, 2009 at 4:26 AM, Shevek <had...@anarres.org> wrote:
>>>
>>>> On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
>>>>
>>>>> The traditional approach would be a Mapper class that maintains a
>>>>> member variable holding the max-value record; in the close method
>>>>> of your mapper you output a single record containing that value.
>>>>
>>>> Perhaps you can forgive the question from a heathen, but why is
>>>> this first mapper not also a reducer? It seems to me that it is
>>>> performing a reduce operation, and that maps should
>>>> (philosophically speaking) not maintain data from one input to the
>>>> next, since the order (and location) of inputs is not well defined.
>>>> The program to compute a maximum should then be a tree of reduction
>>>> operations, with no maps at all.
>>>>
>>>> Of course in this instance, what you propose works, but it does
>>>> seem puzzling. Perhaps the answer is a simple architectural
>>>> limitation?
>>>>
>>>> S.
>>>>
>>>>> The map method of course compares the current record against the
>>>>> max and stores current in max when current is larger than max.
>>>>>
>>>>> Then each map output is a single record, and the reduce behaves
>>>>> very similarly, in that the close method outputs the final max
>>>>> record. A single reduce would be the simplest.
>>>>>
>>>>> On your question: a Mapper and a Reducer define three entry points
>>>>> -- configure, called once on task start; map/reduce, called once
>>>>> for each record; and close, called once after the last call to
>>>>> map/reduce.
>>>>> At least through 0.19, the close is not provided with the output
>>>>> collector or the reporter, so you need to save them in the
>>>>> map/reduce method.
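(In code, the close-method approach described above comes out roughly as
follows -- a sketch against the 0.19-era org.apache.hadoop.mapred API,
with illustrative class and key names:)

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: keep the running max in a member variable and emit it
    // once, from close().
    public class MaxMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private long max = Long.MIN_VALUE;
      private boolean sawAny = false;
      // Through 0.19, close() is not handed the collector, so save it.
      private OutputCollector<Text, LongWritable> out;

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        out = output;                        // remembered for close()
        for (String tok : value.toString().trim().split("\\s+")) {
          if (tok.length() == 0) continue;
          long current = Long.parseLong(tok);
          sawAny = true;
          if (current > max) max = current;  // keep only the largest
        }
      }

      @Override
      public void close() throws IOException {
        if (sawAny) {                        // one record per map task
          out.collect(new Text("max"), new LongWritable(max));
        }
      }
    }

The reducer is the same pattern: track the max across all incoming
records and emit it from close(), with the job configured for a single
reduce.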
>>>>> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com> wrote:
>>>>>
>>>>>> How do you identify that the map task is ending within the map
>>>>>> method? Is it possible to know which is the last call to the map
>>>>>> method?
>>>>>>
>>>>>> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>
>>>>>>> I jumped into Hadoop at the 'deep end'. I know pig, hive, and
>>>>>>> hbase support the ability to max(). I am writing my own max()
>>>>>>> over a simple one-column dataset.
>>>>>>>
>>>>>>> The best solution I came up with was using MapRunner. With
>>>>>>> MapRunner I can store the highest value in a private member
>>>>>>> variable. I can read through the entire data set and only have
>>>>>>> to emit one value per mapper upon completion of the map data.
>>>>>>> Then I can specify one reducer and carry out the same operation.
>>>>>>>
>>>>>>> Does anyone have a better tactic? I thought a counter could do
>>>>>>> this, but are they atomic?

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
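For completeness, the MapRunner variant Edward describes comes out along
these lines (again a sketch against the 0.19 mapred API, with
illustrative names; run() is handed the whole input split, so the end of
the input is simply the end of the loop, with no close() bookkeeping
needed):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapRunnable;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch of the MapRunner variant: run() consumes the whole split,
    // so "after the last record" is simply the code after the loop.
    public class MaxMapRunner
        implements MapRunnable<LongWritable, Text, Text, LongWritable> {

      public void configure(JobConf job) { }

      public void run(RecordReader<LongWritable, Text> input,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        LongWritable key = input.createKey();
        Text value = input.createValue();
        long max = Long.MIN_VALUE;
        boolean sawAny = false;

        while (input.next(key, value)) {
          for (String tok : value.toString().trim().split("\\s+")) {
            if (tok.length() == 0) continue;
            long current = Long.parseLong(tok);
            sawAny = true;
            if (current > max) max = current;
          }
        }
        if (sawAny) {
          // One output record per input split.
          output.collect(new Text("max"), new LongWritable(max));
        }
      }
    }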