On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> Hey Jason,
>
> Wouldn't this be avoided if you used a combiner to also perform the max()
> operation? A minimal amount of data would be written over the network.
>
> I can't remember if the map output gets written to disk first, then combine
> applied or if the combine is applied and then the data is written to disk.
> I suspect the latter, but it'd be a big difference.
>
> However, the original poster mentioned he was using hbase/pig -- certainly,
> there's some better way to perform max() in hbase/pig? This list probably
> isn't the right place to ask if you are using those technologies; I'd
> suspect they do something more clever (certainly, you're performing a
> SQL-like operation in MapReduce; not always the best way to approach this
> type of problem).
>
> Brian
>
> On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:
>
>> The Hadoop Framework requires that a Map Phase be run before the Reduce
>> Phase.
>> By doing the initial 'reduce' in the map, a much smaller volume of data
>> has to flow across the network to the reduce tasks.
>> But yes, this could simply be done by using an IdentityMapper and then
>> have all of the work done in the reduce.
>>
>> On Mon, Apr 20, 2009 at 4:26 AM, Shevek <had...@anarres.org> wrote:
>>
>>> On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
>>>>
>>>> The traditional approach would be a Mapper class that maintained a
>>>> member variable that you kept the max value record, and in the close
>>>> method of your mapper you output a single record containing that value.
>>>
>>> Perhaps you can forgive the question from a heathen, but why is this
>>> first mapper not also a reducer? It seems to me that it is performing a
>>> reduce operation, and that maps should (philosophically speaking) not
>>> maintain data from one input to the next, since the order (and location)
>>> of inputs is not well defined. The program to compute a maximum should
>>> then be a tree of reduction operations, with no maps at all.
>>>
>>> Of course in this instance, what you propose works, but it does seem
>>> puzzling. Perhaps the answer is simple architectural limitation?
>>>
>>> S.
>>>
>>>> The map method of course compares the current record against the max and
>>>> stores current in max when current is larger than max.
>>>>
>>>> Then each map output is a single record and the reduce behaves very
>>>> similarly, in that the close method outputs the final max record. A
>>>> single reduce would be the simplest.
>>>>
>>>> On your question a Mapper and Reducer defines 3 entry points, configure,
>>>> called once on on task start, the map/reduce called once for each
>>>> record, and close, called once after the last call to map/reduce.
>>>> at least through 0.19, the close is not provided with the output
>>>> collector or the reporter, so you need to save them in the map/reduce
>>>> method.
>>>>
>>>> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com> wrote:
>>>>
>>>>> How do you identify that map task is ending within the map method? Is it
>>>>> possible to know which is the last call to map method?
>>>>>
>>>>> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>
>>>>>> I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
>>>>>> support the ability to max(). I am writing my own max() over a simple
>>>>>> one column dataset.
>>>>>>
>>>>>> The best solution I came up with was using MapRunner. With maprunner I
>>>>>> can store the highest value in a private member variable. I can read
>>>>>> through the entire data set and only have to emit one value per mapper
>>>>>> upon completion of the map data. Then I can specify one reducer and
>>>>>> carry out the same operation.
>>>>>>
>>>>>> Does anyone have a better tactic. I thought a counter could do this
>>>>>> but are they atomic?
>>
>> --
>> Alpha Chapters of my book on Hadoop are available
>> http://www.apress.com/book/view/9781430219422

I took a look at the description of the book (http://www.apress.com/book/view/9781430219422). Hopefully it and other endeavors like it can fill a need I have and see quite often. I am quite interested in practical Hadoop algorithms. Most of my searching finds repeated WordCount examples and depictions of the shuffle-sort. The most practical lessons I took from my programming with Fortran were how to sum(), min(), max(), and average() a data set. If Hadoop had a cookbook of sorts for algorithm design, I think many people would benefit.
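
As an example of the kind of recipe such a cookbook might hold, here is a minimal sketch of the max() pattern discussed in the quoted thread: each map task keeps a running maximum in a member variable and emits a single record when it closes, and one reducer repeats the same operation over those partial results. This is only a simulation in plain Java collections, not code against the Hadoop API. The names MaxSketch, LocalMaxMapper, runMapTask, and maxOfSplits are hypothetical, and the lists of longs stand in for input splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaxSketch {

    /** Hypothetical stand-in for one map task; this is not a Hadoop class. */
    static final class LocalMaxMapper {
        // Running max for this task; stays null until a record is seen.
        private Long max = null;

        /** Called once per record: keep the value only if it beats the current max. */
        void map(long value) {
            if (max == null || value > max) {
                max = value;
            }
        }

        /** Plays the role of close(): emit at most one record for the whole task. */
        List<Long> close() {
            List<Long> out = new ArrayList<>();
            if (max != null) {
                out.add(max);
            }
            return out;
        }
    }

    /** Runs one simulated map task over its input split. */
    public static List<Long> runMapTask(List<Long> split) {
        LocalMaxMapper mapper = new LocalMaxMapper();
        for (long value : split) {
            mapper.map(value);
        }
        return mapper.close();
    }

    /**
     * Simulates a job with a single reducer that applies the same max logic
     * to the per-task outputs. Returns null if every split was empty.
     */
    public static Long maxOfSplits(List<List<Long>> splits) {
        LocalMaxMapper reducer = new LocalMaxMapper(); // same operation, reused
        for (List<Long> split : splits) {
            for (long partial : runMapTask(split)) {
                reducer.map(partial);
            }
        }
        List<Long> result = reducer.close();
        return result.isEmpty() ? null : result.get(0);
    }

    public static void main(String[] args) {
        List<List<Long>> splits = Arrays.asList(
            Arrays.asList(3L, 17L, 5L),
            Arrays.asList(42L, 8L),
            new ArrayList<Long>()); // an empty split emits nothing
        System.out.println(maxOfSplits(splits)); // prints 42
    }
}
```

Because max() is associative and commutative, the same per-task logic could in principle also serve as a combiner, as Brian suggests above; the sketch does not model that step separately.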