There will be a short summary of the Hadoop aggregation tools in ch08; it was missed in the first pass and is being added back in this week. There are a number of howtos in the book, particularly in ch08 and ch09.
I hope you enjoy them.

On Tue, Apr 21, 2009 at 8:24 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> > Hey Jason,
> >
> > Wouldn't this be avoided if you used a combiner to also perform the max()
> > operation? A minimal amount of data would be written over the network.
> >
> > I can't remember if the map output gets written to disk first, then the
> > combine applied, or if the combine is applied and then the data is written
> > to disk. I suspect the latter, but it'd be a big difference.
> >
> > However, the original poster mentioned he was using hbase/pig -- certainly,
> > there's some better way to perform max() in hbase/pig? This list probably
> > isn't the right place to ask if you are using those technologies; I'd
> > suspect they do something more clever (certainly, you're performing a
> > SQL-like operation in MapReduce; not always the best way to approach this
> > type of problem).
> >
> > Brian
> >
> > On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:
> >
> >> The Hadoop Framework requires that a Map Phase be run before the Reduce
> >> Phase. By doing the initial 'reduce' in the map, a much smaller volume
> >> of data has to flow across the network to the reduce tasks. But yes,
> >> this could simply be done by using an IdentityMapper and then have all
> >> of the work done in the reduce.
> >>
> >> On Mon, Apr 20, 2009 at 4:26 AM, Shevek <had...@anarres.org> wrote:
> >>
> >>> On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
> >>>>
> >>>> The traditional approach would be a Mapper class that maintained a
> >>>> member variable in which you kept the max value record, and in the
> >>>> close method of your mapper you output a single record containing
> >>>> that value.
> >>>
> >>> Perhaps you can forgive the question from a heathen, but why is this
> >>> first mapper not also a reducer?
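Brian's combiner suggestion above can be sketched as a toy simulation in plain Python (not the Hadoop API; all names here are illustrative): each mapper's output is combined down to a single record before anything crosses the "network" to the reducer.

```python
# Toy simulation of combiner-based max(): per-mapper combine collapses
# each split's output to one record, so the shuffle carries one record
# per mapper instead of one per input row.

def map_phase(records):
    # identity-style map: emit (key, value) pairs
    return [("max", v) for v in records]

def combine(pairs):
    # runs on the mapper side; collapses all pairs to a single record
    return [("max", max(v for _, v in pairs))]

def reduce_phase(all_pairs):
    # single reducer computes the global max over the per-mapper maxima
    return max(v for _, v in all_pairs)

# two input splits, one per mapper
splits = [[3, 17, 9], [42, 5, 28]]
shuffled = []
for split in splits:
    shuffled.extend(combine(map_phase(split)))  # one record per mapper

print(len(shuffled), reduce_phase(shuffled))  # 2 42
```

The point of the simulation is only the data-volume argument: `shuffled` holds one record per mapper, regardless of how many rows each split contained.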
> >>> It seems to me that it is performing a reduce operation, and that maps
> >>> should (philosophically speaking) not maintain data from one input to
> >>> the next, since the order (and location) of inputs is not well defined.
> >>> The program to compute a maximum should then be a tree of reduction
> >>> operations, with no maps at all.
> >>>
> >>> Of course in this instance, what you propose works, but it does seem
> >>> puzzling. Perhaps the answer is a simple architectural limitation?
> >>>
> >>> S.
> >>>
> >>>> The map method of course compares the current record against the max
> >>>> and stores current in max when current is larger than max.
> >>>>
> >>>> Then each map output is a single record, and the reduce behaves very
> >>>> similarly, in that the close method outputs the final max record. A
> >>>> single reduce would be the simplest.
> >>>>
> >>>> On your question: a Mapper and a Reducer define 3 entry points:
> >>>> configure, called once on task start; the map/reduce, called once for
> >>>> each record; and close, called once after the last call to map/reduce.
> >>>> At least through 0.19, the close is not provided with the output
> >>>> collector or the reporter, so you need to save them in the map/reduce
> >>>> method.
> >>>>
> >>>> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com> wrote:
> >>>>
> >>>>> How do you identify that the map task is ending within the map
> >>>>> method? Is it possible to know which is the last call to the map
> >>>>> method?
> >>>>>
> >>>>> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >>>>>
> >>>>>> I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
> >>>>>> support the ability to max(). I am writing my own max() over a
> >>>>>> simple one-column dataset.
> >>>>>>
> >>>>>> The best solution I came up with was using MapRunner.
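jason's pattern above -- a mapper that tracks the max in a member variable and emits one record in close(), saving the collector inside map() because the 0.19-era close() is not handed one -- can be simulated in plain Python (a sketch, not the Hadoop API; the class and names are illustrative):

```python
# Plain-Python sketch of the member-variable max() mapper: map() only
# updates state, and close() emits a single record for the whole split.

class MaxMapper:
    def __init__(self):
        self.max_value = None
        self.collector = None  # saved on each map() call, mirroring the
                               # 0.19 workaround of stashing the
                               # OutputCollector for use in close()

    def map(self, value, collector):
        self.collector = collector
        if self.max_value is None or value > self.max_value:
            self.max_value = value

    def close(self):
        # called once, after the last map() call; emits the single record
        if self.collector is not None:
            self.collector.append(("max", self.max_value))

output = []
mapper = MaxMapper()
for v in [12, 99, 7]:
    mapper.map(v, output)
mapper.close()
print(output)  # [('max', 99)] -- one record for the entire split
```

This also answers Farhan's question in the simulation: the mapper never detects the "last" record; the framework signals end-of-input by calling close().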
> >>>>>> With MapRunner I can store the highest value in a private member
> >>>>>> variable. I can read through the entire data set and only have to
> >>>>>> emit one value per mapper upon completion of the map data. Then I
> >>>>>> can specify one reducer and carry out the same operation.
> >>>>>>
> >>>>>> Does anyone have a better tactic? I thought a counter could do
> >>>>>> this, but are they atomic?
> >>
> >> --
> >> Alpha Chapters of my book on Hadoop are available
> >> http://www.apress.com/book/view/9781430219422
>
> I took a look at the description of the book
> http://www.apress.com/book/view/9781430219422. Hopefully it and other
> endeavors like it can fill a need I have and see quite often. I am
> quite interested in practical Hadoop algorithms. Most of my searching
> finds repeated WordCount examples and depictions of the shuffle-sort.
>
> The most practical lessons I took from my programming with Fortran were
> how to sum(), min(), max(), and average() a data set. If Hadoop had a
> cookbook of sorts for algorithm design, I think many people would
> benefit.

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
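Edward's MapRunner tactic can be sketched the same way (a hypothetical plain-Python analogue, not the real MapRunner API): because run() owns the record loop, a local variable can hold the running max and a single value is emitted after the last record, with one reducer repeating the operation over the per-mapper maxima.

```python
# Plain-Python analogue of the MapRunner approach: run() drives the loop
# over the whole split, so no per-record emit is needed.

def run(records, output):
    current_max = None
    for v in records:  # the entire split, one loop
        if current_max is None or v > current_max:
            current_max = v
    if current_max is not None:
        output.append(current_max)  # single emit per mapper

# each split is one "mapper"; a single "reducer" reruns the same operation
per_mapper = []
for split in [[4, 16, 8], [23, 15]]:
    run(split, per_mapper)

final = []
run(per_mapper, final)
print(final)  # [23]
```

On the counter question: counters are for aggregate statistics (sums of increments), so while their updates are applied reliably, they have no compare-and-keep-the-larger operation; tracking a max in the map/run code, as above, is the workable route in this sketch.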