The traditional approach would be a Mapper class that maintains a member
variable holding the current maximum record; in the close method of your
mapper you output a single record containing that value.

The map method, of course, compares the current record against the stored
max and replaces the max when the current record is larger.

Each map task therefore outputs a single record, and the reducer behaves
very similarly, in that its close method outputs the final max record. A
single reducer would be simplest. A rough sketch of such a mapper is below.
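
Something along these lines, using the old org.apache.hadoop.mapred API
(roughly what 0.19 ships). The LongWritable/NullWritable types and the
assumption that each input line holds a single long value are mine, not
from your dataset, so treat this as a sketch rather than a drop-in class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, LongWritable> {

  private long max = Long.MIN_VALUE;
  private boolean sawRecord = false;
  // close() gets no collector, so remember the one passed to map().
  private OutputCollector<NullWritable, LongWritable> collector;

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, LongWritable> output, Reporter reporter)
      throws IOException {
    collector = output;
    // Assumes each input line is a single long value.
    long current = Long.parseLong(value.toString().trim());
    if (!sawRecord || current > max) {
      max = current;
      sawRecord = true;
    }
  }

  @Override
  public void close() throws IOException {
    // Emit a single record per map task: the largest value seen.
    if (sawRecord && collector != null) {
      collector.collect(NullWritable.get(), new LongWritable(max));
    }
  }
}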

On your question: a Mapper and a Reducer each define 3 entry points:
configure, called once at task start; map/reduce, called once for each
record; and close, called once after the last call to map/reduce.
At least through 0.19, close is not provided with the output collector or
the reporter, so you need to save them in the map/reduce method.
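
The reduce side would look much the same; again this is only a sketch under
the same assumptions as the mapper above (one reducer configured via
setNumReduceTasks(1), LongWritable values), with the collector saved from
reduce() because close() does not receive it:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxReducer extends MapReduceBase
    implements Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {

  private long max = Long.MIN_VALUE;
  private boolean sawRecord = false;
  private OutputCollector<NullWritable, LongWritable> collector;

  public void reduce(NullWritable key, Iterator<LongWritable> values,
      OutputCollector<NullWritable, LongWritable> output, Reporter reporter)
      throws IOException {
    collector = output;
    while (values.hasNext()) {
      long current = values.next().get();
      if (!sawRecord || current > max) {
        max = current;
        sawRecord = true;
      }
    }
  }

  @Override
  public void close() throws IOException {
    // Emit the single global maximum after the last reduce() call.
    if (sawRecord && collector != null) {
      collector.collect(NullWritable.get(), new LongWritable(max));
    }
  }
}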

On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com> wrote:

> How do you identify that map task is ending within the map method? Is it
> possible to know which is the last call to map method?
>
> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxg...@gmail.com
> >wrote:
>
> > I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
> > support the ability to max(). I am writing my own max() over a simple
> > one column dataset.
> >
> > The best solution I came up with was using MapRunner. With maprunner I
> > can store the highest value in a private member variable. I can read
> > through the entire data set and only have to emit one value per mapper
> > upon completion of the map data. Then I can specify one reducer and
> > carry out the same operation.
> >
> > Does anyone have a better tactic. I thought a counter could do this
> > but are they atomic?
> >
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
