Hey Jason,
Wouldn't this be avoided if you used a combiner to also perform the max() operation? A minimal amount of data would be written over the network.
I can't remember whether the map output gets written to disk first and the combine is applied afterwards, or whether the combine is applied before the data is written to disk. I suspect the latter, but it would make a big difference.
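Something like the following is what I have in mind -- an untested sketch against the old API, assuming the map emits LongWritable values under a single constant key (MaxReducer is a made-up name):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits only the largest value seen for a key. Registered as the
// combiner, it runs on each map's output, so at most one value per
// map task crosses the network.
public class MaxReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long max = Long.MIN_VALUE;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    output.collect(key, new LongWritable(max));
  }
}

Since max() is associative and commutative, the same class can be registered as both the combiner and the reducer: conf.setCombinerClass(MaxReducer.class) and conf.setReducerClass(MaxReducer.class).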
However, the original poster mentioned he was using hbase/pig -- surely there's a better way to perform max() in hbase/pig? This list probably isn't the right place to ask if you are using those technologies; I'd suspect they do something more clever. (Certainly, you're performing a SQL-like operation in MapReduce, which is not always the best way to approach this type of problem.)
Brian
On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:
The Hadoop Framework requires that a Map Phase be run before the Reduce Phase. By doing the initial 'reduce' in the map, a much smaller volume of data has to flow across the network to the reduce tasks.
But yes, this could simply be done by using an IdentityMapper and then having all of the work done in the reduce.
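For example (an untested sketch against the old API; MaxJob and MaxReducer are made-up names, and the input/output paths, formats, and key/value classes are omitted):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MaxJob {
  public static void main(String[] args) throws Exception {
    // IdentityMapper variant: every input record flows across the
    // network to the single reduce task, which does all of the
    // max() work in one place.
    JobConf conf = new JobConf(MaxJob.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(1);
    conf.setReducerClass(MaxReducer.class); // keeps a running max
    // input/output paths, formats, and key/value classes omitted
    JobClient.runJob(conf);
  }
}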
On Mon, Apr 20, 2009 at 4:26 AM, Shevek <had...@anarres.org> wrote:
On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
The traditional approach would be a Mapper class that maintained a member variable in which you kept the max-value record, and in the close method of your mapper you output a single record containing that value.
Perhaps you can forgive the question from a heathen, but why is this first mapper not also a reducer? It seems to me that it is performing a reduce operation, and that maps should (philosophically speaking) not maintain data from one input to the next, since the order (and location) of inputs is not well defined. The program to compute a maximum should then be a tree of reduction operations, with no maps at all.
Of course in this instance what you propose works, but it does seem puzzling. Perhaps the answer is a simple architectural limitation?
S.
The map method of course compares the current record against the max and stores the current record in max when it is larger. Each map's output is then a single record, and the reduce behaves very similarly, in that the close method outputs the final max record. A single reduce would be the simplest.
On your question: a Mapper and a Reducer each define three entry points: configure, called once on task start; map/reduce, called once for each record; and close, called once after the last call to map/reduce.
At least through 0.19, close is not provided with the output collector or the reporter, so you need to save them in the map/reduce method.
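In code, the map side looks roughly like this -- an untested sketch against the 0.19 API, where MaxMapper is a made-up name and the Text-to-long parsing just assumes a single numeric column:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private long max = Long.MIN_VALUE;
  private boolean sawRecord = false;
  // close() gets neither the collector nor the reporter, so stash
  // the collector here on every call to map().
  private OutputCollector<Text, LongWritable> output;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    this.output = output;
    long current = Long.parseLong(value.toString().trim());
    if (!sawRecord || current > max) {
      max = current;
      sawRecord = true;
    }
  }

  // Called once after the last map() call: emit the single max record.
  public void close() throws IOException {
    if (sawRecord) {
      output.collect(new Text("max"), new LongWritable(max));
    }
  }
}

The reducer follows the same pattern; with a single reduce task, its close emits the global max.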
On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com> wrote:
How do you identify that the map task is ending within the map method? Is it possible to know which is the last call to the map method?
On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase support max(). I am writing my own max() over a simple one-column dataset.
The best solution I came up with was using MapRunner. With MapRunner I can store the highest value in a private member variable, read through the entire data set, and emit only one value per mapper upon completion of the map data. Then I can specify one reducer and carry out the same operation.
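Roughly what I did -- an untested reconstruction in which I implement MapRunnable directly; MaxMapRunner is my own name for it, and the parsing assumes my one-column numeric data:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Reads its whole split itself and emits exactly one record: the
// largest value seen. Registered via conf.setMapRunnerClass(MaxMapRunner.class).
public class MaxMapRunner
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  public void configure(JobConf job) {
  }

  public void run(RecordReader<LongWritable, Text> input,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();
    long max = Long.MIN_VALUE;
    boolean sawRecord = false;
    while (input.next(key, value)) {
      long current = Long.parseLong(value.toString().trim());
      if (!sawRecord || current > max) {
        max = current;
        sawRecord = true;
      }
    }
    if (sawRecord) {
      output.collect(new Text("max"), new LongWritable(max));
    }
  }
}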
Does anyone have a better tactic? I thought a counter could do this, but are they atomic?
--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422