Re: max value for a dataset

Edward Capriolo Tue, 21 Apr 2009 08:25:13 -0700

On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> Hey Jason,
>
> Wouldn't this be avoided if you used a combiner to also perform the max()
> operation?  A minimal amount of data would be written over the network.
>
> I can't remember if the map output gets written to disk first, then combine
> applied or if the combine is applied and then the data is written to disk.
>  I suspect the latter, but it'd be a big difference.
>
> However, the original poster mentioned he was using hbase/pig -- certainly,
> there's some better way to perform max() in hbase/pig?  This list probably
> isn't the right place to ask if you are using those technologies; I'd
> suspect they do something more clever (certainly, you're performing a
> SQL-like operation in MapReduce; not always the best way to approach this
> type of problem).
>
> Brian
>
> On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:
>
>> The Hadoop Framework requires that a Map Phase be run before the Reduce
>> Phase.
>> By doing the initial 'reduce' in the map, a much smaller volume of data
>> has
>> to flow across the network to the reduce tasks.
>> But yes, this could simply be done by using an IdentityMapper and then
>> have
>> all of the work done in the reduce.
>>
>>
>> On Mon, Apr 20, 2009 at 4:26 AM, Shevek <had...@anarres.org> wrote:
>>
>>> On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
>>>>
>>>> The traditional approach would be a Mapper class that maintained a
>>>> member
>>>> variable that you kept the max value record, and in the close method of
>>>
>>> your
>>>>
>>>> mapper you output a single record containing that value.
>>>
>>> Perhaps you can forgive the question from a heathen, but why is this
>>> first mapper not also a reducer? It seems to me that it is performing a
>>> reduce operation, and that maps should (philosophically speaking) not
>>> maintain data from one input to the next, since the order (and location)
>>> of inputs is not well defined. The program to compute a maximum should
>>> then be a tree of reduction operations, with no maps at all.
>>>
>>> Of course in this instance, what you propose works, but it does seem
>>> puzzling. Perhaps the answer is simple architectural limitation?
>>>
>>> S.
>>>
>>>> The map method of course compares the current record against the max and
>>>> stores current in max when current is larger than max.
>>>>
>>>> Then each map output is a single record and the reduce behaves very
>>>> similarly, in that the close method outputs the final max record. A
>>>
>>> single
>>>>
>>>> reduce would be the simplest.
>>>>
>>>> On your question a Mapper and Reducer defines 3 entry points, configure,
>>>> called once on task start, the map/reduce called once for each
>>>> record,
>>>> and close, called once after the last call to map/reduce.
>>>> at least through 0.19, the close is not provided with the output
>>>
>>> collector
>>>>
>>>> or the reporter, so you need to save them in the map/reduce method.
>>>>
>>>> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com>
>>>
>>> wrote:
>>>>
>>>>> How do you identify that map task is ending within the map method? Is
>>>
>>> it
>>>>>
>>>>> possible to know which is the last call to map method?
>>>>>
>>>>> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <
>>>
>>> edlinuxg...@gmail.com
>>>>>>
>>>>>> wrote:
>>>>>
>>>>>> I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
>>>>>> support the ability to max(). I am writing my own max() over a simple
>>>>>> one column dataset.
>>>>>>
>>>>>> The best solution I came up with was using MapRunner. With maprunner
>>>
>>> I
>>>>>>
>>>>>> can store the highest value in a private member variable. I can read
>>>>>> through the entire data set and only have to emit one value per
>>>
>>> mapper
>>>>>>
>>>>>> upon completion of the map data. Then I can specify one reducer and
>>>>>> carry out the same operation.
>>>>>>
>>>>>> Does anyone have a better tactic. I thought a counter could do this
>>>>>> but are they atomic?
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Alpha Chapters of my book on Hadoop are available
>> http://www.apress.com/book/view/9781430219422
>
>


I took a look at the description of the book
http://www.apress.com/book/view/9781430219422. Hopefully it and other
endeavors like it can fill a need I have and see quite often. I am
quite interested in practical Hadoop algorithms. Most of my searching
finds repeated WordCount examples and depictions of the shuffle-sort.

The most practical lessons I took from my programming with Fortran were
how to sum(), min(), max(), and average() a data set. If Hadoop had a
cookbook of sorts for algorithm design, I think many people would
benefit.
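
As a rough illustration of what one such cookbook entry might look like, here is a minimal sketch of the max() pattern discussed in this thread: each "mapper" keeps a running maximum over its split and emits one value at the end, and a single "reducer" repeats the same operation over those partial results. It is plain Java with no Hadoop dependency; the class name MaxSketch, the methods localMax and globalMax, and the use of long[] arrays to stand in for input splits are all assumptions made for illustration, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxSketch {
    // "Map" side (hypothetical helper): scan one input split, keep only the
    // running maximum, and return a single value once the split is exhausted.
    // This mirrors emitting one record from the mapper's close step.
    static long localMax(long[] split) {
        long max = Long.MIN_VALUE;
        for (long value : split) {
            if (value > max) {
                max = value;
            }
        }
        return max;
    }

    // "Reduce" side (hypothetical helper): a single reducer applies the same
    // comparison to the per-split maxima. Empty splits contribute nothing.
    static long globalMax(List<long[]> splits) {
        List<Long> partials = new ArrayList<>();
        for (long[] split : splits) {
            if (split.length > 0) {
                partials.add(localMax(split));
            }
        }
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("empty data set");
        }
        long max = Long.MIN_VALUE;
        for (long partial : partials) {
            max = Math.max(max, partial);
        }
        return max;
    }

    public static void main(String[] args) {
        // Three made-up splits of a one-column data set.
        List<long[]> splits = new ArrayList<>();
        splits.add(new long[] {3, 17, 5});
        splits.add(new long[] {42, 8});
        splits.add(new long[] {-1, 0});
        System.out.println(globalMax(splits)); // prints 42
    }
}
```

Only one value per split crosses from the map side to the reduce side, which is the point made earlier in the thread about keeping network traffic small. The same shape works for sum() and min(); average() would need each split to pass along both a sum and a count.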
