I worked out how to use the aggregation service with streaming to do
this; it is entertainingly simple once you have figured it out.

Full details will be in ch08 of my book - buy a copy so I can afford to
write another :)

/tmp/numbers contains a file full of whitespace-separated whole numbers.
/tmp/LongMax.pl is the attached Perl script. The output will be a single
file, part-00000, in /tmp/numbers_max_output. Note: this job is run
using the local runner (-jt local), so only one reduce is allowed.

hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
  -jt local -fs file:/// \
  -input /tmp/numbers -output /tmp/numbers_max_output \
  -reducer aggregate -mapper LongMax.pl -file /tmp/LongMax.pl
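The attached script is not reproduced here, but a minimal mapper along
these lines satisfies the aggregate reducer's contract (this is a
sketch, not the actual LongMax.pl; it relies on the aggregate package's
LongValueMax function, and the "max" key name is arbitrary):

    #!/usr/bin/perl -w
    # Sketch of a streaming mapper for '-reducer aggregate'.
    # Tracks the largest whole number in this map task's input, then
    # emits one LongValueMax record; the aggregate reducer keeps the
    # maximum of the per-task maxima.
    use strict;

    my $max;
    while (my $line = <STDIN>) {
        for my $n (split ' ', $line) {
            next unless $n =~ /^-?\d+$/;   # skip anything that is not a whole number
            $max = $n if !defined($max) || $n > $max;
        }
    }
    print "LongValueMax:max\t$max\n" if defined $max;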
On Tue, Apr 21, 2009 at 7:42 PM, jason hadoop <jason.had...@gmail.com> wrote:

> There is no reason to use a combiner in this case, as there is only a
> single output record from the map.
>
> Combiners buy you data reduction when you have output values in your
> map that share keys, and your application allows you to do something
> with the values that results in smaller/fewer records being passed to
> the reduce.
>
> On Mon, Apr 20, 2009 at 4:24 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
>> Hey Jason,
>>
>> Wouldn't this be avoided if you used a combiner to also perform the
>> max() operation? A minimal amount of data would be written over the
>> network.
>>
>> I can't remember if the map output gets written to disk first and the
>> combine applied afterward, or if the combine is applied and then the
>> data is written to disk. I suspect the latter, but it'd be a big
>> difference.
>>
>> However, the original poster mentioned he was using hbase/pig --
>> certainly, there's some better way to perform max() in hbase/pig?
>> This list probably isn't the right place to ask if you are using
>> those technologies; I'd suspect they do something more clever
>> (certainly, you're performing a SQL-like operation in MapReduce; not
>> always the best way to approach this type of problem).
>>
>> Brian
>>
>> On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:
>>
>>> The Hadoop framework requires that a map phase be run before the
>>> reduce phase.
>>> By doing the initial 'reduce' in the map, a much smaller volume of
>>> data has to flow across the network to the reduce tasks.
>>> But yes, this could simply be done by using an IdentityMapper and
>>> then having all of the work done in the reduce.
>>>
>>> On Mon, Apr 20, 2009 at 4:26 AM, Shevek <had...@anarres.org> wrote:
>>>
>>>> On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
>>>>
>>>>> The traditional approach would be a Mapper class that maintains a
>>>>> member variable holding the max-value record; in the close method
>>>>> of your mapper you output a single record containing that value.
>>>>
>>>> Perhaps you can forgive the question from a heathen, but why is
>>>> this first mapper not also a reducer? It seems to me that it is
>>>> performing a reduce operation, and that maps should
>>>> (philosophically speaking) not maintain data from one input to the
>>>> next, since the order (and location) of inputs is not well defined.
>>>> The program to compute a maximum should then be a tree of reduction
>>>> operations, with no maps at all.
>>>>
>>>> Of course in this instance, what you propose works, but it does
>>>> seem puzzling. Perhaps the answer is a simple architectural
>>>> limitation?
>>>>
>>>> S.
>>>>
>>>>> The map method of course compares the current record against the
>>>>> max and stores current in max when current is larger than max.
>>>>>
>>>>> Then each map output is a single record, and the reduce behaves
>>>>> very similarly, in that the close method outputs the final max
>>>>> record. A single reduce would be the simplest.
>>>>>
>>>>> On your question: a Mapper and a Reducer define three entry points
>>>>> -- configure, called once on task start; map/reduce, called once
>>>>> for each record; and close, called once after the last call to
>>>>> map/reduce.
>>>>> At least through 0.19, the close is not provided with the output
>>>>> collector or the reporter, so you need to save them in the
>>>>> map/reduce method.
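(In code, the close-method approach described above comes out roughly as
follows -- a sketch against the 0.19-era org.apache.hadoop.mapred API,
with illustrative class and key names:)

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: keep the running max in a member variable and emit it
    // once, from close().
    public class MaxMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private long max = Long.MIN_VALUE;
      private boolean sawAny = false;
      // Through 0.19, close() is not handed the collector, so save it.
      private OutputCollector<Text, LongWritable> out;

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        out = output;                        // remembered for close()
        for (String tok : value.toString().trim().split("\\s+")) {
          if (tok.length() == 0) continue;
          long current = Long.parseLong(tok);
          sawAny = true;
          if (current > max) max = current;  // keep only the largest
        }
      }

      @Override
      public void close() throws IOException {
        if (sawAny) {                        // one record per map task
          out.collect(new Text("max"), new LongWritable(max));
        }
      }
    }

The reducer is the same pattern: track the max across all incoming
records and emit it from close(), with the job configured for a single
reduce.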
>>>>> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russ...@gmail.com> wrote:
>>>>>
>>>>>> How do you identify that the map task is ending within the map
>>>>>> method? Is it possible to know which is the last call to the map
>>>>>> method?
>>>>>>
>>>>>> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>
>>>>>>> I jumped into Hadoop at the 'deep end'. I know pig, hive, and
>>>>>>> hbase support the ability to max(). I am writing my own max()
>>>>>>> over a simple one-column dataset.
>>>>>>>
>>>>>>> The best solution I came up with was using MapRunner. With
>>>>>>> MapRunner I can store the highest value in a private member
>>>>>>> variable. I can read through the entire data set and only have
>>>>>>> to emit one value per mapper upon completion of the map data.
>>>>>>> Then I can specify one reducer and carry out the same operation.
>>>>>>>
>>>>>>> Does anyone have a better tactic? I thought a counter could do
>>>>>>> this, but are they atomic?

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
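For completeness, the MapRunner variant Edward describes comes out along
these lines (again a sketch against the 0.19 mapred API, with
illustrative names; run() is handed the whole input split, so the end of
the input is simply the end of the loop, with no close() bookkeeping
needed):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapRunnable;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch of the MapRunner variant: run() consumes the whole split,
    // so "after the last record" is simply the code after the loop.
    public class MaxMapRunner
        implements MapRunnable<LongWritable, Text, Text, LongWritable> {

      public void configure(JobConf job) { }

      public void run(RecordReader<LongWritable, Text> input,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        LongWritable key = input.createKey();
        Text value = input.createValue();
        long max = Long.MIN_VALUE;
        boolean sawAny = false;

        while (input.next(key, value)) {
          for (String tok : value.toString().trim().split("\\s+")) {
            if (tok.length() == 0) continue;
            long current = Long.parseLong(tok);
            sawAny = true;
            if (current > max) max = current;
          }
        }
        if (sawAny) {
          // One output record per input split.
          output.collect(new Text("max"), new LongWritable(max));
        }
      }
    }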