Assuming N is not too large, in the sense that your reducers can keep a
tree-map of N elements, then you can have your reducer maintain the top N
elements in a tree-map (or a priority queue, or a heap, whatever), with
counts as keys in the tree-map. As the reducers progress, you throw away
the items with smaller counts, and always keep the top N seen so far in the
tree-map. At the end of the reducing process, each reducer outputs the
elements in its tree-map, which should be the top N in that reducer. If your
job has only one reducer, then the output is your final answer. Otherwise,
you can use a second job (or any tools) to merge/sort the R output files
(each has the top N elements for the corresponding reducer). If N * R is
much smaller than the input size, the second step should take much less
time than the first one.
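That tree-map loop can be sketched in plain Java, outside Hadoop; the class and method names below are illustrative, not part of any Hadoop API:

```java
import java.util.TreeMap;

// Plain-Java sketch of the tree-map idea above. Counts are the keys, so the
// smallest count is always at firstKey() and is cheap to evict once the map
// holds more than N elements.
public class TopN {
    private final int n;
    private final TreeMap<Long, String> best = new TreeMap<>();

    public TopN(int n) { this.n = n; }

    // Offer one (item, totalCount) pair, as a reducer would after summing.
    // Caveat: with counts as keys, two items with the same count collide;
    // a real job would break ties, e.g. by keying on (count, item).
    public void offer(String item, long count) {
        best.put(count, item);
        if (best.size() > n) {
            best.remove(best.firstKey()); // throw away the smallest count seen so far
        }
    }

    public long smallestKept() { return best.firstKey(); }

    public int size() { return best.size(); }
}
```

At the end of the reduce, iterating the map in descending key order gives the top N for that reducer.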
On Fri, Sep 10, 2010 at 5:37 PM, Neil Ghosh neil.gh...@gmail.com wrote:
Hi Alex,
Thanks so much for the reply. As of now I don't have any issue with two
jobs. I was just making sure that I am not missing any obvious way of
writing the program in one job. I will get back if I need to optimize
performance based on a specific pattern of input.
Thank you all for helping me on this issue.
Neil
On Sat, Sep 11, 2010 at 5:56 AM, Alex Kozlov ale...@cloudera.com wrote:
Hi Neil,
Uniques and Top N, as well as percentiles, are inherently difficult to
distribute/parallelize since you have to have a global view of the
dataset.
You can optimize the computation given some assumptions about the input
(the number of unique values, the prevalence of the most frequent value
being above a certain threshold, etc.). There is no way to avoid two jobs
in the general case.
Can you state your problem, and any assumptions, more precisely?
--
Alex Kozlov
Solutions Architect
Cloudera, Inc
twitter: alexvk2009
Hadoop World 2010, October 12, New York City - Register now:
http://www.cloudera.com/company/press-center/hadoop-world-nyc/
On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh neil.gh...@gmail.com
wrote:
Thanks Aaron. I employed two Jobs and solved the problem.
I was just wondering whether there is any way it can be done in a single
job, so that disk/network I/O is lower and no temporary storage is
required between the first and second jobs.
Neil
On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff aaron.b...@telescope.tv
wrote:
I'm still fairly new at MapReduce, but here are my thoughts on a solution.
Use the Item as the Key and the Count as the Value; in the Reducer, sum up
all of the Counts and output the Item, sum(Count). To make it more
efficient, use the same Reducer as the Combiner.
Then do a 2nd Job where you map the Count as the Key and the Item as the
Value, use 1 Reducer, and Identity Reduce it (i.e. don't do any reducing,
just output the Count, Item).
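The key swap in that second job can be sketched in plain Java; the helper below is illustrative (not Hadoop API), and in an actual job the descending order would come from a decreasing sort comparator rather than an in-memory sort:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the second job's mapping: turn (item, count) into
// (count, item) and order by count descending, which is the order the
// single reducer should see and emit.
public class CountSwap {
    public static List<Map.Entry<Long, String>> byCountDesc(Map<String, Long> totals) {
        List<Map.Entry<Long, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            // swap: the count becomes the key, the item becomes the value
            swapped.add(new AbstractMap.SimpleEntry<>(e.getValue(), e.getKey()));
        }
        // Hadoop sorts keys ascending by default; reverse to get top counts first.
        swapped.sort((a, b) -> Long.compare(b.getKey(), a.getKey()));
        return swapped;
    }
}
```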
Aaron Baff | Developer | Telescope, Inc.
email: aaron.b...@telescope.tv | office: 424 270 2913 |
www.telescope.tv
-Original Message-
From: Neil Ghosh [mailto:neil.gh...@gmail.com]
Sent: Friday, September 10, 2010 3:51 PM
To: James Seigel
Cc: common-user@hadoop.apache.org
Subject: Re: TOP N items
Thanks James,
This gives me only N results for sure, but not necessarily the top N.
I have used the Item as key and Count as value as input to the reducer,
and my reducing logic is to sum the count for a particular item.
Now my output comes out grouped but not in order.
Do I need to use a custom comparator?
Thanks
Neil
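(Yes: with Count as the key, Hadoop's default key sort is ascending, so the largest totals arrive last; a decreasing sort comparator, plugged in via job.setSortComparatorClass in the new API, fixes that. The ordering itself is just the natural order inverted, sketched here in plain Java for illustration:)

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the ordering a decreasing comparator produces: natural order
// inverted, so the largest counts come first. In an actual job this logic
// would live in a WritableComparator registered with the job; here it is
// plain Java for illustration.
public class DescendingCounts {
    public static Long[] sortDescending(Long[] counts) {
        Long[] copy = counts.clone();
        Arrays.sort(copy, Comparator.reverseOrder()); // invert the ascending default
        return copy;
    }
}
```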
On Sat, Sep 11, 2010 at 2:41 AM, James Seigel ja...@tynt.com wrote:
Welcome to the land of the fuzzy elephant!
Of course there are many ways to do it. Here is one; it might not be
brilliant or the right way, but I am sure you will get more :)
Use the identity mapper...
job.setMapperClass(Mapper.class);
then have one reducer
job.setNumReduceTasks(1);
then have a reducer that has something like this around your reducing
code...
Counter counter = context.getCounter("ME", "total output records");
if (counter.getValue() < LIMIT) {
    // do your reducey stuff here