Re: TOP N items

2010-09-10 Thread Runping Qi
Assuming N is not too large, in the sense that your reducers can keep a tree
map of N elements, you can have your reducer maintain the top N elements in a
tree map (or a priority queue, a heap, whatever), with counts as keys. As the
reducer progresses, you throw away the items with smaller counts and always
keep the top N seen so far in the tree map. At the end of the reduce phase,
each reducer outputs the elements in its tree map, which are the top N for
that reducer. If your job has only one reducer, then the output is your final
answer. Otherwise, you can use a second job (or any other tool) to merge/sort
the R output files (each holding the top N elements for the corresponding
reducer). If N * R is much smaller than the input size, the second step should
take much less time than the first.
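The bounded tree-map idea can be sketched in plain Java, independent of any Hadoop classes; `TopNTracker` and its methods are illustrative names, not part of any API:

```java
import java.util.*;

// Bounded top-N tracker, sketching the reducer-side logic described
// above: keep at most N entries keyed by count, evicting an item with
// the smallest count whenever a new one pushes the size past N.
public class TopNTracker {
    private final int n;
    // TreeMap keyed by count; a list handles ties on the same count.
    private final TreeMap<Long, List<String>> byCount = new TreeMap<>();
    private int size = 0;

    public TopNTracker(int n) { this.n = n; }

    public void offer(String item, long count) {
        byCount.computeIfAbsent(count, k -> new ArrayList<>()).add(item);
        size++;
        if (size > n) {
            // Drop one item with the smallest count seen so far.
            Map.Entry<Long, List<String>> least = byCount.firstEntry();
            List<String> items = least.getValue();
            items.remove(items.size() - 1);
            if (items.isEmpty()) byCount.remove(least.getKey());
            size--;
        }
    }

    // Top-N items, highest count first.
    public List<String> result() {
        List<String> out = new ArrayList<>();
        for (List<String> items : byCount.descendingMap().values())
            out.addAll(items);
        return out;
    }
}
```

In a real reducer you would call `offer` per key in `reduce()` and emit `result()` from the close/cleanup hook, so each reducer writes exactly its local top N.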



On Fri, Sep 10, 2010 at 5:37 PM, Neil Ghosh neil.gh...@gmail.com wrote:

 Hi Alex,

 Thanks so much for the reply. As of now I don't have any issue with two
 jobs. I was just making sure that I am not missing any obvious way of
 writing the program in one job. I will get back if I need to optimize
 performance based on specific patterns in the input.

 Thank you so much you all for helping me on this issue.
 Neil

 On Sat, Sep 11, 2010 at 5:56 AM, Alex Kozlov ale...@cloudera.com wrote:

  Hi Neil,
 
  Uniques and Top N, as well as percentiles, are inherently difficult to
  distribute/parallelize, since you have to have a global view of the
  dataset. You can optimize the computation given some assumptions about the
  input (the # of unique values, the most frequent value's prevalence
  exceeding a certain threshold, etc.). There is no way to avoid two jobs in
  the general case.
 
  Can you describe your problem, and any assumptions, more precisely?
 
  --
  Alex Kozlov
  Solutions Architect
  Cloudera, Inc
  twitter: alexvk2009
 
  Hadoop World 2010, October 12, New York City - Register now:
  http://www.cloudera.com/company/press-center/hadoop-world-nyc/
 
  On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh neil.gh...@gmail.com
 wrote:
 
  Thanks Aaron. I employed two jobs and solved the problem.
 
  I was just wondering whether there is any way it can be done in a single
  job, so that disk/network I/O is lower and no temporary storage is
  required between the first and second jobs.
 
  Neil
 
  On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff aaron.b...@telescope.tv
  wrote:
 
   I'm still fairly new at MapReduce, but here are my thoughts on a solution.
  
   Use the Item as the key and the Count as the value; in the reducer, sum
   up all of the Counts and output (Item, sum(Count)). To make it more
   efficient, use the same reducer as the combiner.
  
   Then run a second job where you map the Count as the key and the Item as
   the value, use 1 reducer, and make it an identity reducer (i.e. don't do
   any reducing, just output (Count, Item)).
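   Stripped of the Hadoop plumbing, the second job amounts to a key swap
   followed by a sort on count. A minimal plain-Java sketch, where the class
   and method names are illustrative and an explicit sort stands in for the
   shuffle:

```java
import java.util.*;

// Plain-Java sketch of the second job: swap (item, count) pairs to
// (count, item) and order them by the count key, largest first, then
// keep the first N. Hadoop's shuffle would do the sorting in the real
// job; here an explicit descending sort stands in for it.
public class CountSortSketch {
    public static List<String> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(n, entries.size()))) {
            // Emit count first, mirroring the swapped (Count, Item) output.
            out.add(e.getValue() + "\t" + e.getKey());
        }
        return out;
    }
}
```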
  
   Aaron Baff | Developer | Telescope, Inc.
  
   email:  aaron.b...@telescope.tv | office:  424 270 2913 |
  www.telescope.tv
  
  
  
   -Original Message-
   From: Neil Ghosh [mailto:neil.gh...@gmail.com]
   Sent: Friday, September 10, 2010 3:51 PM
   To: James Seigel
   Cc: common-user@hadoop.apache.org
   Subject: Re: TOP N items
  
   Thanks James,
  
    This gives me only N results for sure, but not necessarily the top N.
   
    I have used the Item as the key and the Count as the value as input to
    the reducer, and my reducing logic is to sum the counts for a particular
    item.
   
    Now my output comes out grouped, but not in order.
   
    Do I need to use a custom comparator?
  
   Thanks
   Neil
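    For reference, this is what such a custom comparator would do: Hadoop
    sorts reducer-input keys ascending by default, so with counts as keys the
    output comes smallest-first, and reversing the comparison flips that.
    Shown here as a plain java.util.Comparator sketch; in the 0.19-era
    mapred API the equivalent hook is a WritableComparator registered via
    JobConf.setOutputKeyComparatorClass (check the docs for your version):

```java
import java.util.Comparator;

// Descending-order comparator for counts: larger counts sort first.
// A plain-Java stand-in for the raw-bytes comparator a Hadoop job
// would register on its output key class.
public class DescendingCountComparator implements Comparator<Long> {
    @Override
    public int compare(Long a, Long b) {
        // Arguments swapped relative to natural order.
        return b.compareTo(a);
    }
}
```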
  
   On Sat, Sep 11, 2010 at 2:41 AM, James Seigel ja...@tynt.com wrote:
  
Welcome to the land of the fuzzy elephant!
   
 Of course there are many ways to do it. Here is one; it might not be
 brilliant or the right way, but I am sure you will get more :)
   
Use the identity mapper...
   
job.setMapperClass(Mapper.class);
   
then have one reducer
   
job.setNumReduceTasks(1);
   
 then have a reducer that has something like this around your reducing
 code...

 Counter counter = context.getCounter("ME", "total output records");
 if (counter.getValue() < LIMIT) {
     // do your reducey stuff here
 }

Namenode warnings

2010-05-11 Thread Runping Qi
Hi,

I saw a lot of warnings like the following in the namenode log:

2010-05-11 06:45:07,186 WARN /: /listPaths/s:
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.ListPathsServlet.doGet(ListPathsServlet.java:153)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

I am using Hadoop 0.19.

Anybody knows what might be the problem?

Thanks,

Runping


Re: Namenode warnings

2010-05-11 Thread Runping Qi
So it's a known problem of Hadoop 0.19?


On Tue, May 11, 2010 at 11:06 AM, Allen Wittenauer awittena...@linkedin.com
 wrote:


 On May 11, 2010, at 9:53 AM, Runping Qi wrote:

  I am using Hadoop 0.19.
 
  Anybody knows what might be the problem?

 I think you answered your own question. :)




job stops progressing because tasktracker stop taking tasks

2009-10-25 Thread Runping Qi
I was testing a job on a single-node Hadoop cluster running Hadoop 0.19.
The single tasktracker has 2 reduce slots.
After finishing 8 of the 17 reduce tasks, the tasktracker stopped taking
any new tasks, and the job made no further progress.

Has anybody encountered similar situations?

Thanks,

Runping