Re: Shuffle speed?

2009-03-02 Thread hc busy
There are a few things that caused this to happen to me earlier on.

Make sure to check that it is actually making progress. Sometimes the slowness is
the result of negative progress: it gets to, say, 10% complete on reduce and then
drops back down to 5%. In that case the status output keeps showing that line with
the slow throughput rate.

Changing a few of the settings below did improve things, but ultimately what
fixed it for us was buying more hardware.

;-)
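
For what it's worth, here is roughly how the two knobs Jothi mentions below can be
overridden per job, assuming the job driver goes through ToolRunner (the jar name,
driver class, paths, and values here are placeholders, not recommendations):

  # illustrative only: bump the parallel fetchers and the shuffle buffer for one job
  hadoop jar my-job.jar MyDriver \
    -D mapred.reduce.parallel.copies=10 \
    -D mapred.job.shuffle.input.buffer.percent=0.70 \
    /input /output

The same properties can also be set cluster-wide in hadoop-site.xml.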

On Sun, Mar 1, 2009 at 10:21 PM, Jothi Padmanabhan joth...@yahoo-inc.com wrote:

 There are a lot of factors that affect shuffle speed.

 Some of them are:

 1. The number of reducers concurrently running on a node
 2. The number of parallel copier threads that are pulling in map data
 (mapred.reduce.parallel.copies)
 3. The size of the individual map outputs. If map outputs are huge, they are
 shuffled to disk, and there might be some contention if several files are
 written to disk at the same time
 4. The size of the buffer reserved to accommodate map outputs on the reducer
 side (mapred.job.shuffle.input.buffer.percent)

 Jothi



 On 2/28/09 6:55 AM, Nathan Marz nat...@rapleaf.com wrote:

  The Hadoop shuffle phase seems painfully slow. For example, I am
  running a very large job, and all the reducers report a status such as:
 
  reduce > copy (14266 of 28243 at 1.30 MB/s) >
 
  This is after all the mappers are finished. Is it supposed to be so
  slow?
 




Re: hadoop mapper 100% but cannot complete?

2008-12-12 Thread hc busy
ahhh, apologies for badmouthing hadoop...

So, I finally discovered one problem that may have caused this kind of
performance degradation. After growing the data set even larger, to 35 GB,
Hadoop crashed with a disk-full error. It would appear that the system will
actually continue to work when the disk is almost full, but there appears to
be something that causes it to slow down. Does HDFS juggle blocks around
when there isn't enough space on a slave machine? That would explain why it
was slowing down so much when the filesystem is almost full... Another part of
this is that I've updated my expectation of the speedup: accounting for the
sort that is happening, it is indeed faster.
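
As a sanity check (a sketch only; the reserved-space value below is just an
illustration), disk pressure on the DataNodes can be confirmed with dfsadmin, and
some headroom can be kept free for non-DFS use via dfs.datanode.du.reserved in
hadoop-site.xml:

  # shows per-node Capacity / DFS Used / DFS Remaining
  hadoop dfsadmin -report

  # hadoop-site.xml on each DataNode: reserve ~10 GB per volume for non-DFS use
  # <property>
  #   <name>dfs.datanode.du.reserved</name>
  #   <value>10737418240</value>
  # </property>

Note that the shuffle also spills to local disk under mapred.local.dir, which might
be part of why things slow down before HDFS itself reports an error.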

I've upgraded to 0.18.2, and I now see the exception that is slowing down
the reducer near the end of the run (pasted below). Any suggestions on this
one?


2008-12-12 12:06:28,403 WARN /:
/mapOutput?job=job_200812121139_0001&map=attempt_200812121139_0001_m_000114_0&reduce=2:
java.lang.IllegalStateException: Committed
  at org.mortbay.jetty.servlet.ServletHttpResponse.resetBuffer(ServletHttpResponse.java:212)
  at org.mortbay.jetty.servlet.ServletHttpResponse.sendError(ServletHttpResponse.java:375)
  at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2504)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
  at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
  at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
  at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
  at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
  at org.mortbay.http.HttpServer.service(HttpServer.java:954)
  at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
  at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
  at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
  at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
  at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
  at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
2008-12-12 12:06:31,107 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200812121139_0001_r_07_0 0.29801327% reduce > copy (135 of 151 at 1.42 MB/s) >


On Thu, Dec 11, 2008 at 12:13 PM, hc busy hc.b...@gmail.com wrote:

 And aside from refusing to declare the task complete after everything is 100%,
 I also noticed that the mapper seems too slow. It's taking the same amount of
 time for 4 machines to read and write through the 30 GB file as if I did it
 with a /bin/cat on one machine. Do you guys have any suggestions with
 regards to these two problems?
 On Wed, Dec 10, 2008 at 4:37 PM, hc busy hc.b...@gmail.com wrote:

 Guys, I've just configured a Hadoop cluster for the first time, and I'm
 running a null map-reduction over the streaming interface (/bin/cat for
 both the mapper and the reducer). I noticed that the mapper and reducer reach
 100% in the web UI within a reasonable amount of time, but the job does not
 complete. On the command line it displays

 ...INFO streaming.StreamJob: map 100% reduce 100%

 In the web UI, it shows the map completion graph at 100%, but does not display
 a reduce completion graph. The four machines are well equipped to handle the
 size of the data (30 GB). Looking at the task tracker on each of the machines, I
 noticed that it is ticking up the percentages very, very slowly:

 2008-12-10 16:18:55,265 INFO org.apache.hadoop.mapred.TaskTracker:
 task_200812101532_0001_r_02_0 46.684883% Records R/W=149326846/149326834
  reduce
 2008-12-10 16:18:57,055 INFO org.apache.hadoop.mapred.TaskTracker:
 task_200812101532_0001_r_06_0 47.566963% Records R/W=151739348/151739342
  reduce
 2008-12-10 16:18:58,268 INFO org.apache.hadoop.mapred.TaskTracker:
 task_200812101532_0001_r_02_0 46.826576% Records R/W=149326846/149326834
  reduce
 2008-12-10 16:19:00,058 INFO org.apache.hadoop.mapred.TaskTracker:
 task_200812101532_0001_r_06_0 47.741756% Records R/W=153377016/153376990
  reduce
 2008-12-10 16:19:01,271 INFO org.apache.hadoop.mapred.TaskTracker:
 task_200812101532_0001_r_02_0 46.9636% Records R/W=149326846/149326834 
 reduce
 2008-12-10 16:19:03,061 INFO org.apache.hadoop.mapred.TaskTracker:
 task_200812101532_0001_r_06_0 47.94259% Records R/W=153377016/153376990
  reduce
 2008-12-10 16:19:04,274 INFO org.apache.hadoop.mapred.TaskTracker:
 task_200812101532_0001_r_02_0 47.110992% Records R/W=150960648/150960644
  reduce

 so it would continue like this for hours and hours. What buffer am I
 setting too small, or what could possibly make it go so slow? I've worked
 on hadoop clusters

hadoop mapper 100% but cannot complete?

2008-12-10 Thread hc busy
Guys, I've just configured a Hadoop cluster for the first time, and I'm
running a null map-reduction over the streaming interface (/bin/cat for
both the mapper and the reducer). I noticed that the mapper and reducer reach
100% in the web UI within a reasonable amount of time, but the job does not
complete. On the command line it displays

...INFO streaming.StreamJob: map 100% reduce 100%

In the web UI, it shows the map completion graph at 100%, but does not display a
reduce completion graph. The four machines are well equipped to handle the
size of the data (30 GB). Looking at the task tracker on each of the machines, I
noticed that it is ticking up the percentages very, very slowly:

2008-12-10 16:18:55,265 INFO org.apache.hadoop.mapred.TaskTracker:
task_200812101532_0001_r_02_0 46.684883% Records R/W=149326846/149326834
 reduce
2008-12-10 16:18:57,055 INFO org.apache.hadoop.mapred.TaskTracker:
task_200812101532_0001_r_06_0 47.566963% Records R/W=151739348/151739342
 reduce
2008-12-10 16:18:58,268 INFO org.apache.hadoop.mapred.TaskTracker:
task_200812101532_0001_r_02_0 46.826576% Records R/W=149326846/149326834
 reduce
2008-12-10 16:19:00,058 INFO org.apache.hadoop.mapred.TaskTracker:
task_200812101532_0001_r_06_0 47.741756% Records R/W=153377016/153376990
 reduce
2008-12-10 16:19:01,271 INFO org.apache.hadoop.mapred.TaskTracker:
task_200812101532_0001_r_02_0 46.9636% Records R/W=149326846/149326834 
reduce
2008-12-10 16:19:03,061 INFO org.apache.hadoop.mapred.TaskTracker:
task_200812101532_0001_r_06_0 47.94259% Records R/W=153377016/153376990
 reduce
2008-12-10 16:19:04,274 INFO org.apache.hadoop.mapred.TaskTracker:
task_200812101532_0001_r_02_0 47.110992% Records R/W=150960648/150960644
 reduce

so it would continue like this for hours and hours. What buffer am I setting
too small, or what could possibly make it go so slow? I've worked on
Hadoop clusters before, and they have always performed great on similarly sized
or larger data sets, so I suspect it's just a configuration setting somewhere
that is making it do this?
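
For reference, the null streaming job was launched along these lines (the
streaming jar path and the HDFS input/output paths are placeholders):

  # identity mapper and reducer via Hadoop streaming
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input /data/in \
    -output /data/out \
    -mapper /bin/cat \
    -reducer /bin/cat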

thanks in advance.