Hi there,

I am currently running a job that self-joins a 63 GB CSV file on 20 physically distinct nodes with 15 GB of RAM each.

While the map phase works just fine and is cheap, the reducer does the main work: it holds a hashmap of elements to join with and emits join tuples for every incoming key-value pair.
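For reference, the join logic inside my reducer looks roughly like this (a simplified sketch without the Hadoop Reducer boilerplate; class and method names are mine, not Hadoop API):

```java
import java.util.*;

public class SelfJoinSketch {
    // Buffer of previously seen rows, keyed by join key (the hashmap I mentioned).
    static Map<String, List<String>> buffer = new HashMap<>();

    // For each incoming (key, value) pair, emit one join tuple per buffered
    // row with the same key, then buffer the new row for later pairs.
    static List<String> joinPair(String key, String value) {
        List<String> output = new ArrayList<>();
        for (String earlier : buffer.getOrDefault(key, Collections.emptyList())) {
            output.add(key + "\t" + earlier + "," + value);
        }
        buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        return output;
    }
}
```

The important point is that the buffer grows with the number of distinct rows seen per reducer, which is why memory becomes the bottleneck on large inputs.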

The job works perfectly on small files of around 2 GB, but becomes "unstable" as the file size grows. This shows up in the TaskTracker's logs:

ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
    at org.mortbay.jetty.Response.resetBuffer(Response.java:1023)
    at org.mortbay.jetty.Response.sendError(Response.java:240)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3945)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


While this is no problem at the beginning of the reduce phase, where it happens only rarely and on a few nodes, it becomes critical as the job progresses. From what I have read, the usual cause is memory pressure or exhausted file handles. I addressed the memory problem by continuously purging outdated elements from the map after every 5 million processed key-value pairs, and I set mapred.child.ulimit to 100000000 (ulimit in the shell reports 400000000).
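The purging I describe is essentially counter-based eviction; this is a minimal sketch of it (field and method names are mine, and the interval is a parameter here so the logic is easy to check, whereas my job uses 5 million):

```java
import java.util.*;

public class PurgeSketch {
    final long interval;                         // purge every N processed pairs
    long processed = 0;
    Map<String, Long> lastSeen = new HashMap<>(); // key -> pair count when last seen
    Map<String, List<String>> buffer = new HashMap<>();

    PurgeSketch(long interval) { this.interval = interval; }

    // Called once per incoming key-value pair.
    void record(String key, String value) {
        processed++;
        lastSeen.put(key, processed);
        buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        if (processed % interval == 0) {
            // Evict every key that was not seen during the last interval.
            lastSeen.entrySet().removeIf(e -> {
                boolean stale = e.getValue() <= processed - interval;
                if (stale) buffer.remove(e.getKey());
                return stale;
            });
        }
    }
}
```

This keeps the buffer bounded as long as keys arrive roughly sorted, which is the case after the shuffle.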

Anyway, I am still running into these Mortbay errors, and I am starting to wonder whether Hadoop can handle this job with this algorithm at all. By naive math it should: I explicitly assigned 10 GB of memory to each JVM on each node by setting mapred.child.java.opts to "-Xmx10240m -XX:+UseCompressedOops -XX:-UseGCOverheadLimit" (it is a 64-bit environment, and large data structures make the GC throw exceptions). That gives 18 slave machines with 10 GB each, 180 GB in total, roughly three times as much as needed, I would think. So if the Partitioner distributes the keys more or less equally across all nodes, I should not run into any errors, should I?
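For completeness, these are the relevant entries in my mapred-site.xml (values as described above; whether 10 GB per child JVM is actually safe alongside the TaskTracker's own heap is part of my question):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx10240m -XX:+UseCompressedOops -XX:-UseGCOverheadLimit</value>
</property>
<property>
  <name>mapred.child.ulimit</name>
  <value>100000000</value>
</property>
```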

Can anybody help me with this issue?

Best regards,
Elmar
