Hi there,
i am currently running a job where i selfjoin a 63 gigabyte big csv file
on 20 physically distinct nodes with 15GB each:
While the mapping works just fine and is low cost, the reducer does the
main work: holding a hashmap with elements to join with and finding join
tuples for evry incoming key-value-pair.
The jobs works perfectly on small files with 2 gigabytes, but starts to
get "unstable" as the file size goes up: this becomes evident with a
look into the tasktracker's logs saying:
ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
at org.mortbay.jetty.Response.resetBuffer(Response.java:1023)
at org.mortbay.jetty.Response.sendError(Response.java:240)
at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3945)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
And while it is no problem at the beginning of the reduce process, where
this happens only on a few nodes and rarely, it becomes crucial as the
progress rises. The reason for this (afaik from reading articles), is
that there are memory or file handle problems. I addressed the memory
problem by conitiously purging the map of outdated elements evry 5
million processed key-value-pairs. And i set mapred.child.ulimit to
100000000 (ulimit in the shell tells me it is 400000000).
Anyway i am still running into those mortbay errors and i start to
wonder, if hadoop can manage the job with this algorithmn anyways. By
pure naive math it should be:
i explicily assigned 10GB memory to each JVM on each node and set
mapred.child.java.opts to "-Xmx10240m -XX:+UseCompressedOops
-XX:-UseGCOverheadLimit" (its a 64 bit environment and large
datastructures cause the GC to throw exceptions). This would naively
make 18 slave machines with 10GB each resulting in an overall memory of
180GB - three times as much as needed... i would think. So if the
Partitioner distributes them just about equally to all nodes i should
not run into any errors, do i?
Can anybody help me with this issue?
Best regards,
Elmar