Andreas Kostyrka wrote:
Ok, a new dead job: ;(

This time after 2.4 GB / 11.3M lines ;(

Any idea what I could do to debug this?
(I have no idea how to go about debugging a Java process that is distributed and processes GBs of data.

It's one of the big problems of distributed computing: distributed debugging.

How does one stabilize that kind of stuff to generate a reproducible situation?)

- If you are using VMware/Xen or similar with a private cluster, you can snapshot the entire cluster, but then you are left with many GB of machine state to deal with. Virtualisation also has its own problems, as timing can get very screwed up.

- For any hung process, kill -QUIT it and the JVM prints a stack trace of every thread.

- Logging, lots of it. Get all the boxes in perfect NTP-driven sync and you can correlate events across a single-site cluster. Dealing with logs from the other side of the world is a harder problem; don't go there if you can help it.
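On the kill -QUIT tip: a JVM that receives SIGQUIT writes a dump of every thread's stack to its own standard output. On a JDK, `jstack <pid>` prints a similar dump from a separate process. If you would rather have a process capture the same information itself, for example from a watchdog that notices a task has stalled, a minimal sketch using the standard `Thread.getAllStackTraces()` call follows. The class name `ThreadDump` is hypothetical. The output format is only an approximation of the JVM's own dump.

```java
import java.util.Map;

/** Hypothetical helper: builds a thread dump roughly like the one kill -QUIT produces. */
public class ThreadDump {

    /** Returns the name, state and stack frames of every live thread as text. */
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print to stdout, the same place the JVM writes its SIGQUIT dump.
        System.out.print(dump());
    }
}
```

Because the dump comes from inside the process, it can be written to the same log as everything else, with a timestamp, which makes it easier to line up with events on other nodes.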
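On the logging tip: once the clocks are NTP-synced, correlating events comes down to merging every node's log into one timeline ordered by timestamp. The sketch below makes several assumptions. The `LogMerge` class is hypothetical, it targets Java 16+ for the record syntax, and every line is assumed to start with an ISO-8601 UTC instant (e.g. `2008-03-27T10:00:01Z`) followed by a space. Real Hadoop log lines may use a different timestamp format and would need a different parser.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: merges per-host log lines into one time-ordered list. */
public class LogMerge {

    /** One log line tagged with the host it came from. */
    record Entry(Instant time, String host, String line) {}

    /**
     * Assumes each line is "<ISO-8601 instant> <message>". The ordering is only
     * meaningful if the hosts' clocks were kept in sync (e.g. via NTP).
     */
    public static List<String> merge(Map<String, List<String>> logsByHost) {
        List<Entry> all = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : logsByHost.entrySet()) {
            for (String line : e.getValue()) {
                int space = line.indexOf(' ');
                Instant t = Instant.parse(line.substring(0, space));
                all.add(new Entry(t, e.getKey(), line.substring(space + 1)));
            }
        }
        // List.sort is stable: lines with equal timestamps keep their input order.
        all.sort(Comparator.comparing(Entry::time));
        List<String> out = new ArrayList<>();
        for (Entry en : all) {
            out.add(en.time() + " [" + en.host() + "] " + en.line());
        }
        return out;
    }
}
```

Tagging each line with its host keeps the merged view readable. Sorting on a parsed `Instant` rather than on raw strings avoids surprises when timestamps have different fractional-second precision.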

The X-Trace team are trying to instrument Hadoop for better debugging:
http://radlab.cs.berkeley.edu/wiki/Projects/X-Trace_on_Hadoop

This looks really interesting.

--
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
