On Thu, Oct 8, 2009 at 9:20 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi James,
> This doesn't quite answer your original question, but if you want to help
> track down these kinds of bugs, you should grab a stack trace next time this
> happens.
>
> You can do this either using "jstack" from the command line, by visiting
> /stacks on the HTTP interface, or by sending the process a SIGQUIT (kill
> -QUIT <pid>). If you go the SIGQUIT route, the stack dump will show up in
> that daemon's stdout log (logs/hadoop-....out).
>
> Oftentimes the stack trace will be enough for the developers to track down a
> deadlock, or it may point to some sort of configuration issue on your
> machine.
>
> -Todd
>
> On Wed, Oct 7, 2009 at 11:19 PM, james warren <ja...@rockyou.com> wrote:
>
>> Quick question for the Hadoop / Linux masters out there:
>>
>> I recently observed a stalled TaskTracker daemon on our production cluster,
>> and was wondering if there are common tests to detect such failures so that
>> administration tools (e.g. monit) can automatically restart the daemon. The
>> particular symptoms I observed were:
>>
>> - the node was dropped by the JobTracker
>> - information in /proc listed the TaskTracker process as sleeping, not
>> zombie
>> - the web interface (port 50060) was unresponsive, though telnet did
>> connect
>> - no error information in the Hadoop logs -- they simply were no longer
>> being updated
>>
>> I certainly cannot be the first person to encounter this - anyone have a
>> neat and tidy solution they could share?
>>
>> (And yes, we will eventually go down the nagios / ganglia / cloudera
>> desktop path, but we're waiting until we're running CDH2.)
>>
>> Many thanks,
>> -James Warren
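For reference, the three stack-trace options Todd mentions can be sketched roughly as below. This is a hedged sketch, not an official procedure: the `pgrep` lookup and the port 50060 default (the TaskTracker HTTP port in this generation of Hadoop) are assumptions about a typical deployment.

```shell
#!/bin/sh
# Sketch of Todd's three options; pid lookup and port are assumptions.

stacks_url() {
  # Build the /stacks URL for a daemon's HTTP interface.
  # Defaults assume a TaskTracker on localhost:50060.
  host="${1:-localhost}"; port="${2:-50060}"
  echo "http://${host}:${port}/stacks"
}

# Option 1: jstack, shipped with the JDK (run as the daemon's owning user):
#   jstack "$(pgrep -f TaskTracker)" > /tmp/tt-stack.txt
#
# Option 2: the HTTP interface:
#   curl -s "$(stacks_url)" > /tmp/tt-stack.txt
#
# Option 3: SIGQUIT -- the dump goes to the daemon's stdout log
# (logs/hadoop-....out), not to your terminal:
#   kill -QUIT "$(pgrep -f TaskTracker)"

stacks_url
```

The SIGQUIT route is handy when the JVM is too wedged for jstack to attach, since the signal handler lives inside the JVM itself.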
James,

I am using Nagios to run a web check on each of the components' web interfaces:
http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/check_scripts/0_19/

I know there is a JIRA open to add lifecycle methods to each Hadoop component that can be polled for progress. I don't know the number offhand.

Edward
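A minimal sketch of this kind of check (not Edward's actual script; host, port, and timeout defaults are assumptions): the key point for James's failure mode is to require an HTTP response within a timeout, since a plain TCP check would have passed -- telnet still connected to the hung daemon.

```shell
#!/bin/sh
# Hedged sketch of a Nagios-style HTTP liveness check for a Hadoop daemon.
# Nagios convention: return 0 for OK, 2 for CRITICAL.

check_http() {
  host="$1"; port="$2"; timeout="${3:-10}"
  # -f: treat HTTP errors as failure; --max-time: fail if the daemon
  # accepts the connection but never answers (James's symptom).
  if curl -sf --max-time "$timeout" "http://${host}:${port}/" >/dev/null 2>&1; then
    echo "OK - web UI on ${host}:${port} responded"
    return 0
  else
    echo "CRITICAL - no HTTP response from ${host}:${port} within ${timeout}s"
    return 2
  fi
}

# Example: check a TaskTracker's web UI (port 50060 is the usual default):
#   check_http tt01.example.com 50060 10
```

Wired into monit or Nagios, a CRITICAL result can then trigger an automatic restart of the daemon.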