Quick question for the Hadoop / Linux masters out there: I recently observed a stalled tasktracker daemon on our production cluster, and was wondering whether there are common tests to detect such failures so that administration tools (e.g. monit) can automatically restart the daemon. The particular symptoms observed were:
- the node was dropped by the jobtracker
- information in /proc listed the tasktracker process as sleeping, not zombie
- the web interface (port 50060) was unresponsive, though telnet did connect
- no error information in the Hadoop logs; they simply were no longer being updated

I certainly cannot be the first person to encounter this. Does anyone have a neat and tidy solution they could share? (And yes, we will eventually go down the Nagios / Ganglia / Cloudera Desktop path, but we're waiting until we're running CDH2.)

Many thanks,
-James Warren
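
To make the question concrete, here is a rough sketch of the kind of test I have in mind, built only from the symptoms listed above. Since telnet connected but the web interface never answered, a bare port check would not have caught the stall, so the sketch requires an actual HTTP response within a timeout and also checks that the log file was modified recently. Only port 50060 comes from the observations; the function names, the log path argument, and the thresholds are hypothetical, and an idle but healthy node might also have quiet logs, so the staleness threshold would need tuning.

```python
import http.client
import os
import time
import urllib.error
import urllib.request


def http_responds(url, timeout=10.0):
    """Return True if the server sends any HTTP response within `timeout` seconds.

    A successful TCP connect is not enough: the stalled daemon accepted
    connections but never answered the request.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1)  # force at least part of the body to arrive
        return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and malformed responses.
        return False


def log_is_fresh(path, max_age_seconds, now=None):
    """Return True if `path` exists and was modified within `max_age_seconds`."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # a missing or unreadable log counts as stale
    return (now - mtime) <= max_age_seconds


def check_tasktracker(url, log_path, max_log_age_seconds, timeout=10.0):
    """Return 0 if both checks pass, 1 otherwise (suitable as an exit status)."""
    healthy = http_responds(url, timeout) and log_is_fresh(
        log_path, max_log_age_seconds
    )
    return 0 if healthy else 1
```

A small wrapper script could call `sys.exit(check_tasktracker("http://localhost:50060/", log_path, 600))`, with `log_path` pointing at wherever the tasktracker log lives on that node, so that a supervisor such as monit can act on the non-zero exit status.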