Re: detecting stalled daemons?
Edward Capriolo wrote:
> I know there is a Jira open to add life cycle methods to each hadoop
> component that can be polled for progress. I don't know the # off hand.

HDFS-326 (https://issues.apache.org/jira/browse/HDFS-326); the code has its own branch. This is still something I'm working on. The code works and all the tests pass, but there are some quirks with JobTracker startup that I'm not happy with: it now blocks waiting for the filesystem to come up. I need to add some new tests/mechanisms to shut down a service while it is still starting up, which includes interrupting the JT and TT.

You can get RPMs with all this stuff packaged up for use from http://smartfrog.org/ , with the caveat that it's still fairly unstable. I am currently working on the other side of the equation, integration with multiple cloud infrastructures, with all the fun testing issues that follow: http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0

* The simplest liveness test for any of the workers right now is to hit their HTTP pages; it's the classic happy test. We can and should extend this with more self-tests, some equivalent of Axis's happy.jsp. The nice thing about these is that they integrate well with all the existing web-page monitoring tools, though I should warn that the same tooling that tracks and reports the health of a four-way app server doesn't really scale to keeping an eye on 3000 task trackers. It's not the monitoring that's the problem, it's the reporting.

* Detecting failures of TTs and DNs is tricky too; it's really the namenode and jobtracker that know best. We need to get some reporting in there so that when either of the masters thinks one of its workers is playing up, it reports that to whatever plugin wants to know.

* Handling failures of VMs is very different from physical machines: you just kill the VM and start a new one. We don't need all the blacklisting stuff, just some infrastructure operations and various notifications to the ops team.

-steve
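Steve's "hit their HTTP pages" liveness test can be sketched in a few lines. This is a hypothetical probe, not the SmartFrog tooling he describes; the function name, URL, and timeout are assumptions. The key detail, per James's symptoms below, is that a stalled daemon may still accept TCP connections, so the check must be HTTP-level and must have a timeout:

```python
import urllib.request

def is_alive(url, timeout=5.0):
    """HTTP-level 'happy page' probe. A stalled daemon often still
    accepts TCP connections (telnet connects fine) but never sends an
    HTTP response, so the read timeout is what actually catches the hang."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # refused, timed out, DNS failure, or hung mid-response
        return False
```

For a TaskTracker this would be pointed at its web port, e.g. `is_alive("http://tt01:50060/")` (50060 being the TaskTracker's default web UI port in this era of Hadoop).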
Re: detecting stalled daemons?
Hi James,

This doesn't quite answer your original question, but if you want to help track down these kinds of bugs, you should grab a stack trace next time this happens. You can do this by using jstack from the command line, by visiting /stacks on the HTTP interface, or by sending the process a SIGQUIT (kill -QUIT pid). If you go the SIGQUIT route, the stack dump will show up in that daemon's stdout log (logs/hadoop-out). Oftentimes the stack trace will be enough for the developers to track down a deadlock, or it may point to some sort of configuration issue on your machine.

-Todd

On Wed, Oct 7, 2009 at 11:19 PM, james warren ja...@rockyou.com wrote:
> Quick question for the hadoop / linux masters out there: I recently observed a stalled tasktracker daemon on our production cluster, and was wondering if there were common tests to detect failures so that administration tools (e.g. monit) can automatically restart the daemon. The particular observed symptoms were:
>
> - the node was dropped by the jobtracker
> - information in /proc listed the tasktracker process as sleeping, not zombie
> - the web interface (port 50060) was unresponsive, though telnet did connect
> - no error information in the hadoop logs -- they simply were no longer being updated
>
> I certainly cannot be the first person to encounter this - anyone have a neat and tidy solution they could share? (And yes, we will eventually go down the nagios / ganglia / cloudera desktop path, but we're waiting until we're running CDH2.)
>
> Many thanks,
> -James Warren
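Todd's routes to a thread dump can be combined into one helper. A minimal sketch, with the caveat that the function names and arguments are mine, and the fallback order (try the HTTP /stacks servlet first, send SIGQUIT second) is just one reasonable choice for the failure mode James saw, where the web interface itself is wedged:

```python
import os
import signal
import urllib.request

def stacks_url(host, port):
    """URL of the /stacks servlet exposed on each Hadoop daemon's web port."""
    return "http://%s:%d/stacks" % (host, port)

def dump_stacks(host, port, pid, timeout=5.0):
    """Fetch a thread dump over HTTP; if the web UI is hung, fall back
    to SIGQUIT, which makes the JVM print the dump to the daemon's
    stdout log rather than returning it here."""
    try:
        with urllib.request.urlopen(stacks_url(host, port), timeout=timeout) as resp:
            return resp.read().decode("utf-8", "replace")
    except OSError:
        # HTTP interface unresponsive -- ask the JVM directly.
        os.kill(pid, signal.SIGQUIT)
        return None  # dump lands in the stdout log, not in this return value
```

Note that SIGQUIT only triggers a thread dump when it reaches a JVM; sent to the wrong pid it will kill an ordinary process, so double-check the pid before using the fallback.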
Re: detecting stalled daemons?
On Thu, Oct 8, 2009 at 9:20 PM, Todd Lipcon t...@cloudera.com wrote:
> This doesn't quite answer your original question, but if you want to help track down these kinds of bugs, you should grab a stack trace next time this happens. [...]
>
> On Wed, Oct 7, 2009 at 11:19 PM, james warren ja...@rockyou.com wrote:
>> Quick question for the hadoop / linux masters out there: I recently observed a stalled tasktracker daemon on our production cluster, and was wondering if there were common tests to detect failures so that administration tools (e.g. monit) can automatically restart the daemon. [...]

James,

I am using nagios to run a web_check on each of the components' web interfaces:
http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/check_scripts/0_19/

I know there is a Jira open to add life cycle methods to each hadoop component that can be polled for progress. I don't know the # off hand.

Edward
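Edward's nagios web check boils down to probing each daemon's web interface and mapping the result onto the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The sketch below is a hypothetical standalone version of that idea, not the script at the URL above:

```python
import sys
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_web(url, timeout=5.0):
    """Return (nagios_state, message) for one daemon's web interface."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return OK, "OK: %s answered with HTTP %d" % (url, resp.status)
    except urllib.error.HTTPError as e:
        # The daemon answered, but with an error page.
        return WARNING, "WARNING: %s returned HTTP %d" % (url, e.code)
    except OSError as e:
        # Refused, timed out, or hung: treat as a stalled daemon.
        return CRITICAL, "CRITICAL: %s unreachable (%s)" % (url, e)

if __name__ == "__main__":
    state, message = check_web(sys.argv[1])
    print(message)
    sys.exit(state)
```

Wired into monit or nagios, a CRITICAL exit from a check like this is what would let the daemon be restarted automatically, which is what James originally asked for.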