Re: detecting stalled daemons?

2009-10-15 Thread Steve Loughran

Edward Capriolo wrote:


I know there is a JIRA open to add lifecycle methods to each Hadoop
component that can be polled for progress. I don't know the number offhand.



HDFS-326 (https://issues.apache.org/jira/browse/HDFS-326); the code has
its own branch.


This is still something I'm working on. The code works and all the tests
pass, but there are some quirks I'm not happy with now that JobTracker
startup blocks waiting for the filesystem to come up; I need to add some
new tests and mechanisms to shut down a service while it is still starting
up, which includes interrupting the JT and TT.


You can get RPMs with all this stuff packaged up for use from
http://smartfrog.org/, with the caveat that it's still fairly unstable.


I am currently working on the other side of the equation, integration with
multiple cloud infrastructures, with all the fun testing issues that follow:

http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0


* The simplest liveness test for any of the workers right now is to hit 
their HTTP pages; it's the classic happy test. We can and should extend 
this with more self-tests, some equivalent of Axis's happy.jsp. The nice 
thing about these is that they integrate well with all the existing 
web-page monitoring tools, though I should warn that the same tooling that 
tracks and reports the health of a four-way app server doesn't really 
scale to keeping an eye on 3000 task trackers. It's not the monitoring 
that's the problem, but the reporting.
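
A minimal sketch of such a happy test, assuming Python 2.6 and the default
TaskTracker web port of 50060 (50070 for the namenode, 50030 for the
jobtracker). Note the timeout: a stalled daemon may still accept TCP
connections without ever answering, so a bare port probe is not enough.

    import sys
    import urllib2

    def is_happy(host, port=50060, timeout=10):
        # True only if the daemon's front page answers within the timeout.
        try:
            resp = urllib2.urlopen("http://%s:%d/" % (host, port),
                                   timeout=timeout)
            return resp.getcode() == 200
        except Exception:
            # Refused connections, timeouts and HTTP errors all count
            # as unhappy.
            return False

    if __name__ == "__main__":
        host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
        sys.exit(0 if is_happy(host) else 1)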


* Detecting failures of TTs and DNs is kind of tricky too; it's really 
the namenode and jobtracker that know best. We need to get some reporting 
in there so that when either of the masters thinks one of its workers is 
playing up, it reports that to whatever plugin wants to know.
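
Until that reporting exists, one crude way to get at the namenode's view
from outside is to scrape "hadoop dfsadmin -report", which already counts
dead datanodes. A rough sketch; the summary-line wording varies between
Hadoop versions, so treat the regex as an assumption to verify against
your own cluster's output:

    import re
    import subprocess
    import sys

    def dead_datanode_count():
        # Ask the namenode for its datanode report and pull the dead count
        # out of the summary line, e.g. "... (4 total, 1 dead)".
        out = subprocess.Popen(["hadoop", "dfsadmin", "-report"],
                               stdout=subprocess.PIPE).communicate()[0]
        m = re.search(r"\((\d+) total, (\d+) dead\)", out)
        if m is None:
            raise RuntimeError("unrecognised dfsadmin -report output")
        return int(m.group(2))

    if __name__ == "__main__":
        dead = dead_datanode_count()
        print "%d dead datanode(s)" % dead
        sys.exit(1 if dead else 0)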


* Handling failures of VMs is very different from handling failures of 
physical machines: you just kill the VM and start a new one. We don't 
need all the blacklisting stuff, just some infrastructure operations and 
various notifications to the ops team.


-steve







Re: detecting stalled daemons?

2009-10-08 Thread Todd Lipcon
Hi James,
This doesn't quite answer your original question, but if you want to help
track down these kinds of bugs, you should grab a stack trace next time this
happens.

You can do this either using jstack from the command line, by visiting
/stacks on the HTTP interface, or by sending the process a SIGQUIT (kill
-QUIT pid). If you go the SIGQUIT route, the stack dump will show up in
that daemon's stdout log (logs/hadoop-out).

Oftentimes the stack trace will be enough for the developers to track down a
deadlock, or it may point to some sort of configuration issue on your
machine.
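
A small sketch of both routes, assuming Python 2 and that you supply the
pid (e.g. from the daemon's .pid file) and the web port yourself:

    import os
    import signal
    import sys
    import urllib2

    def dump_via_sigquit(pid):
        # The JVM handles SIGQUIT by printing a thread dump to stdout and
        # carrying on; the dump lands in the daemon's .out log.
        os.kill(pid, signal.SIGQUIT)

    def dump_via_http(host, port):
        # Each daemon's embedded web server serves a thread dump at /stacks.
        return urllib2.urlopen("http://%s:%d/stacks" % (host, port)).read()

    if __name__ == "__main__":
        # Usage:  stackdump.py <pid>          (SIGQUIT route, local daemon)
        #         stackdump.py <host> <port>  (HTTP route)
        if len(sys.argv) == 2:
            dump_via_sigquit(int(sys.argv[1]))
        else:
            print dump_via_http(sys.argv[1], int(sys.argv[2]))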

-Todd


On Wed, Oct 7, 2009 at 11:19 PM, james warren ja...@rockyou.com wrote:

 Quick question for the hadoop / linux masters out there:

 I recently observed a stalled tasktracker daemon on our production cluster,
 and was wondering if there were common tests to detect failures so that
 administration tools (e.g. monit) can automatically restart the daemon. The
 particular observed symptoms were:

   - the node was dropped by the jobtracker
   - information in /proc listed the tasktracker process as sleeping, not
   zombie
   - the web interface (port 50060) was unresponsive, though telnet did
   connect
   - no error information in the hadoop logs -- they simply were no longer
   being updated

 I certainly cannot be the first person to encounter this - anyone have a
 neat and tidy solution they could share?

 (And yes, we will eventually go down the nagios / ganglia / Cloudera
 Desktop path, but we're waiting until we're running CDH2.)

 Many thanks,
 -James Warren



Re: detecting stalled daemons?

2009-10-08 Thread Edward Capriolo
On Thu, Oct 8, 2009 at 9:20 PM, Todd Lipcon t...@cloudera.com wrote:
 [...]

James,

I am using nagios to run a web_check on each of the components' web interfaces.

http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/check_scripts/0_19/
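
For anyone rolling their own, here is a sketch of what such a check might
look like -- not Edward's actual script (see the URL above for those), just
a minimal Python 2 probe following the nagios plugin convention of exit 0
for OK and exit 2 for CRITICAL:

    import sys
    import urllib2

    def main():
        host, port = sys.argv[1], int(sys.argv[2])
        try:
            urllib2.urlopen("http://%s:%d/" % (host, port), timeout=10)
        except Exception, e:
            print "CRITICAL: %s:%d web UI not responding (%s)" % (host,
                                                                  port, e)
            sys.exit(2)
        print "OK: %s:%d web UI responding" % (host, port)
        sys.exit(0)

    if __name__ == "__main__":
        main()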

I know there is a JIRA open to add lifecycle methods to each Hadoop
component that can be polled for progress. I don't know the number offhand.

Edward