On Thu, Oct 8, 2009 at 9:20 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi James,
> This doesn't quite answer your original question, but if you want to help
> track down these kinds of bugs, you should grab a stack trace next time this
> happens.
>
> You can do this either using "jstack" from the command line, by visiting
> /stacks on the HTTP interface, or by sending the process a SIGQUIT (kill
> -QUIT <pid>). If you go the SIGQUIT route, the stack dump will show up in
> that daemon's stdout log (logs/hadoop-....out).
>
> Oftentimes the stack trace will be enough for the developers to track down a
> deadlock, or it may point to some sort of configuration issue on your
> machine.
>
> -Todd
>
> On Wed, Oct 7, 2009 at 11:19 PM, james warren <ja...@rockyou.com> wrote:
>
>> Quick question for the Hadoop / Linux masters out there:
>>
>> I recently observed a stalled TaskTracker daemon on our production cluster,
>> and was wondering if there are common tests to detect such failures so that
>> administration tools (e.g. monit) can automatically restart the daemon. The
>> particular symptoms I observed were:
>>
>> - the node was dropped by the JobTracker
>> - information in /proc listed the TaskTracker process as sleeping, not
>> zombie
>> - the web interface (port 50060) was unresponsive, though telnet did
>> connect
>> - no error information in the Hadoop logs -- they simply were no longer
>> being updated
>>
>> I certainly cannot be the first person to encounter this - anyone have a
>> neat and tidy solution they could share?
>>
>> (And yes, we will eventually go down the nagios / ganglia / cloudera
>> desktop path, but we're waiting until we're running CDH2.)
>>
>> Many thanks,
>> -James Warren
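For reference, the three stack-trace options Todd mentions can be sketched roughly as below. This is a hedged sketch, not an official procedure: the `pgrep` lookup and the port 50060 default (the TaskTracker HTTP port in this generation of Hadoop) are assumptions about a typical deployment.

```shell
#!/bin/sh
# Sketch of Todd's three options; pid lookup and port are assumptions.

stacks_url() {
  # Build the /stacks URL for a daemon's HTTP interface.
  # Defaults assume a TaskTracker on localhost:50060.
  host="${1:-localhost}"; port="${2:-50060}"
  echo "http://${host}:${port}/stacks"
}

# Option 1: jstack, shipped with the JDK (run as the daemon's owning user):
#   jstack "$(pgrep -f TaskTracker)" > /tmp/tt-stack.txt
#
# Option 2: the HTTP interface:
#   curl -s "$(stacks_url)" > /tmp/tt-stack.txt
#
# Option 3: SIGQUIT -- the dump goes to the daemon's stdout log
# (logs/hadoop-....out), not to your terminal:
#   kill -QUIT "$(pgrep -f TaskTracker)"

stacks_url
```

The SIGQUIT route is handy when the JVM is too wedged for jstack to attach, since the signal handler lives inside the JVM itself.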
James,

I am using Nagios to run a web check on each of the components' web interfaces:
http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/check_scripts/0_19/

I know there is a JIRA open to add lifecycle methods to each Hadoop component that can be polled for progress. I don't know the number offhand.

Edward
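A minimal sketch of this kind of check (not Edward's actual script; host, port, and timeout defaults are assumptions): the key point for James's failure mode is to require an HTTP response within a timeout, since a plain TCP check would have passed -- telnet still connected to the hung daemon.

```shell
#!/bin/sh
# Hedged sketch of a Nagios-style HTTP liveness check for a Hadoop daemon.
# Nagios convention: return 0 for OK, 2 for CRITICAL.

check_http() {
  host="$1"; port="$2"; timeout="${3:-10}"
  # -f: treat HTTP errors as failure; --max-time: fail if the daemon
  # accepts the connection but never answers (James's symptom).
  if curl -sf --max-time "$timeout" "http://${host}:${port}/" >/dev/null 2>&1; then
    echo "OK - web UI on ${host}:${port} responded"
    return 0
  else
    echo "CRITICAL - no HTTP response from ${host}:${port} within ${timeout}s"
    return 2
  fi
}

# Example: check a TaskTracker's web UI (port 50060 is the usual default):
#   check_http tt01.example.com 50060 10
```

Wired into monit or Nagios, a CRITICAL result can then trigger an automatic restart of the daemon.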