[jira] [Commented] (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools

2014-07-18 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066833#comment-14066833
 ] 

Allen Wittenauer commented on HADOOP-6473:
--

We sort of have this today with the health check being done by YARN.  But that 
really should be expanded to cover HDFS as well.  That's probably a separate 
JIRA from this one, however.
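
For reference, the existing mechanism is the NodeManager's pluggable health-check script. Below is a minimal sketch that just dumps the relevant settings, assuming the Hadoop 2.x-era property names from yarn-default.xml (verify against the version actually deployed):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class PrintNodeHealthCheckConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("yarn-site.xml");  // picked up from the classpath if present

    // Property names as in yarn-default.xml of the Hadoop 2.x era; they may
    // differ in other releases.
    String[] keys = {
        "yarn.nodemanager.health-checker.script.path",
        "yarn.nodemanager.health-checker.script.opts",
        "yarn.nodemanager.health-checker.interval-ms",
        "yarn.nodemanager.health-checker.script.timeout-ms"
    };
    for (String key : keys) {
      System.out.printf("%s = %s%n", key, conf.get(key, "<unset>"));
    }
  }
}
{code}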

> Add hadoop health check/diagnostics to run from command line, JSP pages, 
> other tools
> 
>
> Key: HADOOP-6473
> URL: https://issues.apache.org/jira/browse/HADOOP-6473
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Steve Loughran
> Priority: Minor
>
> If the lifecycle ping() is for short-duration "are we still alive" checks, 
> Hadoop still needs something bigger to check overall system health. This 
> would be for end users, but also for automated cluster deployment: a complete 
> validation of the cluster.
> It could be a command-line tool, and something that runs on different nodes, 
> checked via IPC or JSP. The idea would be to do thorough checks with good 
> diagnostics. Oh, and they should be executable through JUnit too.
> For example:
>  - if running on Windows, check that Cygwin is on the path; fail with a 
> pointer to a wiki page if not
>  - each datanode should check that it can create locks on the filesystem and 
> create files, and that timestamps are (roughly) aligned with local time
>  - namenodes should try to create files/locks in the filesystem
>  - the task tracker should try to exec() something
>  - run through the classpath and look for problems: duplicate JARs, 
> unsupported Java or Xerces versions, etc.
> * The set of tests should be extensible: rather than one single class with 
> all the tests, there'd be something separate for namenode, tasktracker, 
> datanode and jobtracker nodes (a sketch of such a structure follows below)
> * They can't live in the nodes themselves, as they should be executable even 
> if the nodes don't come up
> * Output could be human-readable text or HTML, and a form that could be 
> processed through Hadoop itself in future
> * These tests could have side effects, such as actually trying to submit 
> work to a cluster
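
A hypothetical sketch of that extensible structure; none of these class names exist in Hadoop, and a real tool would register role-specific checks (namenode, datanode, tasktracker, ...) alongside role-independent ones like this:

{code:java}
import java.util.Arrays;
import java.util.List;

/** One self-contained diagnostic; implementations are registered per node role. */
interface DiagnosticCheck {
  String name();
  /** Throws an exception carrying a useful diagnostic message on failure. */
  void run() throws Exception;
}

/** Example role-independent check: the temp dir must be writable. */
class TempDirWritableCheck implements DiagnosticCheck {
  public String name() { return "temp dir writable"; }
  public void run() throws Exception {
    java.io.File f = java.io.File.createTempFile("diag", ".tmp");
    if (!f.delete()) {
      throw new IllegalStateException("could not delete " + f);
    }
  }
}

/** Runner usable from a main() entry point and, equally, from a JUnit test. */
class DiagnosticRunner {
  static int runAll(List<? extends DiagnosticCheck> checks) {
    int failures = 0;
    for (DiagnosticCheck c : checks) {
      try {
        c.run();
        System.out.println("OK   " + c.name());
      } catch (Exception e) {
        failures++;
        System.out.println("FAIL " + c.name() + ": " + e);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    // A real tool would choose the check list based on the node role.
    System.exit(runAll(Arrays.asList(new TempDirWritableCheck())) == 0 ? 0 : 1);
  }
}
{code}

The runner returns a failure count, so the same checks can be driven from a shell exit code or asserted on from JUnit, and the plain-text output could later be swapped for HTML or a machine-readable form, as the bullets above suggest.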



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] Commented: (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools

2010-06-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874521#action_12874521
 ] 

Steve Loughran commented on HADOOP-6473:


I think the checks/entry point should have a production/developer switch, or a 
set of maskable tests (a sketch of such a switch follows the lists below).

Production
* HDFS: secondary namenode defined, hostname must resolve from the NN
* DNS/rDNS must work against specifically defined hosts
* Maybe: stricter requirements about which interfaces services come up on 
(e.g. a valid range of IP addresses for each service)
* log directory space requirements
* temp dir space requirements
A failure/error code should be returned if anything isn't met; also consider 
running these checks on every service startup.

Developer
* allow people to work on laptops with no external network, and to play with 
incomplete clusters
* lower disk space requirements
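
A minimal sketch of such a switch, with invented names (nothing here is existing Hadoop code): each check declares the mode in which its failure becomes fatal, and stricter checks merely warn in developer mode.

{code:java}
/** Hypothetical mode switch; these types do not exist in Hadoop. */
enum DiagMode { DEVELOPER, PRODUCTION }

abstract class ModalCheck {
  /** The least strict mode in which a failure of this check is fatal. */
  abstract DiagMode requiredIn();
  abstract String name();
  abstract void run() throws Exception;

  /** @return false only when the check fails and the mode makes that fatal. */
  boolean evaluate(DiagMode mode) {
    try {
      run();
      return true;
    } catch (Exception e) {
      boolean fatal = mode.compareTo(requiredIn()) >= 0;
      System.out.println((fatal ? "FAIL " : "WARN ") + name() + ": " + e);
      return !fatal;
    }
  }
}

/** Example: "log directory has at least N bytes free", production-only. */
class LogDirSpaceCheck extends ModalCheck {
  private final java.io.File dir;
  private final long minFreeBytes;

  LogDirSpaceCheck(java.io.File dir, long minFreeBytes) {
    this.dir = dir;
    this.minFreeBytes = minFreeBytes;
  }
  DiagMode requiredIn() { return DiagMode.PRODUCTION; }
  String name() { return "free space in " + dir; }
  void run() {
    long free = dir.getUsableSpace();
    if (free < minFreeBytes) {
      throw new IllegalStateException(free + " bytes free, need " + minFreeBytes);
    }
  }
}
{code}

In developer mode a failing space check like this only produces a warning; in production mode the same failure counts against the exit code.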

Logging also raises some questions
* Check/print the log levels of the various services; warn if a service is at 
DEBUG level in production
* Work out which back end commons-logging is running, and print its classname
* Print out the commons-logging, slf4j, jetty and log4j JVM config options
* Print out which log4j.properties/XML file resolves on the classpath (see the 
sketch below)
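
A sketch of two of these probes, assuming commons-logging (and typically log4j) are on the classpath, as they are in a Hadoop installation of this era:

{code:java}
import java.net.URL;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingDiagnostics {
  public static void main(String[] args) {
    // Which log4j configuration file wins on the classpath?
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    for (String name : new String[] {"log4j.properties", "log4j.xml"}) {
      URL url = cl.getResource(name);
      System.out.println(name + " -> " + (url == null ? "not on classpath" : url));
    }
    // The concrete Log implementation class reveals the commons-logging back
    // end (e.g. Log4JLogger, Jdk14Logger, or an SLF4J bridge).
    Log log = LogFactory.getLog(LoggingDiagnostics.class);
    System.out.println("commons-logging back end: " + log.getClass().getName());
  }
}
{code}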


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools

2010-06-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874516#action_12874516
 ] 

Steve Loughran commented on HADOOP-6473:


Some things we could check for, and related bug reports/stack traces, that are 
independent of node/client role:

Network
# Network: IPv4 is in use: HADOOP-6056. Check out system property see if local 
ip address lookup with an IPv6 address ::1 works, see what happens on nslookup 
of 127.0.0.1
# DNS status. At the very least, the local hostname must resolve: HADOOP-3426. 
# Also check rdns and some nslookup of some common external addresses 
(hadoop.apache.org), warn if these are not available, but mention its not 
important if you don't want external network access
# Proxy server settings; print them, if non-null, check (hostname, port) and 
warn if missing
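
A sketch of the hostname/rDNS probes above; hadoop.apache.org is used as the external name per the item above, and the usual JVM IPv4-preference property is printed for reference:

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsDiagnostics {
  public static void main(String[] args) {
    // The usual JVM switch for preferring IPv4 over IPv6.
    System.out.println("java.net.preferIPv4Stack = "
        + System.getProperty("java.net.preferIPv4Stack", "<unset>"));
    try {
      InetAddress local = InetAddress.getLocalHost();
      System.out.println("local hostname: " + local.getHostName()
          + " -> " + local.getHostAddress());
      // Reverse lookup: does the address map back to a name?
      System.out.println("reverse lookup: "
          + InetAddress.getByName(local.getHostAddress()).getCanonicalHostName());
    } catch (UnknownHostException e) {
      System.err.println("FAIL: local hostname does not resolve (cf. HADOOP-3426): " + e);
    }
    try {
      InetAddress ext = InetAddress.getByName("hadoop.apache.org");
      System.out.println("external lookup OK: " + ext.getHostAddress());
    } catch (UnknownHostException e) {
      System.out.println("WARN: external DNS lookup failed; harmless if no "
          + "external network access is wanted");
    }
  }
}
{code}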

Classpath
# Print it out.
# Check for duplicate filenames (with/without version endings?); see the 
sketch below
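
A sketch of the classpath probe; it only flags exact duplicate file names and leaves the version-suffix question open:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClasspathDiagnostics {
  public static void main(String[] args) {
    Map<String, List<String>> byName = new HashMap<>();
    for (String entry : System.getProperty("java.class.path")
                              .split(File.pathSeparator)) {
      System.out.println(entry);
      String name = new File(entry).getName();
      byName.computeIfAbsent(name, k -> new ArrayList<>()).add(entry);
    }
    // A fuller check would also strip version suffixes (foo-1.2.3.jar) to
    // catch the same library appearing at two different versions.
    for (Map.Entry<String, List<String>> e : byName.entrySet()) {
      if (e.getValue().size() > 1) {
        System.out.println("WARN duplicate file name " + e.getKey() + ": " + e.getValue());
      }
    }
  }
}
{code}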

Dependencies
# XML engine name and version
# XML parser supports XInclude: HADOOP-5254
# XSL engine name and version (these probes are sketched below)
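
A sketch of the dependency probes; there is no portable version API, so the concrete factory and parser class names stand in for "engine name and version", and XInclude support is tested via the standard JAXP flag (cf. HADOOP-5254):

{code:java}
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;

public class XmlDiagnostics {
  public static void main(String[] args) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    System.out.println("XML parser factory: " + dbf.getClass().getName());
    System.out.println("DocumentBuilder: "
        + dbf.newDocumentBuilder().getClass().getName());
    try {
      dbf.setXIncludeAware(true);
      System.out.println("XInclude aware: " + dbf.isXIncludeAware());
    } catch (UnsupportedOperationException e) {
      System.out.println("XInclude not supported by this parser factory");
    }
    TransformerFactory tf = TransformerFactory.newInstance();
    System.out.println("XSLT transformer factory: " + tf.getClass().getName());
  }
}
{code}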

Local filesystem
# Space
# Temp dir is writable. Write something, read it back in, verify the checksum, 
and check that the file's timestamp is within a few seconds of the system clock 
(see the sketch below).
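
A sketch of that temp-dir probe, using java.io.tmpdir (via File.createTempFile) rather than Hadoop's configured temp directories:

{code:java}
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.zip.CRC32;

public class TempDirDiagnostics {
  public static void main(String[] args) throws Exception {
    byte[] payload = "hadoop-diagnostics".getBytes("UTF-8");
    CRC32 crc = new CRC32();
    crc.update(payload);
    long expected = crc.getValue();

    File f = File.createTempFile("diag", ".dat");
    try {
      Files.write(f.toPath(), payload);
      byte[] readBack = Files.readAllBytes(f.toPath());
      CRC32 actual = new CRC32();
      actual.update(readBack);
      if (!Arrays.equals(payload, readBack) || actual.getValue() != expected) {
        throw new IllegalStateException("read-back mismatch in " + f);
      }
      long skewMillis = Math.abs(System.currentTimeMillis() - f.lastModified());
      if (skewMillis > 5000) {
        System.out.println("WARN: file timestamp differs from system clock by "
            + skewMillis + " ms");
      } else {
        System.out.println("OK: temp dir writable, checksum and timestamp sane");
      }
    } finally {
      f.delete();
    }
  }
}
{code}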




