[jira] [Commented] (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools
[ https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066833#comment-14066833 ]

Allen Wittenauer commented on HADOOP-6473:
------------------------------------------

We sort of have this today with the health check being done by YARN, but that really should be expanded to cover HDFS as well. That's probably a separate JIRA from this one, however.

> Add hadoop health check/diagnostics to run from command line, JSP pages, other tools
>
>          Key: HADOOP-6473
>          URL: https://issues.apache.org/jira/browse/HADOOP-6473
>      Project: Hadoop Common
>   Issue Type: New Feature
>     Reporter: Steve Loughran
>     Priority: Minor
>
> If the lifecycle ping() is for short-duration "are we still alive" checks, Hadoop still needs something bigger to check the overall system health. This would be for end users, but also for automated cluster deployment: a complete validation of the cluster.
> It could be a command-line tool, and something that runs on different nodes, checked via IPC or JSP. The idea would be to do thorough checks with good diagnostics. Oh, and they should be executable through JUnit too.
> For example:
> - if running on Windows, check that Cygwin is on the path; fail with a pointer to a wiki issue if not
> - datanodes should check that they can create locks on the filesystem, create files, and that timestamps are (roughly) aligned with local time
> - namenodes should try to create files/locks in the filesystem
> - the task tracker should try to exec() something
> - run through the classpath and look for problems: duplicate JARs, unsupported Java and Xerces versions, etc.
> * The set of tests should be extensible: rather than one single class with all the tests, there'd be something separate for name, task, data and job tracker nodes (see the sketch after this message).
> * They can't be in the nodes themselves, as they should be executable even if the nodes don't come up.
> * Output could be in human-readable text or HTML, and a form that could be processed through Hadoop itself in future.
> * These tests could have side effects, such as actually trying to submit work to a cluster.
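A minimal sketch of the extensibility point above, assuming a small pluggable check interface plus a runner: role-specific suites (namenode, datanode, task tracker, ...) would register their own checks rather than everything living in one class. The names here (DiagnosticCheck, CheckResult, Diagnostics) are hypothetical, not existing Hadoop classes.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// One check: a name plus a result carrying human-readable diagnostics.
interface DiagnosticCheck {
  String name();
  CheckResult run();
}

final class CheckResult {
  final boolean passed;
  final String diagnostics;   // explanation, or a pointer to a wiki page
  CheckResult(boolean passed, String diagnostics) {
    this.passed = passed;
    this.diagnostics = diagnostics;
  }
}

public class Diagnostics {
  private final List<DiagnosticCheck> checks = new ArrayList<DiagnosticCheck>();

  // Role-specific suites register their checks here.
  public void register(DiagnosticCheck check) {
    checks.add(check);
  }

  // Run everything and print text output; the same results could feed HTML or JUnit.
  public boolean runAll() {
    boolean allPassed = true;
    for (DiagnosticCheck check : checks) {
      CheckResult result = check.run();
      System.out.println((result.passed ? "PASS " : "FAIL ")
          + check.name() + ": " + result.diagnostics);
      allPassed &= result.passed;
    }
    return allPassed;
  }

  public static void main(String[] args) {
    Diagnostics diags = new Diagnostics();
    // Example check: the local hostname resolves (one of the checks discussed in this issue).
    diags.register(new DiagnosticCheck() {
      public String name() { return "local hostname resolves"; }
      public CheckResult run() {
        try {
          InetAddress addr = InetAddress.getLocalHost();
          return new CheckResult(true, addr.getHostName() + " -> " + addr.getHostAddress());
        } catch (UnknownHostException e) {
          return new CheckResult(false, "local hostname does not resolve: " + e);
        }
      }
    });
    System.exit(diags.runAll() ? 0 : 1);
  }
}
{code}

Running the main method prints one PASS/FAIL line per registered check and exits non-zero on any failure, which keeps the tool usable from the command line and from scripts.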
[jira] Commented: (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools
[ https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874521#action_12874521 ]

Steve Loughran commented on HADOOP-6473:
----------------------------------------

I think the checks/entry point should have a production/developer switch, or a set of maskable tests.

Production
* HDFS: secondary namenode defined, hostname must resolve from the NN
* DNS/rDNS must work against specifically defined hosts
* Maybe: stricter requirements about which interfaces services come up on (e.g. a valid range of IP addresses for each service)
* log directory space requirements
* temp dir space requirements
with a failure/error code if anything isn't met; also consider running these checks on every service startup.

Developer
* allow people to work on laptops with no external network, and to play in incomplete clusters
* lower disk space requirements

Logging also raises some questions (see the sketch after this list):
* Check/print the log levels of the various services; warn if at DEBUG level in production
* Work out which back end commons-logging is running, and print its classname
* Print out the commons-logging, slf4j, jetty and log4j JVM config options
* Print out which log4j.properties/XML file is resolving on the classpath
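A minimal sketch of the logging diagnostics above, for illustration only (the class name LoggingDiagnostics is made up). It prints which log4j.properties resolves on the classpath, which backend commons-logging has bound to, and the relevant JVM config properties when they are set.

{code:java}
import java.net.URL;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingDiagnostics {
  public static void main(String[] args) {
    // Which log4j.properties wins on the classpath (null if none is found).
    URL log4jProps = Thread.currentThread().getContextClassLoader()
        .getResource("log4j.properties");
    System.out.println("log4j.properties resolved to: " + log4jProps);

    // Which concrete Log implementation commons-logging hands back.
    Log log = LogFactory.getLog(LoggingDiagnostics.class);
    System.out.println("commons-logging backend: " + log.getClass().getName());

    // Relevant JVM config properties, printed only when set.
    String[] keys = {
        "org.apache.commons.logging.Log",
        "org.apache.commons.logging.LogFactory",
        "log4j.configuration",
        "log4j.debug"
    };
    for (String key : keys) {
      String value = System.getProperty(key);
      if (value != null) {
        System.out.println(key + "=" + value);
      }
    }
  }
}
{code}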
[jira] Commented: (HADOOP-6473) Add hadoop health check/diagnostics to run from command line, JSP pages, other tools
[ https://issues.apache.org/jira/browse/HADOOP-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874516#action_12874516 ]

Steve Loughran commented on HADOOP-6473:
----------------------------------------

Some things we could check for, with related bug reports/stack traces, that are independent of node/client role:

Network
# IPv4 is in use: HADOOP-6056. Check the relevant system properties, see if local IP address lookup with the IPv6 address ::1 works, and see what happens on an nslookup of 127.0.0.1.
# DNS status. At the very least, the local hostname must resolve: HADOOP-3426.
# Also check rDNS and the nslookup of some common external addresses (hadoop.apache.org); warn if these are not available, but mention that it's not important if you don't want external network access.
# Proxy server settings: print them; if non-null, check the (hostname, port) and warn if missing.

Classpath
# Print it out.
# Check for duplicate filenames (with/without version endings?).

Dependencies
# XML engine name and version.
# XML parser supports XInclude: HADOOP-5254.
# XSL engine name and version.

Local filesystem
# Space.
# Temp dir is writeable. Write something, read it back in, verify the checksum and that the file's timestamp is within a few seconds of the system clock (see the sketch after this list).
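A minimal sketch of the temp-dir check above: write a probe file, read it back, compare checksums, and confirm the file's timestamp roughly matches the system clock. The class name and the 5-second tolerance are illustrative choices, not anything this issue specifies.

{code:java}
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.UUID;

public class TempDirCheck {
  public static void main(String[] args) throws Exception {
    Path tempDir = Paths.get(System.getProperty("java.io.tmpdir"));
    Path probe = tempDir.resolve("hadoop-diag-" + UUID.randomUUID() + ".tmp");
    byte[] payload = "health-check probe".getBytes(StandardCharsets.UTF_8);

    boolean healthy;
    try {
      Files.write(probe, payload);                  // temp dir is writeable
      byte[] readBack = Files.readAllBytes(probe);  // and readable

      // digest() resets the instance, so one MessageDigest can hash both buffers.
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      boolean checksumOk = Arrays.equals(md5.digest(payload), md5.digest(readBack));

      long skewMillis = Math.abs(
          Files.getLastModifiedTime(probe).toMillis() - System.currentTimeMillis());
      boolean clockOk = skewMillis < 5000;          // "within a few seconds"

      System.out.println("checksum match: " + checksumOk
          + ", timestamp skew (ms): " + skewMillis
          + ", clock roughly aligned: " + clockOk);
      healthy = checksumOk && clockOk;
    } finally {
      Files.deleteIfExists(probe);
    }
    System.exit(healthy ? 0 : 1);
  }
}
{code}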