rejig titles of troubleshooting sections
Project: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/commit/3d930107 Tree: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/tree/3d930107 Diff: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/diff/3d930107 Branch: refs/heads/master Commit: 3d930107c6c1dceeb1f41c00f56657d24590f1a7 Parents: 3d29c8b Author: Alex Heneveld <[email protected]> Authored: Fri Jul 24 15:18:51 2015 +0100 Committer: Alex Heneveld <[email protected]> Committed: Tue Jul 28 12:33:56 2015 +0100 ---------------------------------------------------------------------- .../dev/tips/troubleshooting-exceptions.md | 489 ------------------- docs/guide/ops/troubleshooting/connectivity.md | 143 ++++++ docs/guide/ops/troubleshooting/deployment.md | 88 ++++ .../going-deep-in-java-and-logs.md | 488 ++++++++++++++++++ docs/guide/ops/troubleshooting/index.md | 11 +- docs/guide/ops/troubleshooting/overview.md | 116 +++++ .../ops/troubleshooting/softwareprocess.md | 50 ++ .../troubleshooting-connectivity.md | 143 ------ .../troubleshooting-deployment.md | 88 ---- .../troubleshooting-runtime-errors.md | 116 ----- .../troubleshooting-softwareprocess.md | 50 -- 11 files changed, 891 insertions(+), 891 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d930107/docs/guide/dev/tips/troubleshooting-exceptions.md ---------------------------------------------------------------------- diff --git a/docs/guide/dev/tips/troubleshooting-exceptions.md b/docs/guide/dev/tips/troubleshooting-exceptions.md deleted file mode 100644 index fb3d849..0000000 --- a/docs/guide/dev/tips/troubleshooting-exceptions.md +++ /dev/null @@ -1,489 +0,0 @@ ---- -layout: website-normal -title: Troubleshooting Exceptions and Node Failure -toc: /guide/toc.json ---- - -Whether you're customizing out-of-the-box blueprints, or developing your own custom blueprints, you will -inevitably have to deal with node failure, or exceptions being thrown by your node. Thankfully Brooklyn -provides plenty of information to help you locate and resolve any issues you may encounter. - -This guide looks at three common failure scenarios and describes the steps that can be taken to -identify the issue. - -## Script failure -Many blueprints run bash scripts as part of the installation. This section highlights how to identify a problem with -a bash script. - -First let's take a look at the `customize()` method of the Tomcat server blueprint: - -{% highlight java %} - @Override - public void customize() { - newScript(CUSTOMIZING) - .body.append("mkdir -p conf logs webapps temp") - .failOnNonZeroResultCode() - .execute(); - - copyTemplate(entity.getConfig(TomcatServer.SERVER_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "server.xml")); - copyTemplate(entity.getConfig(TomcatServer.WEB_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "web.xml")); - - if (isProtocolEnabled("HTTPS")) { - String keystoreUrl = Preconditions.checkNotNull(getSslKeystoreUrl(), "keystore URL must be specified if using HTTPS for " + entity); - String destinationSslKeystoreFile = getHttpsSslKeystoreFile(); - InputStream keystoreStream = resource.getResourceFromUrl(keystoreUrl); - getMachine().copyTo(keystoreStream, destinationSslKeystoreFile); - } - - getEntity().deployInitialWars(); - } -{% endhighlight %} - -Here we can see that it's running a script to create four directories before continuing with the customization. Let's -introduce an error by changing `mkdir` to `mkrid`: - -{% highlight java %} - newScript(CUSTOMIZING) - .body.append("mkrid -p conf logs webapps temp") // `mkdir` changed to `mkrid` - .failOnNonZeroResultCode() - .execute(); -{% endhighlight %} - -Now let's try deploying this using the following YAML: - -{% highlight yaml %} - -name: Tomcat failure test -location: localhost -services: -- type: brooklyn.entity.webapp.tomcat.TomcatServer - -{% endhighlight %} - -Shortly after deployment, the entity fails with the following error: - -`Failure running task ssh: customizing TomcatServerImpl{id=e1HP2s8x} (HmyPAozV): -Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x}` - -[](images/script-failure-large.png) - -By selecting the `Activities` tab, we can drill into the task that failed. The list of tasks shown (where the -effectors are shown as top-level tasks) are clickable links. Selecting that row will show the details of -that particular task, including its sub-tasks. We can eventually get to the specific sub-task that failed: - -[](images/failed-task-large.png) - -By clicking on the `stderr` link, we can see the script failed with the following error: - -{% highlight console %} -/tmp/brooklyn-20150721-132251052-l4b9-customizing_TomcatServerImpl_i.sh: line 10: mkrid: command not found -{% endhighlight %} - -This tells us *what* went wrong, but doesn't tell us *where*. In order to find that, we'll need to look at the -stack trace that was logged when the exception was thrown. - -It's always worth looking at the Detailed Status section as sometimes this will give you the information you need. -In this case, the stack trace is limited to the thread that was used to execute the task that ran the script: - -{% highlight console %} -Failed after 40ms - -STDERR -/tmp/brooklyn-20150721-132251052-l4b9-customizing_TomcatServerImpl_i.sh: line 10: mkrid: command not found - - -STDOUT -Executed /tmp/brooklyn-20150721-132251052-l4b9-customizing_TomcatServerImpl_i.sh, result 127: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} - -java.lang.IllegalStateException: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} - at brooklyn.entity.basic.lifecycle.ScriptHelper.logWithDetailsAndThrow(ScriptHelper.java:390) - at brooklyn.entity.basic.lifecycle.ScriptHelper.executeInternal(ScriptHelper.java:379) - at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:289) - at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:287) - at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) - at brooklyn.util.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:469) - at java.util.concurrent.FutureTask.run(FutureTask.java:262) - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) -at java.lang.Thread.run(Thread.java:745) -{% endhighlight %} - -In order to find the exception, we'll need to look in Brooklyn's debug log file. By default, the debug log file -is named `brooklyn.debug.log`. Usually the easiest way to navigate the log file is to use `less`, e.g. -`less brooklyn.debug.log`. We can quickly find find the stack trace by first navigating to the end of the log file -with `Shift-G`, then performing a reverse-lookup by typing `?Tomcat` and pressing `Enter`. If searching for the -blueprint type (in this case Tomcat) simply matches tasks unrelated to the exception, you can also search for -the text of the error message, in this case `? invalid result 127`. You can make the search case-insensitivity by -typing `-i` before performing the search. To skip the current match and move to the next on (i.e. 'up' as we're -performing a reverse-lookup), simply press `n` - -In this case, the `?Tomcat` search takes us directly to the full stack trace (Only the last part of the trace -is shown here): - -{% highlight console %} - -at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63) ~[guava-17.0.jar:na] -at brooklyn.util.task.BasicTask.get(BasicTask.java:343) ~[classes/:na] -at brooklyn.util.task.BasicTask.getUnchecked(BasicTask.java:352) ~[classes/:na] -... 9 common frames omitted -Caused by: brooklyn.util.exceptions.PropagatedRuntimeException: -at brooklyn.util.exceptions.Exceptions.propagate(Exceptions.java:97) ~[classes/:na] -at brooklyn.util.task.BasicTask.getUnchecked(BasicTask.java:354) ~[classes/:na] -at brooklyn.entity.basic.lifecycle.ScriptHelper.execute(ScriptHelper.java:339) ~[classes/:na] -at brooklyn.entity.webapp.tomcat.TomcatSshDriver.customize(TomcatSshDriver.java:72) ~[classes/:na] -at brooklyn.entity.basic.AbstractSoftwareProcessDriver$8.run(AbstractSoftwareProcessDriver.java:150) ~[classes/:na] -at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71] -at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) ~[classes/:na] -... 5 common frames omitted -Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} -at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.7.0_71] -at java.util.concurrent.FutureTask.get(FutureTask.java:188) [na:1.7.0_71] -at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63) ~[guava-17.0.jar:na] -at brooklyn.util.task.BasicTask.get(BasicTask.java:343) ~[classes/:na] -at brooklyn.util.task.BasicTask.getUnchecked(BasicTask.java:352) ~[classes/:na] -... 10 common frames omitted -Caused by: java.lang.IllegalStateException: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} -at brooklyn.entity.basic.lifecycle.ScriptHelper.logWithDetailsAndThrow(ScriptHelper.java:390) ~[classes/:na] -at brooklyn.entity.basic.lifecycle.ScriptHelper.executeInternal(ScriptHelper.java:379) ~[classes/:na] -at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:289) ~[classes/:na] -at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:287) ~[classes/:na] -... 6 common frames omitted - -{% endhighlight %} - -Brooklyn's use of tasks and helper classes can make the stack trace a little harder than usual to follow, but a good -place to start is to look through the stack trace for the node's implementation or ssh driver classes (usually -named `FooNodeImpl` or `FooSshDriver`). In this case we can see the following: - -{% highlight console %} -at brooklyn.entity.webapp.tomcat.TomcatSshDriver.customize(TomcatSshDriver.java:72) ~[classes/:na] -{% endhighlight %} - -Combining this with the error message of `mkrid: command not found` we can see that indeed `mkdir` has been -misspelled `mkrid` on line 72 of `TomcatSshDriver.java`. - - -## Non-Script failure -The section above gives an example of a failure that occurs when a script is run. In this section we will look at -a failure in a non-script related part of the code. We'll use the `customize()` method of the Tomcat server again, -but this time, we'll correct the spelling of 'mkdir' and add a line that attempts to copy a nonexistent resource -to the remote server: - -{% highlight java %} - -newScript(CUSTOMIZING) - .body.append("mkdir -p conf logs webapps temp") - .failOnNonZeroResultCode() - .execute(); - -copyTemplate(entity.getConfig(TomcatServer.SERVER_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "server.xml")); -copyTemplate(entity.getConfig(TomcatServer.WEB_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "web.xml")); -copyTemplate("classpath://nonexistent.xml", Os.mergePaths(getRunDir(), "conf", "nonexistent.xml")); // Resource does not exist! - -{% endhighlight %} - -Let's deploy this using the same YAML from above. Here's the resulting error in the Brooklyn debug console: - -[](images/resource-exception-large.png) - -Again, this tells us *what* the error is, but we need to find *where* the code is that attempts to copy this file. In -this case it's shown in the Detailed Status section, and we don't need to go to the log file: - -{% highlight console %} - -Failed after 221ms: Error getting resource 'classpath://nonexistent.xml' for TomcatServerImpl{id=PVZxDKU1}: java.io.IOException: Error accessing classpath://nonexistent.xml: java.io.IOException: nonexistent.xml not found on classpath - -java.lang.RuntimeException: Error getting resource 'classpath://nonexistent.xml' for TomcatServerImpl{id=PVZxDKU1}: java.io.IOException: Error accessing classpath://nonexistent.xml: java.io.IOException: nonexistent.xml not found on classpath - at brooklyn.util.ResourceUtils.getResourceFromUrl(ResourceUtils.java:297) - at brooklyn.util.ResourceUtils.getResourceAsString(ResourceUtils.java:475) - at brooklyn.entity.basic.AbstractSoftwareProcessDriver.getResourceAsString(AbstractSoftwareProcessDriver.java:447) - at brooklyn.entity.basic.AbstractSoftwareProcessDriver.processTemplate(AbstractSoftwareProcessDriver.java:469) - at brooklyn.entity.basic.AbstractSoftwareProcessDriver.copyTemplate(AbstractSoftwareProcessDriver.java:390) - at brooklyn.entity.basic.AbstractSoftwareProcessDriver.copyTemplate(AbstractSoftwareProcessDriver.java:379) - at brooklyn.entity.webapp.tomcat.TomcatSshDriver.customize(TomcatSshDriver.java:79) - at brooklyn.entity.basic.AbstractSoftwareProcessDriver$8.run(AbstractSoftwareProcessDriver.java:150) - at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) - at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) - at brooklyn.util.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:469) - at java.util.concurrent.FutureTask.run(FutureTask.java:262) - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) -at java.lang.Thread.run(Thread.java:745) - Caused by: java.io.IOException: Error accessing classpath://nonexistent.xml: java.io.IOException: nonexistent.xml not found on classpath -at brooklyn.util.ResourceUtils.getResourceFromUrl(ResourceUtils.java:233) - ... 14 more - Caused by: java.io.IOException: nonexistent.xml not found on classpath - at brooklyn.util.ResourceUtils.getResourceViaClasspath(ResourceUtils.java:372) -at brooklyn.util.ResourceUtils.getResourceFromUrl(ResourceUtils.java:230) - ... 14 more - -{% endhighlight %} - -Looking for `Tomcat` in the stack trace, we can see in this case the problem lies at line 79 of `TomcatSshDriver.java` - - -## External failure - -Sometimes an entity will fail outside the direct commands issues by Brooklyn. When installing and launching an entity, -Brooklyn will check the return code of scripts that were run to ensure that they completed successfully (i.e. the -return code of the script is zero). It is possible, for example, that a launch script completes successfully, but -the entity fails to start. - -We can simulate this type of failure by launching Tomcat with an invalid configuration file. As seen in the previous -examples, Brooklyn copies two xml configuration files to the server: `server.xml` and `web.xml` - -The first few non-comment lines of `server.xml` are as follows (you can see the full file [here] -(https://github.com/apache/incubator-brooklyn/blob/master/software/webapp/src/main/resources/brooklyn/entity/webapp/tomcat/server.xml)): - -{% highlight xml %} - -<Server port="${driver.shutdownPort?c}" shutdown="SHUTDOWN"> - <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" /> - <Listener className="org.apache.catalina.core.JasperListener" /> - -{% endhighlight%} - -Let's add an unmatched XML element, which will make this XML file invalid: - -{% highlight xml %} - -<Server port="${driver.shutdownPort?c}" shutdown="SHUTDOWN"> - <unmatched-element> <!-- This is invalid XML as we won't add </unmatched-element> --> - <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" /> - <Listener className="org.apache.catalina.core.JasperListener" /> - -{% endhighlight%} - -As Brooklyn doesn't know how these types of resources are used, they're not validated as they're copied to the remote machine. -As far as Brooklyn is concerned, the file will have copied successfully. - -Let's deploy Tomcat again, using the same YAML as before. This time, the deployment runs for a few minutes before failing -with `Timeout waiting for SERVICE_UP`: - -[](images/external-error-large.png) - -If we drill down into the tasks in the `Activities` tab, we can see that all of the installation and launch tasks -completed successfully, and stdout of the `launch` script is as follows: - -{% highlight console %} - -Executed /tmp/brooklyn-20150721-153049139-fK2U-launching_TomcatServerImpl_id_.sh, result 0 - -{% endhighlight %} - -The task that failed was the `post-start` task, and the stack trace from the Detailed Status section is as follows: - -{% highlight console %} - -Failed after 5m 1s: Timeout waiting for SERVICE_UP from TomcatServerImpl{id=BUHgQeOs} - -java.lang.IllegalStateException: Timeout waiting for SERVICE_UP from TomcatServerImpl{id=BUHgQeOs} - at brooklyn.entity.basic.Entities.waitForServiceUp(Entities.java:1073) - at brooklyn.entity.basic.SoftwareProcessImpl.waitForServiceUp(SoftwareProcessImpl.java:388) - at brooklyn.entity.basic.SoftwareProcessImpl.waitForServiceUp(SoftwareProcessImpl.java:385) - at brooklyn.entity.basic.SoftwareProcessDriverLifecycleEffectorTasks.postStartCustom(SoftwareProcessDriverLifecycleEffectorTasks.java:164) - at brooklyn.entity.software.MachineLifecycleEffectorTasks$7.run(MachineLifecycleEffectorTasks.java:433) - at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) - at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) - at brooklyn.util.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:469) - at java.util.concurrent.FutureTask.run(FutureTask.java:262) - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) -at java.lang.Thread.run(Thread.java:745) - -{% endhighlight %} - -This doesn't really tell us what we need to know, and looking in the `brooklyn.debug.log` file yields no further -clues. The key here is the error message `Timeout waiting for SERVICE_UP`. After running the installation and -launch scripts, assuming all scripts completed successfully, Brooklyn will periodically check the health of the node -and will set the node on fire if the health check does not pass within a pre-prescribed period (the default is -two minutes, and can be configured using the `start.timeout` config key). The periodic health check also continues -after the successful launch in order to check continued operation of the node, but in this case it fails to pass -at all. - -The first thing we need to do is to find out how Brooklyn determines the health of the node. The health-check is -often implemented in the `isRunning()` method in the entity's ssh driver. Tomcat's implementation of `isRunning()` -is as follows: - -{% highlight java %} -@Override -public boolean isRunning() { - return newScript(MutableMap.of(USE_PID_FILE, "pid.txt"), CHECK_RUNNING).execute() == 0; -} -{% endhighlight %} - -The `newScript` method has conveniences for default scripts to check if a process is running based on its PID. In this -case, it will look for Tomcat's PID in the `pid.txt` file and check if the PID is the PID of a running process - -It's worth a quick sanity check at this point to check if the PID file exists, and if the process is running. -By default, the pid file is located in the run directory of the entity. You can find the location of the entity's run -directory by looking at the `run.dir` sensor. In this case it is `/tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs`. -To find the pid, you simply cat the pid.txt file in this directory: - -{% highlight console %} -$ cat /tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs/pid.txt -73714 -{% endhighlight %} - -In this case, the PID in the file is 73714. You can then check if the process is running using `ps`. You can also -pipe the output to `fold` so the full launch command is visible: - -{% highlight console %} - -$ ps -p 73714 | fold -w 120 -PID TTY TIME CMD -73714 ?? 0:08.03 /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/bin/java -Dnop -Djava.util.logg -ing.manager=org.apache.juli.ClassLoaderLogManager -javaagent:/tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BU -HgQeOs/brooklyn-jmxmp-agent-shaded-0.8.0-SNAPSHOT.jar -Xms200m -Xmx800m -XX:MaxPermSize=400m -Dcom.sun.management.jmxrem -ote -Dbrooklyn.jmxmp.rmi-port=1099 -Dbrooklyn.jmxmp.port=31001 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.manage -ment.jmxremote.authenticate=false -Djava.endorsed.dirs=/tmp/brooklyn-martin/installs/TomcatServer_7.0.56/apache-tomcat-7 -.0.56/endorsed -classpath /tmp/brooklyn-martin/installs/TomcatServer_7.0.56/apache-tomcat-7.0.56/bin/bootstrap.jar:/tmp/ -brooklyn-martin/installs/TomcatServer_7.0.56/apache-tomcat-7.0.56/bin/tomcat-juli.jar -Dcatalina.base=/tmp/brooklyn-mart -in/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs -Dcatalina.home=/tmp/brooklyn-martin/installs/TomcatServer_7.0.56/apache --tomcat-7.0.56 -Djava.io.tmpdir=/tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs/temp org.apache.catali -na.startup.Bootstrap start - -{% endhighlight %} - -This confirms that the process is running. The next thing we can look at is the `service.notUp.indicators` sensor. This -reads as follows: - -{% highlight json %} - -{"service.process.isRunning":"The software process for this entity does not appear to be running"} - -{% endhighlight %} - -This confirms that the problem is indeed due to the `service.process.isRunning` sensor. We assumed earlier that this was -set by the `isRunning()` method in `TomcatSshDriver.java`, but this isn't always the case. The `service.process.isRunning` -sensor is wired up by the `connectSensors()` method in the node's implementation class, in this case -`TomcatServerImpl.java`. Tomcat's implementation of `connectSensors()` is as follows: - -{% highlight java %} - -@Override -public void connectSensors() { - super.connectSensors(); - - if (getDriver().isJmxEnabled()) { - String requestProcessorMbeanName = "Catalina:type=GlobalRequestProcessor,name=\"http-*\""; - - Integer port = isHttpsEnabled() ? getAttribute(HTTPS_PORT) : getAttribute(HTTP_PORT); - String connectorMbeanName = format("Catalina:type=Connector,port=%s", port); - - jmxWebFeed = JmxFeed.builder() - .entity(this) - .period(3000, TimeUnit.MILLISECONDS) - .pollAttribute(new JmxAttributePollConfig<Integer>(ERROR_COUNT) - .objectName(requestProcessorMbeanName) - .attributeName("errorCount")) - .pollAttribute(new JmxAttributePollConfig<Integer>(REQUEST_COUNT) - .objectName(requestProcessorMbeanName) - .attributeName("requestCount")) - .pollAttribute(new JmxAttributePollConfig<Integer>(TOTAL_PROCESSING_TIME) - .objectName(requestProcessorMbeanName) - .attributeName("processingTime")) - .pollAttribute(new JmxAttributePollConfig<String>(CONNECTOR_STATUS) - .objectName(connectorMbeanName) - .attributeName("stateName")) - .pollAttribute(new JmxAttributePollConfig<Boolean>(SERVICE_PROCESS_IS_RUNNING) - .objectName(connectorMbeanName) - .attributeName("stateName") - .onSuccess(Functions.forPredicate(Predicates.<Object>equalTo("STARTED"))) - .setOnFailureOrException(false)) - .build(); - - jmxAppFeed = JavaAppUtils.connectMXBeanSensors(this); - } else { - // if not using JMX - LOG.warn("Tomcat running without JMX monitoring; limited visibility of service available"); - connectServiceUpIsRunning(); - } -} - -{% endhighlight %} - -We can see here that if jmx is not enabled, the method will call `connectServiceUpIsRunning()` which will use the -default PID-based method of determining if a process is running. However, as JMX *is* running, the `service.process.isRunning` -sensor (denoted here by the `SERVICE_PROCESS_IS_RUNNING` variable) is set to true if and only if the -`stateName` JMX attribute equals `STARTED`. We can see from the previous call to `.pollAttribute` that this -attribute is also published to the `CONNECTOR_STATUS` sensor. The `CONNECTOR_STATUS` sensor is defined as follows: - -{% highlight java %} - -AttributeSensor<String> CONNECTOR_STATUS = - new BasicAttributeSensor<String>(String.class, "webapp.tomcat.connectorStatus", "Catalina connector state name"); - -{% endhighlight %} - -Let's go back to the Brooklyn debug console and look for the `webapp.tomcat.connectorStatus`: - -[](images/jmx-sensors-large.png) - -As the sensor is not shown, it's likely that it's simply null or not set. We can check this by clicking -the `Show/hide empty records` icon (highlighted in yellow above): - -[](images/jmx-sensors-all-large.png) - -We know from previous steps that the installation and launch scripts completed, and we know the procecess is running, -but we can see here that the server is not responding to JMX requests. A good thing to check here would be that the -JMX port is not being blocked by iptables, firewalls or security groups - (see the (troubleshooting connectivity guide)[troubleshooting-connectivity.html]). -Let's assume that we've checked that and they're all open. There is still one more thing that Brooklyn can tell us. - -Still on the `Sensors` tab, let's take a look at the `log.location` sensor: - -{% highlight console %} - -/tmp/brooklyn-martin/apps/c3bmrlC3/entities/TomcatServer_C1TAjYia/logs/catalina.out - -{% endhighlight %} - -This is the location of Tomcat's own log file. The location of the log file will differ from process to process -and when writing a custom entity you will need to check the software's own documentation. If your blueprint's -ssh driver extends `JavaSoftwareProcessSshDriver`, the value returned by the `getLogFileLocation()` method will -automatically be published to the `log.location` sensor. Otherwise, you can publish the value yourself by calling -`entity.setAttribute(Attributes.LOG_FILE_LOCATION, getLogFileLocation());` in your ssh driver - -**Note:** The log file will be on the server to which you have deployed Tomcat, and not on the Brooklyn server. -Let's take a look in the log file: - -{% highlight console %} - -less /tmp/brooklyn-martin/apps/c3bmrlC3/entities/TomcatServer_C1TAjYia/logs/catalina.out - -Jul 21, 2015 4:12:12 PM org.apache.tomcat.util.digester.Digester fatalError -SEVERE: Parse Fatal Error at line 143 column 3: The element type "unmatched-element" must be terminated by the matching end-tag "</unmatched-element>". - org.xml.sax.SAXParseException; systemId: file:/tmp/brooklyn-martin/apps/c3bmrlC3/entities/TomcatServer_C1TAjYia/conf/server.xml; lineNumber: 143; columnNumber: 3; The element type "unmatched-element" must be terminated by the matching end-tag "</unmatched-element>". - at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) - at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) - at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441) - at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368) - at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1437) - at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1749) - at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973) - at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606) - at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) - at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848) - at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777) - at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) - at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) - at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649) - at org.apache.tomcat.util.digester.Digester.parse(Digester.java:1561) - at org.apache.catalina.startup.Catalina.load(Catalina.java:615) - at org.apache.catalina.startup.Catalina.start(Catalina.java:677) - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) - at java.lang.reflect.Method.invoke(Method.java:497) - at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:321) -at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:455) - - Jul 21, 2015 4:12:12 PM org.apache.catalina.startup.Catalina load - WARNING: Catalina.start using conf/server.xml: The element type "unmatched-element" must be terminated by the matching end-tag "</unmatched-element>". - Jul 21, 2015 4:12:12 PM org.apache.catalina.startup.Catalina start - SEVERE: Cannot start server. Server instance is not configured. - -{% endhighlight %} - -As expected, we can see here that the `unmatched-element` element has not been terminated in the `server.xml` file http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d930107/docs/guide/ops/troubleshooting/connectivity.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/connectivity.md b/docs/guide/ops/troubleshooting/connectivity.md new file mode 100644 index 0000000..07874c0 --- /dev/null +++ b/docs/guide/ops/troubleshooting/connectivity.md @@ -0,0 +1,143 @@ +--- +layout: website-normal +title: Troubleshooting Server Connectivity Issues in the Cloud +toc: /guide/toc.json +--- + +A common problem when setting up an application in the cloud is getting the basic connectivity right - how +do I get my service (e.g. a TCP host:port) publicly accessible over the internet? + +This varies a lot - e.g. Is the VM public or in a private network? Is the service only accessible through +a load balancer? Should the service be globally reachable or only to a particular CIDR? + +This guide gives some general tips for debugging connectivity issues, which are applicable to a +range of different service types. Choose those that are appropriate for your use-case. + +## VM reachable +If the VM is supposed to be accessible directly (e.g. from the public internet, or if in a private network +then from a jump host)... + +### ping +Can you `ping` the VM from the machine you are trying to reach it from? + +However, ping is over ICMP. If the VM is unreachable, it could be that the firewall forbids ICMP but still +lets TCP traffic through. + +### telnet to TCP port +You can check if a given TCP port is reachable and listening using `telnet <host> <port>`, such as +`telnet www.google.com 80`, which gives output like: + +``` + Trying 31.55.163.219... + Connected to www.google.com. + Escape character is '^]'. +``` + +If this is very slow to respond, it can be caused by a firewall blocking access. If it is fast, it could +be that the server is just not listening on that port. + +### DNS and routing +If using a hostname rather than IP, then is it resolving to a sensible IP? + +Is the route to the server sensible? (e.g. one can hit problems with proxy servers in a corporate +network, or ISPs returning a default result for unknown hosts). + +The following commands can be useful: + +* `host` is a DNS lookup utility. e.g. `host www.google.com`. +* `dig` stands for "domain information groper". e.g. `dig www.google.com`. +* `traceroute` prints the route that packets take to a network host. e.g. `traceroute www.google.com`. + +## Service is listening + +### Service responds +Try connecting to the service from the VM itself. For example, `curl http://localhost:8080` for a +web-service. + +On dev/test VMs, don't be afraid to install the utilities you need such as `curl`, `telnet`, `nc`, +etc. Cloud VMs often have a very cut-down set of packages installed. For example, execute +`sudo apt-get update; sudo apt-get install -y curl` or `sudo yum install -y curl`. + +### Listening on port +Check that the service is listening on the port, and on the correct NIC(s). + +Execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports in use (or use +`-anup` for UDP). You should expect to see the something like the output below for a service. + +``` +Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name +tcp 0 0 :::8080 :::* LISTEN 8276/java +``` + +In this case a Java process with pid 8276 is listening on port 8080. The local address `:::8080` +format means all NICs (in IPv6 address format). You may also see `0.0.0.0:8080` for IPv4 format. +If it says 127.0.0.1:8080 then your service will most likely not be reachable externally. + +Use `ip addr show` (or the obsolete `ifconfig -a`) to see the network interfaces on your server. + +For `netstat`, run with `sudo` to see the pid for all listed ports. + +## Firewalls +On Linux, check if `iptables` is preventing the remote connection. On Windows, check the Windows Firewall. + +If it is acceptable (e.g. it is not a server in production), try turning off the firewall temporarily, +and testing connectivity again. Remember to re-enable it afterwards! On CentOS, this is `sudo service +iptables stop`. On Ubuntu, use `sudo ufw disable`. On Windows, press the Windows key and type 'Windows +Firewall with Advanced Security' to open the firewall tools, then click 'Windows Firewall Properties' +and set the firewall state to 'Off' in the Domain, Public and Private profiles. + +If you cannot temporarily turn off the firewall, then look carefully at the firewall settings. For +example, execute `sudo iptables -n --list` and `iptables -t nat -n --list`. + +## Cloud firewalls +Some clouds offer a firewall service, where ports need to be explicitly listed to be reachable. + +For example, [security groups for EC2-classic] +(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#ec2-classic-security-groups) +have rules for the protocols and ports to be reachable from specific CIDRs. + +Check these settings via the cloud provider's web-console (or API). + +## Quick test of a listener port +It can be useful to start listening on a given port, and to then check if that port is reachable. +This is useful for testing basic connectivity when your service is not yet running, or to a +different port to compare behaviour, or to compare with another VM in the network. + +The `nc` netcat tool is useful for this. For example, `nc -l 0.0.0.0 8080` will listen on port +TCP 8080 on all network interfaces. On another server, you can then run `echo hello from client +| nc <hostname> 8080`. If all works well, this will send "hello from client" over the TCP port 8080, +which will be written out by the `nc -l` process before exiting. + +Similarly for UDP, you use `-lU`. + +You may first have to install `nc`, e.g. with `sudo yum install -y nc` or `sudo apt-get install netcat`. + +### Cloud load balancers +For some use-cases, it is good practice to use the load balancer service offered by the cloud provider +(e.g. [ELB in AWS](http://aws.amazon.com/elasticloadbalancing/) or the [Cloudstack Load Balancer] +(http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/latest/network_setup.html#management-server-load-balancing)) + +The VMs can all be isolated within a private network, with access only through the load balancer service. + +Debugging techniques here include ensuring connectivity from another jump server within the private +network, and careful checking of the load-balancer configuration from the Cloud Provider's web-console. + +### DNAT +Use of DNAT is appropriate for some use-cases, where a particular port on a particular VM is to be +made available. + +Debugging connectivity issues here is similar to the steps for a cloud load balancer. Ensure +connectivity from another jump server within the private network. Carefully check the NAT rules from +the Cloud Provider's web-console. + +### Guest wifi +It is common for guest wifi to restrict access to only specific ports (e.g. 80 and 443, restricting +ssh over port 22 etc). + +Normally your best bet is then to abandon the guest wifi (e.g. to tether to a mobile phone instead). + +There are some unconventional workarounds such as [configuring sshd to listen on port 80 so you can +use an ssh tunnel](http://askubuntu.com/questions/107173/is-it-possible-to-ssh-through-port-80). +However, the firewall may well inspect traffic so sending non-http traffic over port 80 may still fail. + + http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d930107/docs/guide/ops/troubleshooting/deployment.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/deployment.md b/docs/guide/ops/troubleshooting/deployment.md new file mode 100644 index 0000000..c343762 --- /dev/null +++ b/docs/guide/ops/troubleshooting/deployment.md @@ -0,0 +1,88 @@ +--- +layout: website-normal +title: Troubleshooting Deployment +toc: /guide/toc.json +--- + +This guide describes common problems encountered when deploying applications. + + +## YAML deployment errors + +The error `Invalid YAML: Plan not in acceptable format: Cannot convert ...` means that the text is not +valid YAML. Common reasons include that the indentation is incorrect, or that there are non-matching +brackets. + +The error `Unrecognized application blueprint format: no services defined` means that the `services:` +section is missing. + +An error like `Deployment plan item io.brooklyn.camp.spi.pdp.Service@23c159e2[name=<null>,description=<null>,serviceType=com.acme.Foo,characteristics=[],customAttributes={}] cannot be matched` means that the given entity type (in this case com.acme.Foo) is not in the catalog or on the classpath. + +An error like `Illegal parameter for 'location' (aws-ec3); not resolvable: java.util.NoSuchElementException: Unknown location 'aws-ec3': either this location is not recognised or there is a problem with location resolver configuration` means that the given location (in this case aws-ec3) +was unknown. This means it does not match any of the named locations in brooklyn.properties, nor any of the +clouds enabled in the jclouds support, nor any of the locations added dynamically through the catalog API. + + +## VM Provisioning Failures + +There are many stages at which VM provisioning can fail! An error `Failure running task provisioning` +means there was some problem obtaining or connecting to the machine. + +An error like `... Not authorized to access cloud ...` usually means the wrong identity/credential was used. + +An error like `Unable to match required VM template constraints` means that a matching image (e.g. AMI in AWS terminology) could not be found. This +could be because an incorrect explicit image id was supplied, or because the match-criteria could not +be satisfied using the given images available in the given cloud. The first time this error is +encountered, a listing of all images in that cloud/region will be written to the debug log. + +Failure to form an ssh connection to the newly provisioned VM can be reported in several different ways, +depending on the nature of the error. This breaks down into failures at different points: + +* Failure to reach the ssh port (e.g. `... could not connect to any ip address port 22 on node ...`). +* Failure to do the very initial ssh login (e.g. `... Exhausted available authentication methods ...`). +* Failure to ssh using the newly created user. + +There are many possible reasons for this ssh failure, which include: + +* The VM was "dead on arrival" (DOA) - sometimes a cloud will return an unusable VM. One can work around + this using the `machineCreateAttempts` configuration option, to automatically retry with a new VM. +* Local network restrictions. On some guest wifis, external access to port 22 is forbidden. + Check by manually trying to reach port 22 on a different machine that you have access it. +* NAT rules not set up correctly. On some clouds that have only private IPs, Brooklyn can automatically + create NAT rules to provide access to port 22. If this NAT rule creation fails for some reason, + then Brooklyn will not be able to reach the VM. If NAT rules are being created for your cloud, then + check the logs for warnings or errors about the NAT rule creation. +* ssh credentials incorrectly configured. The Brooklyn configuration is very flexible in how ssh + credentials can be configured. However, if a more advanced configuration is used incorrectly (e.g. + the wrong login user, or invalid ssh keys) then this will fail. +* Wrong login user. The initial login user to use when first logging into the new VM is inferred from + the metadata provided by the cloud provider about that image. This can sometimes be incomplete, so + the wrong user may be used. This can be explicitly set using the `loginUser` configuration option. + An example of this is with some Ubuntu VMs, where the "ubuntu" user should be used. However, on some clouds + it defaults to trying to ssh as "root". +* Bad choice of user. By default, Brooklyn will create a user with the same name as the user running the + Brooklyn process; the choice of user name is configurable. If this user already exists on the machine, + then the user setup will not behave as expected. Subsequent attempts to ssh using this user could then fail. +* Custom credentials on the VM. Most clouds will automatically set the ssh login details (e.g. in AWS using + the key-pair, or in CloudStack by auto-generating a password). However, with some custom images the VM + will have hard-coded credentials that must be used. If Brooklyn's configuration does not match that, + then it will fail. +* Guest customisation by the cloud. On some clouds (e.g. vCloud Air), the VM can be configured to do + guest customisation immediately after the VM starts. This can include changing the root password. + If Brooklyn is not configured with the expected changed password, then the VM provisioning may fail + (depending if Brooklyn connects before or after the password is changed!). + +A very useful debug configuration is to set `destroyOnFailure` to false. This will allow ssh failures to +be more easily investigated. + + +## Timeout Waiting For Service-Up + +A common generic error message is that there was a timeout waiting for service-up. + +This just means that the entity did not get to service-up in the pre-defined time period (the default is +two minutes, and can be configured using the `start.timeout` config key; the timer begins after the +start tasks are completed). + +See the guide on [runtime errors](troubleshooting-runtime-errors.html) for where to find additional information, especially the section on +"Entity's Error Status". http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d930107/docs/guide/ops/troubleshooting/going-deep-in-java-and-logs.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/going-deep-in-java-and-logs.md b/docs/guide/ops/troubleshooting/going-deep-in-java-and-logs.md new file mode 100644 index 0000000..613a15a --- /dev/null +++ b/docs/guide/ops/troubleshooting/going-deep-in-java-and-logs.md @@ -0,0 +1,488 @@ +--- +layout: website-normal +title: "Troubleshooting: Going Deep in Java and Logs" +toc: /guide/toc.json +--- + +This guide takes a deep look at the Java and log messages for some failure scenarios, +giving common steps used to identify the issues. + +## Script Failure + +Many blueprints run bash scripts as part of the installation. This section highlights how to identify a problem with +a bash script. + +First let's take a look at the `customize()` method of the Tomcat server blueprint: + +{% highlight java %} + @Override + public void customize() { + newScript(CUSTOMIZING) + .body.append("mkdir -p conf logs webapps temp") + .failOnNonZeroResultCode() + .execute(); + + copyTemplate(entity.getConfig(TomcatServer.SERVER_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "server.xml")); + copyTemplate(entity.getConfig(TomcatServer.WEB_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "web.xml")); + + if (isProtocolEnabled("HTTPS")) { + String keystoreUrl = Preconditions.checkNotNull(getSslKeystoreUrl(), "keystore URL must be specified if using HTTPS for " + entity); + String destinationSslKeystoreFile = getHttpsSslKeystoreFile(); + InputStream keystoreStream = resource.getResourceFromUrl(keystoreUrl); + getMachine().copyTo(keystoreStream, destinationSslKeystoreFile); + } + + getEntity().deployInitialWars(); + } +{% endhighlight %} + +Here we can see that it's running a script to create four directories before continuing with the customization. Let's +introduce an error by changing `mkdir` to `mkrid`: + +{% highlight java %} + newScript(CUSTOMIZING) + .body.append("mkrid -p conf logs webapps temp") // `mkdir` changed to `mkrid` + .failOnNonZeroResultCode() + .execute(); +{% endhighlight %} + +Now let's try deploying this using the following YAML: + +{% highlight yaml %} + +name: Tomcat failure test +location: localhost +services: +- type: brooklyn.entity.webapp.tomcat.TomcatServer + +{% endhighlight %} + +Shortly after deployment, the entity fails with the following error: + +`Failure running task ssh: customizing TomcatServerImpl{id=e1HP2s8x} (HmyPAozV): +Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x}` + +[](images/script-failure-large.png) + +By selecting the `Activities` tab, we can drill into the task that failed. The list of tasks shown (where the +effectors are shown as top-level tasks) are clickable links. Selecting that row will show the details of +that particular task, including its sub-tasks. We can eventually get to the specific sub-task that failed: + +[](images/failed-task-large.png) + +By clicking on the `stderr` link, we can see the script failed with the following error: + +{% highlight console %} +/tmp/brooklyn-20150721-132251052-l4b9-customizing_TomcatServerImpl_i.sh: line 10: mkrid: command not found +{% endhighlight %} + +This tells us *what* went wrong, but doesn't tell us *where*. In order to find that, we'll need to look at the +stack trace that was logged when the exception was thrown. + +It's always worth looking at the Detailed Status section as sometimes this will give you the information you need. +In this case, the stack trace is limited to the thread that was used to execute the task that ran the script: + +{% highlight console %} +Failed after 40ms + +STDERR +/tmp/brooklyn-20150721-132251052-l4b9-customizing_TomcatServerImpl_i.sh: line 10: mkrid: command not found + + +STDOUT +Executed /tmp/brooklyn-20150721-132251052-l4b9-customizing_TomcatServerImpl_i.sh, result 127: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} + +java.lang.IllegalStateException: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} + at brooklyn.entity.basic.lifecycle.ScriptHelper.logWithDetailsAndThrow(ScriptHelper.java:390) + at brooklyn.entity.basic.lifecycle.ScriptHelper.executeInternal(ScriptHelper.java:379) + at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:289) + at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:287) + at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) + at brooklyn.util.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:469) + at java.util.concurrent.FutureTask.run(FutureTask.java:262) + at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) + at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) +at java.lang.Thread.run(Thread.java:745) +{% endhighlight %} + +In order to find the exception, we'll need to look in Brooklyn's debug log file. By default, the debug log file +is named `brooklyn.debug.log`. Usually the easiest way to navigate the log file is to use `less`, e.g. +`less brooklyn.debug.log`. We can quickly find find the stack trace by first navigating to the end of the log file +with `Shift-G`, then performing a reverse-lookup by typing `?Tomcat` and pressing `Enter`. If searching for the +blueprint type (in this case Tomcat) simply matches tasks unrelated to the exception, you can also search for +the text of the error message, in this case `? invalid result 127`. You can make the search case-insensitivity by +typing `-i` before performing the search. To skip the current match and move to the next on (i.e. 'up' as we're +performing a reverse-lookup), simply press `n` + +In this case, the `?Tomcat` search takes us directly to the full stack trace (Only the last part of the trace +is shown here): + +{% highlight console %} + +at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63) ~[guava-17.0.jar:na] +at brooklyn.util.task.BasicTask.get(BasicTask.java:343) ~[classes/:na] +at brooklyn.util.task.BasicTask.getUnchecked(BasicTask.java:352) ~[classes/:na] +... 9 common frames omitted +Caused by: brooklyn.util.exceptions.PropagatedRuntimeException: +at brooklyn.util.exceptions.Exceptions.propagate(Exceptions.java:97) ~[classes/:na] +at brooklyn.util.task.BasicTask.getUnchecked(BasicTask.java:354) ~[classes/:na] +at brooklyn.entity.basic.lifecycle.ScriptHelper.execute(ScriptHelper.java:339) ~[classes/:na] +at brooklyn.entity.webapp.tomcat.TomcatSshDriver.customize(TomcatSshDriver.java:72) ~[classes/:na] +at brooklyn.entity.basic.AbstractSoftwareProcessDriver$8.run(AbstractSoftwareProcessDriver.java:150) ~[classes/:na] +at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71] +at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) ~[classes/:na] +... 5 common frames omitted +Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} +at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.7.0_71] +at java.util.concurrent.FutureTask.get(FutureTask.java:188) [na:1.7.0_71] +at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63) ~[guava-17.0.jar:na] +at brooklyn.util.task.BasicTask.get(BasicTask.java:343) ~[classes/:na] +at brooklyn.util.task.BasicTask.getUnchecked(BasicTask.java:352) ~[classes/:na] +... 10 common frames omitted +Caused by: java.lang.IllegalStateException: Execution failed, invalid result 127 for customizing TomcatServerImpl{id=e1HP2s8x} +at brooklyn.entity.basic.lifecycle.ScriptHelper.logWithDetailsAndThrow(ScriptHelper.java:390) ~[classes/:na] +at brooklyn.entity.basic.lifecycle.ScriptHelper.executeInternal(ScriptHelper.java:379) ~[classes/:na] +at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:289) ~[classes/:na] +at brooklyn.entity.basic.lifecycle.ScriptHelper$8.call(ScriptHelper.java:287) ~[classes/:na] +... 6 common frames omitted + +{% endhighlight %} + +Brooklyn's use of tasks and helper classes can make the stack trace a little harder than usual to follow, but a good +place to start is to look through the stack trace for the node's implementation or ssh driver classes (usually +named `FooNodeImpl` or `FooSshDriver`). In this case we can see the following: + +{% highlight console %} +at brooklyn.entity.webapp.tomcat.TomcatSshDriver.customize(TomcatSshDriver.java:72) ~[classes/:na] +{% endhighlight %} + +Combining this with the error message of `mkrid: command not found` we can see that indeed `mkdir` has been +misspelled `mkrid` on line 72 of `TomcatSshDriver.java`. + + +## Non-Script Failure + +The section above gives an example of a failure that occurs when a script is run. In this section we will look at +a failure in a non-script related part of the code. We'll use the `customize()` method of the Tomcat server again, +but this time, we'll correct the spelling of 'mkdir' and add a line that attempts to copy a nonexistent resource +to the remote server: + +{% highlight java %} + +newScript(CUSTOMIZING) + .body.append("mkdir -p conf logs webapps temp") + .failOnNonZeroResultCode() + .execute(); + +copyTemplate(entity.getConfig(TomcatServer.SERVER_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "server.xml")); +copyTemplate(entity.getConfig(TomcatServer.WEB_XML_RESOURCE), Os.mergePaths(getRunDir(), "conf", "web.xml")); +copyTemplate("classpath://nonexistent.xml", Os.mergePaths(getRunDir(), "conf", "nonexistent.xml")); // Resource does not exist! + +{% endhighlight %} + +Let's deploy this using the same YAML from above. Here's the resulting error in the Brooklyn debug console: + +[](images/resource-exception-large.png) + +Again, this tells us *what* the error is, but we need to find *where* the code is that attempts to copy this file. In +this case it's shown in the Detailed Status section, and we don't need to go to the log file: + +{% highlight console %} + +Failed after 221ms: Error getting resource 'classpath://nonexistent.xml' for TomcatServerImpl{id=PVZxDKU1}: java.io.IOException: Error accessing classpath://nonexistent.xml: java.io.IOException: nonexistent.xml not found on classpath + +java.lang.RuntimeException: Error getting resource 'classpath://nonexistent.xml' for TomcatServerImpl{id=PVZxDKU1}: java.io.IOException: Error accessing classpath://nonexistent.xml: java.io.IOException: nonexistent.xml not found on classpath + at brooklyn.util.ResourceUtils.getResourceFromUrl(ResourceUtils.java:297) + at brooklyn.util.ResourceUtils.getResourceAsString(ResourceUtils.java:475) + at brooklyn.entity.basic.AbstractSoftwareProcessDriver.getResourceAsString(AbstractSoftwareProcessDriver.java:447) + at brooklyn.entity.basic.AbstractSoftwareProcessDriver.processTemplate(AbstractSoftwareProcessDriver.java:469) + at brooklyn.entity.basic.AbstractSoftwareProcessDriver.copyTemplate(AbstractSoftwareProcessDriver.java:390) + at brooklyn.entity.basic.AbstractSoftwareProcessDriver.copyTemplate(AbstractSoftwareProcessDriver.java:379) + at brooklyn.entity.webapp.tomcat.TomcatSshDriver.customize(TomcatSshDriver.java:79) + at brooklyn.entity.basic.AbstractSoftwareProcessDriver$8.run(AbstractSoftwareProcessDriver.java:150) + at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) + at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) + at brooklyn.util.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:469) + at java.util.concurrent.FutureTask.run(FutureTask.java:262) + at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) + at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) +at java.lang.Thread.run(Thread.java:745) + Caused by: java.io.IOException: Error accessing classpath://nonexistent.xml: java.io.IOException: nonexistent.xml not found on classpath +at brooklyn.util.ResourceUtils.getResourceFromUrl(ResourceUtils.java:233) + ... 14 more + Caused by: java.io.IOException: nonexistent.xml not found on classpath + at brooklyn.util.ResourceUtils.getResourceViaClasspath(ResourceUtils.java:372) +at brooklyn.util.ResourceUtils.getResourceFromUrl(ResourceUtils.java:230) + ... 14 more + +{% endhighlight %} + +Looking for `Tomcat` in the stack trace, we can see in this case the problem lies at line 79 of `TomcatSshDriver.java` + + +## External Failure + +Sometimes an entity will fail outside the direct commands issues by Brooklyn. When installing and launching an entity, +Brooklyn will check the return code of scripts that were run to ensure that they completed successfully (i.e. the +return code of the script is zero). It is possible, for example, that a launch script completes successfully, but +the entity fails to start. + +We can simulate this type of failure by launching Tomcat with an invalid configuration file. As seen in the previous +examples, Brooklyn copies two xml configuration files to the server: `server.xml` and `web.xml` + +The first few non-comment lines of `server.xml` are as follows (you can see the full file [here] +(https://github.com/apache/incubator-brooklyn/blob/master/software/webapp/src/main/resources/brooklyn/entity/webapp/tomcat/server.xml)): + +{% highlight xml %} + +<Server port="${driver.shutdownPort?c}" shutdown="SHUTDOWN"> + <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" /> + <Listener className="org.apache.catalina.core.JasperListener" /> + +{% endhighlight%} + +Let's add an unmatched XML element, which will make this XML file invalid: + +{% highlight xml %} + +<Server port="${driver.shutdownPort?c}" shutdown="SHUTDOWN"> + <unmatched-element> <!-- This is invalid XML as we won't add </unmatched-element> --> + <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" /> + <Listener className="org.apache.catalina.core.JasperListener" /> + +{% endhighlight%} + +As Brooklyn doesn't know how these types of resources are used, they're not validated as they're copied to the remote machine. +As far as Brooklyn is concerned, the file will have copied successfully. + +Let's deploy Tomcat again, using the same YAML as before. This time, the deployment runs for a few minutes before failing +with `Timeout waiting for SERVICE_UP`: + +[](images/external-error-large.png) + +If we drill down into the tasks in the `Activities` tab, we can see that all of the installation and launch tasks +completed successfully, and stdout of the `launch` script is as follows: + +{% highlight console %} + +Executed /tmp/brooklyn-20150721-153049139-fK2U-launching_TomcatServerImpl_id_.sh, result 0 + +{% endhighlight %} + +The task that failed was the `post-start` task, and the stack trace from the Detailed Status section is as follows: + +{% highlight console %} + +Failed after 5m 1s: Timeout waiting for SERVICE_UP from TomcatServerImpl{id=BUHgQeOs} + +java.lang.IllegalStateException: Timeout waiting for SERVICE_UP from TomcatServerImpl{id=BUHgQeOs} + at brooklyn.entity.basic.Entities.waitForServiceUp(Entities.java:1073) + at brooklyn.entity.basic.SoftwareProcessImpl.waitForServiceUp(SoftwareProcessImpl.java:388) + at brooklyn.entity.basic.SoftwareProcessImpl.waitForServiceUp(SoftwareProcessImpl.java:385) + at brooklyn.entity.basic.SoftwareProcessDriverLifecycleEffectorTasks.postStartCustom(SoftwareProcessDriverLifecycleEffectorTasks.java:164) + at brooklyn.entity.software.MachineLifecycleEffectorTasks$7.run(MachineLifecycleEffectorTasks.java:433) + at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) + at brooklyn.util.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:343) + at brooklyn.util.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:469) + at java.util.concurrent.FutureTask.run(FutureTask.java:262) + at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) + at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) +at java.lang.Thread.run(Thread.java:745) + +{% endhighlight %} + +This doesn't really tell us what we need to know, and looking in the `brooklyn.debug.log` file yields no further +clues. The key here is the error message `Timeout waiting for SERVICE_UP`. After running the installation and +launch scripts, assuming all scripts completed successfully, Brooklyn will periodically check the health of the node +and will set the node on fire if the health check does not pass within a pre-prescribed period (the default is +two minutes, and can be configured using the `start.timeout` config key). The periodic health check also continues +after the successful launch in order to check continued operation of the node, but in this case it fails to pass +at all. + +The first thing we need to do is to find out how Brooklyn determines the health of the node. The health-check is +often implemented in the `isRunning()` method in the entity's ssh driver. Tomcat's implementation of `isRunning()` +is as follows: + +{% highlight java %} +@Override +public boolean isRunning() { + return newScript(MutableMap.of(USE_PID_FILE, "pid.txt"), CHECK_RUNNING).execute() == 0; +} +{% endhighlight %} + +The `newScript` method has conveniences for default scripts to check if a process is running based on its PID. In this +case, it will look for Tomcat's PID in the `pid.txt` file and check if the PID is the PID of a running process + +It's worth a quick sanity check at this point to check if the PID file exists, and if the process is running. +By default, the pid file is located in the run directory of the entity. You can find the location of the entity's run +directory by looking at the `run.dir` sensor. In this case it is `/tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs`. +To find the pid, you simply cat the pid.txt file in this directory: + +{% highlight console %} +$ cat /tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs/pid.txt +73714 +{% endhighlight %} + +In this case, the PID in the file is 73714. You can then check if the process is running using `ps`. You can also +pipe the output to `fold` so the full launch command is visible: + +{% highlight console %} + +$ ps -p 73714 | fold -w 120 +PID TTY TIME CMD +73714 ?? 0:08.03 /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/bin/java -Dnop -Djava.util.logg +ing.manager=org.apache.juli.ClassLoaderLogManager -javaagent:/tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BU +HgQeOs/brooklyn-jmxmp-agent-shaded-0.8.0-SNAPSHOT.jar -Xms200m -Xmx800m -XX:MaxPermSize=400m -Dcom.sun.management.jmxrem +ote -Dbrooklyn.jmxmp.rmi-port=1099 -Dbrooklyn.jmxmp.port=31001 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.manage +ment.jmxremote.authenticate=false -Djava.endorsed.dirs=/tmp/brooklyn-martin/installs/TomcatServer_7.0.56/apache-tomcat-7 +.0.56/endorsed -classpath /tmp/brooklyn-martin/installs/TomcatServer_7.0.56/apache-tomcat-7.0.56/bin/bootstrap.jar:/tmp/ +brooklyn-martin/installs/TomcatServer_7.0.56/apache-tomcat-7.0.56/bin/tomcat-juli.jar -Dcatalina.base=/tmp/brooklyn-mart +in/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs -Dcatalina.home=/tmp/brooklyn-martin/installs/TomcatServer_7.0.56/apache +-tomcat-7.0.56 -Djava.io.tmpdir=/tmp/brooklyn-martin/apps/jIzIHXtP/entities/TomcatServer_BUHgQeOs/temp org.apache.catali +na.startup.Bootstrap start + +{% endhighlight %} + +This confirms that the process is running. The next thing we can look at is the `service.notUp.indicators` sensor. This +reads as follows: + +{% highlight json %} + +{"service.process.isRunning":"The software process for this entity does not appear to be running"} + +{% endhighlight %} + +This confirms that the problem is indeed due to the `service.process.isRunning` sensor. We assumed earlier that this was +set by the `isRunning()` method in `TomcatSshDriver.java`, but this isn't always the case. The `service.process.isRunning` +sensor is wired up by the `connectSensors()` method in the node's implementation class, in this case +`TomcatServerImpl.java`. Tomcat's implementation of `connectSensors()` is as follows: + +{% highlight java %} + +@Override +public void connectSensors() { + super.connectSensors(); + + if (getDriver().isJmxEnabled()) { + String requestProcessorMbeanName = "Catalina:type=GlobalRequestProcessor,name=\"http-*\""; + + Integer port = isHttpsEnabled() ? getAttribute(HTTPS_PORT) : getAttribute(HTTP_PORT); + String connectorMbeanName = format("Catalina:type=Connector,port=%s", port); + + jmxWebFeed = JmxFeed.builder() + .entity(this) + .period(3000, TimeUnit.MILLISECONDS) + .pollAttribute(new JmxAttributePollConfig<Integer>(ERROR_COUNT) + .objectName(requestProcessorMbeanName) + .attributeName("errorCount")) + .pollAttribute(new JmxAttributePollConfig<Integer>(REQUEST_COUNT) + .objectName(requestProcessorMbeanName) + .attributeName("requestCount")) + .pollAttribute(new JmxAttributePollConfig<Integer>(TOTAL_PROCESSING_TIME) + .objectName(requestProcessorMbeanName) + .attributeName("processingTime")) + .pollAttribute(new JmxAttributePollConfig<String>(CONNECTOR_STATUS) + .objectName(connectorMbeanName) + .attributeName("stateName")) + .pollAttribute(new JmxAttributePollConfig<Boolean>(SERVICE_PROCESS_IS_RUNNING) + .objectName(connectorMbeanName) + .attributeName("stateName") + .onSuccess(Functions.forPredicate(Predicates.<Object>equalTo("STARTED"))) + .setOnFailureOrException(false)) + .build(); + + jmxAppFeed = JavaAppUtils.connectMXBeanSensors(this); + } else { + // if not using JMX + LOG.warn("Tomcat running without JMX monitoring; limited visibility of service available"); + connectServiceUpIsRunning(); + } +} + +{% endhighlight %} + +We can see here that if jmx is not enabled, the method will call `connectServiceUpIsRunning()` which will use the +default PID-based method of determining if a process is running. However, as JMX *is* running, the `service.process.isRunning` +sensor (denoted here by the `SERVICE_PROCESS_IS_RUNNING` variable) is set to true if and only if the +`stateName` JMX attribute equals `STARTED`. We can see from the previous call to `.pollAttribute` that this +attribute is also published to the `CONNECTOR_STATUS` sensor. The `CONNECTOR_STATUS` sensor is defined as follows: + +{% highlight java %} + +AttributeSensor<String> CONNECTOR_STATUS = + new BasicAttributeSensor<String>(String.class, "webapp.tomcat.connectorStatus", "Catalina connector state name"); + +{% endhighlight %} + +Let's go back to the Brooklyn debug console and look for the `webapp.tomcat.connectorStatus`: + +[](images/jmx-sensors-large.png) + +As the sensor is not shown, it's likely that it's simply null or not set. We can check this by clicking +the `Show/hide empty records` icon (highlighted in yellow above): + +[](images/jmx-sensors-all-large.png) + +We know from previous steps that the installation and launch scripts completed, and we know the procecess is running, +but we can see here that the server is not responding to JMX requests. A good thing to check here would be that the +JMX port is not being blocked by iptables, firewalls or security groups +(see the (troubleshooting connectivity guide)[troubleshooting-connectivity.html]). +Let's assume that we've checked that and they're all open. There is still one more thing that Brooklyn can tell us. + + +Still on the `Sensors` tab, let's take a look at the `log.location` sensor: + +{% highlight console %} + +/tmp/brooklyn-martin/apps/c3bmrlC3/entities/TomcatServer_C1TAjYia/logs/catalina.out + +{% endhighlight %} + +This is the location of Tomcat's own log file. The location of the log file will differ from process to process +and when writing a custom entity you will need to check the software's own documentation. If your blueprint's +ssh driver extends `JavaSoftwareProcessSshDriver`, the value returned by the `getLogFileLocation()` method will +automatically be published to the `log.location` sensor. Otherwise, you can publish the value yourself by calling +`entity.setAttribute(Attributes.LOG_FILE_LOCATION, getLogFileLocation());` in your ssh driver + +**Note:** The log file will be on the server to which you have deployed Tomcat, and not on the Brooklyn server. +Let's take a look in the log file: + +{% highlight console %} + +less /tmp/brooklyn-martin/apps/c3bmrlC3/entities/TomcatServer_C1TAjYia/logs/catalina.out + +Jul 21, 2015 4:12:12 PM org.apache.tomcat.util.digester.Digester fatalError +SEVERE: Parse Fatal Error at line 143 column 3: The element type "unmatched-element" must be terminated by the matching end-tag "</unmatched-element>". + org.xml.sax.SAXParseException; systemId: file:/tmp/brooklyn-martin/apps/c3bmrlC3/entities/TomcatServer_C1TAjYia/conf/server.xml; lineNumber: 143; columnNumber: 3; The element type "unmatched-element" must be terminated by the matching end-tag "</unmatched-element>". + at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) + at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) + at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441) + at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368) + at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1437) + at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1749) + at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973) + at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606) + at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) + at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848) + at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777) + at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) + at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) + at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649) + at org.apache.tomcat.util.digester.Digester.parse(Digester.java:1561) + at org.apache.catalina.startup.Catalina.load(Catalina.java:615) + at org.apache.catalina.startup.Catalina.start(Catalina.java:677) + at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) + at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) + at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) + at java.lang.reflect.Method.invoke(Method.java:497) + at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:321) +at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:455) + + Jul 21, 2015 4:12:12 PM org.apache.catalina.startup.Catalina load + WARNING: Catalina.start using conf/server.xml: The element type "unmatched-element" must be terminated by the matching end-tag "</unmatched-element>". + Jul 21, 2015 4:12:12 PM org.apache.catalina.startup.Catalina start + SEVERE: Cannot start server. Server instance is not configured. + +{% endhighlight %} + +As expected, we can see here that the `unmatched-element` element has not been terminated in the `server.xml` file http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d930107/docs/guide/ops/troubleshooting/index.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/index.md b/docs/guide/ops/troubleshooting/index.md index ca0b8a9..ee8dfd7 100644 --- a/docs/guide/ops/troubleshooting/index.md +++ b/docs/guide/ops/troubleshooting/index.md @@ -2,10 +2,11 @@ title: Troubleshooting layout: website-normal children: -- troubleshooting-runtime-errors.md -- troubleshooting-deployment.md -- troubleshooting-softwareprocess.md -- troubleshooting-connectivity.md +- { path: overview.md, title: Overview } +- { path: deployment.md, title: Deployment } +- { path: connectivity.md, title: Server Connectivity } +- { path: softwareprocess.md, title: SoftwareProcess Entities } +- { path: going-deep-in-java-and-logs.md, title: Going Deep in Java and Logs } --- -{% include list-children.html %} \ No newline at end of file +{% include list-children.html %} http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d930107/docs/guide/ops/troubleshooting/overview.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/overview.md b/docs/guide/ops/troubleshooting/overview.md new file mode 100644 index 0000000..815d33c --- /dev/null +++ b/docs/guide/ops/troubleshooting/overview.md @@ -0,0 +1,116 @@ +--- +layout: website-normal +title: Troubleshooting Overview +toc: /guide/toc.json +--- + +This guide describes sources of information for understanding when things go wrong. + +Whether you're customizing out-of-the-box blueprints, or developing your own custom blueprints, you will +inevitably have to deal with entity failure. Thankfully Brooklyn provides plenty of information to help +you locate and resolve any issues you may encounter. + + +## Web-console Runtime Error Information + +### Entity Hierarchy + +The Brooklyn web-console includes a tree view of the entities within an application. Errors within the +application are represented visually, showing a "fire" image on the entity. + +When an error causes an entire application to be unexpectedly down, the error is generally propagated to the +top-level entity - i.e. marking it as "on fire". To find the underlying error, one should expand the entity +hierarchy tree to find the specific entities that have actually failed. + + +### Entity's Error Status + +Many entities have some common sensors (i.e. attributes) that give details of the error status: + +* `service.isUp` (often referred to as "service up") is a boolean, saying whether the service is up. For many + software processes, this is inferred from whether the "service.notUp.indicators" is empty. It is also + possible for some entities to set this attribute directly. +* `service.notUp.indicators` is a map of errors. This often gives much more information than the single + `service.isUp` attribute. For example, there may be many health-check indicators for a component: + is the root URL reachable, it the management api reporting healthy, is the process running, etc. +* `service.problems` is a map of namespaced indicators of problems with a service. +* `service.state` is the actual state of the service - e.g. CREATED, STARTING, RUNNING, STOPPING, STOPPED, + DESTROYED and ON_FIRE. +* `service.state.expected` indicates the state the service is expected to be in (and when it transitioned to that). + For example, is the service expected to be starting, running, stopping, etc. + +These sensor values are shown in the "sensors" tab - see below. + + +### Sensors View + +The "Sensors" tab in the Brooklyn web-console shows the attribute values of a particular entity. +This gives lots of runtime information, including about the health of the entity - the +set of attributes will vary between different entity types. + +[](images/jmx-sensors-large.png) + +Note that null (or not set) sensors are hidden by default. You can click on the `Show/hide empty records` +icon (highlighted in yellow above) to see these sensors as well. + +The sensors view is also tabulated. You can configure the numbers of sensors shown per page +(at the bottom). There is also a search bar (at the top) to filter the sensors shown. + + +### Activity View + +The activity view shows the tasks executed by a given entity. The top-level tasks are the effectors +(i.e. operations) invoked on that entity. This view allows one to drill into the task, to +see details of errors. + +Select the entity, and then click on the `Activities` tab. + +In the table showing the tasks, each row is a link - clicking on the row will drill into the details of that task, +including sub-tasks: + +[](images/failed-task-large.png) + +For ssh tasks, this allows one to drill down to see the env, stdin, stdout and stderr. That is, you can see the +commands executed (stdin) and environment variables (env), and the output from executing that (stdout and stderr). + +For tasks that did not fail, one can still drill into the tasks to see what was done. + +It's always worth looking at the Detailed Status section as sometimes that will give you the information you need. +For example, it can show the exception stack trace in the thread that was executing the task that failed. + + +## Log Files + +Brooklyn's logging is configurable, for the files created, the logging levels, etc. +See [Logging docs](../logging.html). + +With out-of-the-box logging, `brooklyn.info.log` and `brooklyn.debug.log` files are created. These are by default +rolling log files: when the log reaches a given size, it is compressed and a new log file is started. +Therefore check the timestamps of the log files to ensure you are looking in the correct file for the +time of your error. + +With out-of-the-box logging, info, warnings and errors are written to the `brooklyn.info.log` file. This gives +a summary of the important actions and errors. However, it does not contain full stacktraces for errors. + +To find the exception, we'll need to look in Brooklyn's debug log file. By default, the debug log file +is named `brooklyn.debug.log`. You can use your favourite tools for viewing large text files. + +One possible tool is `less`, e.g. `less brooklyn.debug.log`. We can quickly find the last exception +by navigating to the end of the log file (using `Shift-G`), then performing a reverse-lookup by typing `?Exception` +and pressing `Enter`. Sometimes an error results in multiple exceptions being logged (e.g. first for the +entity, then for the cluster, then for the app). If you know the text of the error message (e.g. copy-pasted +from the Activities view of the web-console) then one can search explicitly for that text. + +The `grep` command is also extremely helpful. Useful things to grep for include: + +* The entity id (see the "summary" tab of the entity in the web-console for the id). +* The entity type name (if there are only a small number of entities of that type). +* The VM IP address. +* A particular error message (e.g. copy-pasted from the Activities view of the web-console). +* The word WARN etc, such as `grep -E "WARN|ERROR" brooklyn.info.log`. + +Grep'ing for particular log messages is also useful. Some examples are shown below: + +* INFO: "Started application", "Stopping application" and "Stopped application" +* INFO: "Creating VM " +* DEBUG: "Finished VM " http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d930107/docs/guide/ops/troubleshooting/softwareprocess.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/softwareprocess.md b/docs/guide/ops/troubleshooting/softwareprocess.md new file mode 100644 index 0000000..5e7f883 --- /dev/null +++ b/docs/guide/ops/troubleshooting/softwareprocess.md @@ -0,0 +1,50 @@ +--- +layout: website-normal +title: Troubleshooting SoftwareProcess Entities +toc: /guide/toc.json +--- + +The [troubleshooting overview](overview.html) in Brooklyn gives +information for how to find more information about errors. + +If that doesn't give enough information to diagnose, fix or workaround the problem, then it can be required +to login to the machine, to investigate further. This guide applies to entities that are types +of "SoftwareProcess" in Brooklyn, or that follows those conventions. + + +## VM connection details + +The ssh connection details for an entity is published to a sensor `host.sshAddress`. The login +credentials will depend on the Brooklyn configuration. The default is to use the `~/.ssh/id_rsa` +or `~/.ssh/id_dsa` on the Brooklyn host (uploading the associated `~/.ssh/id_rsa.pub` to the machine's +authorised_keys). However, this can be overridden (e.g. with specific passwords etc) in the +location's configuration. + +For Windows, there is a similar sensor with the name `host.winrmAddress`. (TODO sensor for password?) + + +## Install and Run Directories + +For ssh-based software processes, the install directory and the run directory are published as sensors +`install.dir` and `run.dir` respectively. + +For some entities, files are unpacked into the install dir; configuration files are written to the +run dir along with log files. For some other entities, these directories may be mostly empty - +e.g. if installing RPMs, and that software writes its logs to a different standard location. + +Most entities have a sensor `log.location`. It is generally worth checking this, along with other files +in the run directory (such as console output). + + +## Process and OS Health + +It is worth checking that the process is running, e.g. using `ps aux` to look for the desired process. +Some entities also write the pid of the process to `pid.txt` in the run directory. + +It is also worth checking if the required port is accessible. This is discussed in the guide +"Troubleshooting Server Connectivity Issues in the Cloud", including listing the ports in use: +execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports in use (or use +`-anup` for UDP). + +It is also worth checking the disk space on the server, e.g. using `df -m`, to check that there +is sufficient space on each of the required partitions.
