troubleshooting

aledsage Tue, 28 Jul 2015 08:46:43 -0700

Repository: incubator-brooklyn
Updated Branches:
  refs/heads/master 726cae34d -> a7b3d8e99



Adds docs ops/troubleshooting

- Moves troubleshooting-connectivity from dev/tips to ops
- Adds troubleshooting guides for:
  - runtime-errors
  - deployment
  - software process


Project: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/commit/3d29c8b5
Tree: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/tree/3d29c8b5
Diff: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/diff/3d29c8b5

Branch: refs/heads/master
Commit: 3d29c8b5db0925149181853f5380ea00add57f82
Parents: 2013f2f
Author: Aled Sage <[email protected]>
Authored: Thu Jul 23 19:47:23 2015 -0700
Committer: Alex Heneveld <[email protected]>
Committed: Fri Jul 24 15:24:58 2015 +0100

----------------------------------------------------------------------
 docs/guide/dev/index.md                         |   2 -
 .../dev/tips/troubleshooting-connectivity.md    | 143 -------------------
 docs/guide/ops/index.md                         |   1 +
 .../images/failed-task-large.png                | Bin 0 -> 169079 bytes
 .../images/jmx-sensors-large.png                | Bin 0 -> 197177 bytes
 docs/guide/ops/troubleshooting/index.md         |  11 ++
 .../troubleshooting-connectivity.md             | 143 +++++++++++++++++++
 .../troubleshooting-deployment.md               |  88 ++++++++++++
 .../troubleshooting-runtime-errors.md           | 116 +++++++++++++++
 .../troubleshooting-softwareprocess.md          |  50 +++++++
 10 files changed, 409 insertions(+), 145 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/dev/index.md
----------------------------------------------------------------------
diff --git a/docs/guide/dev/index.md b/docs/guide/dev/index.md
index aa04ef9..0a7acfd 100644
--- a/docs/guide/dev/index.md
+++ b/docs/guide/dev/index.md
@@ -14,8 +14,6 @@ children:
 - tips/
 - tips/logging.md
 - tips/debugging-remote-brooklyn.md
-- tips/troubleshooting-exceptions.md
-- tips/troubleshooting-connectivity.md
 - rest/rest-api-doc.md
 ---
 

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/dev/tips/troubleshooting-connectivity.md
----------------------------------------------------------------------
diff --git a/docs/guide/dev/tips/troubleshooting-connectivity.md b/docs/guide/dev/tips/troubleshooting-connectivity.md
deleted file mode 100644
index 07874c0..0000000
--- a/docs/guide/dev/tips/troubleshooting-connectivity.md
+++ /dev/null
@@ -1,143 +0,0 @@
----
-layout: website-normal
-title: Troubleshooting Server Connectivity Issues in the Cloud
-toc: /guide/toc.json
----
-
-A common problem when setting up an application in the cloud is getting the 
basic connectivity right - how
-do I get my service (e.g. a TCP host:port) publicly accessible over the 
internet?
-
-This varies a lot - e.g. Is the VM public or in a private network? Is the 
service only accessible through
-a load balancer? Should the service be globally reachable or only to a 
particular CIDR?
-
-This guide gives some general tips for debugging connectivity issues, which 
are applicable to a 
-range of different service types. Choose those that are appropriate for your 
use-case.
-
-## VM reachable
-If the VM is supposed to be accessible directly (e.g. from the public 
internet, or if in a private network
-then from a jump host)...
-
-### ping
-Can you `ping` the VM from the machine you are trying to reach it from?
-
-However, ping is over ICMP. If the VM is unreachable, it could be that the 
firewall forbids ICMP but still
-lets TCP traffic through.
-
-### telnet to TCP port
-You can check if a given TCP port is reachable and listening using `telnet 
<host> <port>`, such as
-`telnet www.google.com 80`, which gives output like:
-
-```
-    Trying 31.55.163.219...
-    Connected to www.google.com.
-    Escape character is '^]'.
-```
-
-If this is very slow to respond, it can be caused by a firewall blocking 
access. If it is fast, it could
-be that the server is just not listening on that port.
-
-### DNS and routing
-If using a hostname rather than IP, then is it resolving to a sensible IP?
-
-Is the route to the server sensible? (e.g. one can hit problems with proxy 
servers in a corporate
-network, or ISPs returning a default result for unknown hosts).
-
-The following commands can be useful:
-
-* `host` is a DNS lookup utility. e.g. `host www.google.com`.
-* `dig` stands for "domain information groper". e.g. `dig www.google.com`.
-* `traceroute` prints the route that packets take to a network host. e.g. 
`traceroute www.google.com`.
-
-## Service is listening
-
-### Service responds
-Try connecting to the service from the VM itself. For example, `curl 
http://localhost:8080` for a
-web-service.
-
-On dev/test VMs, don't be afraid to install the utilities you need such as 
`curl`, `telnet`, `nc`,
-etc. Cloud VMs often have a very cut-down set of packages installed. For 
example, execute
-`sudo apt-get update; sudo apt-get install -y curl` or `sudo yum install -y 
curl`.
-
-### Listening on port
-Check that the service is listening on the port, and on the correct NIC(s).
-
-Execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports 
in use (or use
-`-anup` for UDP). You should expect to see the something like the output below 
for a service.
-
-```
-Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
-tcp        0      0 :::8080                     :::*                        LISTEN      8276/java           
-```
-
-In this case a Java process with pid 8276 is listening on port 8080. The local 
address `:::8080`
-format means all NICs (in IPv6 address format). You may also see 
`0.0.0.0:8080` for IPv4 format.
-If it says 127.0.0.1:8080 then your service will most likely not be reachable 
externally.
-
-Use `ip addr show` (or the obsolete `ifconfig -a`) to see the network 
interfaces on your server.
-
-For `netstat`, run with `sudo` to see the pid for all listed ports.
-
-## Firewalls
-On Linux, check if `iptables` is preventing the remote connection. On Windows, 
check the Windows Firewall.
-
-If it is acceptable (e.g. it is not a server in production), try turning off 
the firewall temporarily,
-and testing connectivity again. Remember to re-enable it afterwards! On 
CentOS, this is `sudo service
-iptables stop`. On Ubuntu, use `sudo ufw disable`. On Windows, press the 
Windows key and type 'Windows
-Firewall with Advanced Security' to open the firewall tools, then click 
'Windows Firewall Properties'
-and set the firewall state to 'Off' in the Domain, Public and Private profiles.
-
-If you cannot temporarily turn off the firewall, then look carefully at the 
firewall settings. For
-example, execute `sudo iptables -n --list` and `iptables -t nat -n --list`.
-
-## Cloud firewalls
-Some clouds offer a firewall service, where ports need to be explicitly listed 
to be reachable.
-
-For example, [security groups for EC2-classic]
-(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#ec2-classic-security-groups)
-have rules for the protocols and ports to be reachable from specific CIDRs.
-
-Check these settings via the cloud provider's web-console (or API).
-
-## Quick test of a listener port
-It can be useful to start listening on a given port, and to then check if that 
port is reachable.
-This is useful for testing basic connectivity when your service is not yet 
running, or to a
-different port to compare behaviour, or to compare with another VM in the 
network.
-
-The `nc` netcat tool is useful for this. For example, `nc -l 0.0.0.0 8080` 
will listen on port
-TCP 8080 on all network interfaces. On another server, you can then run `echo 
hello from client
-| nc <hostname> 8080`. If all works well, this will send "hello from client" 
over the TCP port 8080,
-which will be written out by the `nc -l` process before exiting.
-
-Similarly for UDP, you use `-lU`.
-
-You may first have to install `nc`, e.g. with `sudo yum install -y nc` or 
`sudo apt-get install netcat`.
-
-### Cloud load balancers
-For some use-cases, it is good practice to use the load balancer service 
offered by the cloud provider
-(e.g. [ELB in AWS](http://aws.amazon.com/elasticloadbalancing/) or the 
[Cloudstack Load Balancer]
-(http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/latest/network_setup.html#management-server-load-balancing))
-
-The VMs can all be isolated within a private network, with access only through 
the load balancer service.
-
-Debugging techniques here include ensuring connectivity from another jump 
server within the private
-network, and careful checking of the load-balancer configuration from the 
Cloud Provider's web-console.
-
-### DNAT
-Use of DNAT is appropriate for some use-cases, where a particular port on a 
particular VM is to be
-made available.
-
-Debugging connectivity issues here is similar to the steps for a cloud load 
balancer. Ensure
-connectivity from another jump server within the private network. Carefully 
check the NAT rules from
-the Cloud Provider's web-console.
-
-### Guest wifi
-It is common for guest wifi to restrict access to only specific ports (e.g. 80 
and 443, restricting
-ssh over port 22 etc).
-
-Normally your best bet is then to abandon the guest wifi (e.g. to tether to a 
mobile phone instead).
-
-There are some unconventional workarounds such as [configuring sshd to listen 
on port 80 so you can
-use an ssh 
tunnel](http://askubuntu.com/questions/107173/is-it-possible-to-ssh-through-port-80).
-However, the firewall may well inspect traffic so sending non-http traffic 
over port 80 may still fail.
-
-  

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/index.md
----------------------------------------------------------------------
diff --git a/docs/guide/ops/index.md b/docs/guide/ops/index.md
index dae3071..1cb28aa 100644
--- a/docs/guide/ops/index.md
+++ b/docs/guide/ops/index.md
@@ -11,6 +11,7 @@ children:
 - high-availability.md
 - catalog/
 - logging.md
+- troubleshooting/
 ---
 
 {% include list-children.html %}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/images/failed-task-large.png
----------------------------------------------------------------------
diff --git a/docs/guide/ops/troubleshooting/images/failed-task-large.png b/docs/guide/ops/troubleshooting/images/failed-task-large.png
new file mode 100644
index 0000000..1c264c4
Binary files /dev/null and b/docs/guide/ops/troubleshooting/images/failed-task-large.png differ

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png
----------------------------------------------------------------------
diff --git a/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png b/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png
new file mode 100644
index 0000000..d9322c6
Binary files /dev/null and b/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png differ

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/index.md
----------------------------------------------------------------------
diff --git a/docs/guide/ops/troubleshooting/index.md b/docs/guide/ops/troubleshooting/index.md
new file mode 100644
index 0000000..ca0b8a9
--- /dev/null
+++ b/docs/guide/ops/troubleshooting/index.md
@@ -0,0 +1,11 @@
+---
+title: Troubleshooting
+layout: website-normal
+children:
+- troubleshooting-runtime-errors.md
+- troubleshooting-deployment.md
+- troubleshooting-softwareprocess.md
+- troubleshooting-connectivity.md
+---
+
+{% include list-children.html %}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md
----------------------------------------------------------------------
diff --git a/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md b/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md
new file mode 100644
index 0000000..07874c0
--- /dev/null
+++ b/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md
@@ -0,0 +1,143 @@
+---
+layout: website-normal
+title: Troubleshooting Server Connectivity Issues in the Cloud
+toc: /guide/toc.json
+---
+
+A common problem when setting up an application in the cloud is getting the 
basic connectivity right - how
+do I get my service (e.g. a TCP host:port) publicly accessible over the 
internet?
+
+This varies a lot - e.g. Is the VM public or in a private network? Is the 
service only accessible through
+a load balancer? Should the service be globally reachable or only to a 
particular CIDR?
+
+This guide gives some general tips for debugging connectivity issues, which 
are applicable to a 
+range of different service types. Choose those that are appropriate for your 
use-case.
+
+## VM reachable
+If the VM is supposed to be accessible directly (e.g. from the public 
internet, or if in a private network
+then from a jump host)...
+
+### ping
+Can you `ping` the VM from the machine you are trying to reach it from?
+
+However, ping is over ICMP. If the VM is unreachable, it could be that the 
firewall forbids ICMP but still
+lets TCP traffic through.
+
+### telnet to TCP port
+You can check if a given TCP port is reachable and listening using `telnet 
<host> <port>`, such as
+`telnet www.google.com 80`, which gives output like:
+
+```
+    Trying 31.55.163.219...
+    Connected to www.google.com.
+    Escape character is '^]'.
+```
+
+If the connection attempt hangs or times out, a firewall may be blocking access. If it fails quickly
+(e.g. with "Connection refused"), the host is probably reachable but nothing is listening on that port.
+
+### DNS and routing
+If using a hostname rather than IP, then is it resolving to a sensible IP?
+
+Is the route to the server sensible? (e.g. one can hit problems with proxy 
servers in a corporate
+network, or ISPs returning a default result for unknown hosts).
+
+The following commands can be useful:
+
+* `host` is a DNS lookup utility. e.g. `host www.google.com`.
+* `dig` stands for "domain information groper". e.g. `dig www.google.com`.
+* `traceroute` prints the route that packets take to a network host. e.g. 
`traceroute www.google.com`.
+
+## Service is listening
+
+### Service responds
+Try connecting to the service from the VM itself. For example, `curl 
http://localhost:8080` for a
+web-service.
+
+On dev/test VMs, don't be afraid to install the utilities you need such as 
`curl`, `telnet`, `nc`,
+etc. Cloud VMs often have a very cut-down set of packages installed. For 
example, execute
+`sudo apt-get update; sudo apt-get install -y curl` or `sudo yum install -y 
curl`.
+
+### Listening on port
+Check that the service is listening on the port, and on the correct NIC(s).
+
+Execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports 
in use (or use
+`-anup` for UDP). You should expect to see something like the output below for a service.
+
+```
+Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
+tcp        0      0 :::8080                     :::*                        LISTEN      8276/java           
+```
+
+In this case a Java process with pid 8276 is listening on port 8080. The local 
address `:::8080`
+format means all NICs (in IPv6 address format). You may also see 
`0.0.0.0:8080` for IPv4 format.
+If it says 127.0.0.1:8080 then your service will most likely not be reachable 
externally.
+
+Use `ip addr show` (or the obsolete `ifconfig -a`) to see the network 
interfaces on your server.
+
+For `netstat`, run with `sudo` to see the pid for all listed ports.
+
+## Firewalls
+On Linux, check if `iptables` is preventing the remote connection. On Windows, 
check the Windows Firewall.
+
+If it is acceptable (e.g. it is not a server in production), try turning off 
the firewall temporarily,
+and testing connectivity again. Remember to re-enable it afterwards! On 
CentOS, this is `sudo service
+iptables stop`. On Ubuntu, use `sudo ufw disable`. On Windows, press the 
Windows key and type 'Windows
+Firewall with Advanced Security' to open the firewall tools, then click 
'Windows Firewall Properties'
+and set the firewall state to 'Off' in the Domain, Public and Private profiles.
+
+If you cannot temporarily turn off the firewall, then look carefully at the 
firewall settings. For
+example, execute `sudo iptables -n --list` and `sudo iptables -t nat -n --list`.
+
+## Cloud firewalls
+Some clouds offer a firewall service, where ports need to be explicitly listed 
to be reachable.
+
+For example, [security groups for EC2-classic](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#ec2-classic-security-groups)
+have rules for the protocols and ports to be reachable from specific CIDRs.
+
+Check these settings via the cloud provider's web-console (or API).
+
+## Quick test of a listener port
+It can be useful to start listening on a given port, and to then check if that 
port is reachable.
+This is useful for testing basic connectivity when your service is not yet 
running, or to a
+different port to compare behaviour, or to compare with another VM in the 
network.
+
+The `nc` netcat tool is useful for this. For example, `nc -l 0.0.0.0 8080` will listen on
+TCP port 8080 on all network interfaces. On another server, you can then run
+`echo hello from client | nc <hostname> 8080`. If all works well, this will send "hello from client"
+over TCP port 8080, which will be written out by the `nc -l` process before exiting.
+
+Similarly for UDP, use `-lu` (lower-case `u`; in many `nc` versions, upper-case `-U` selects Unix domain sockets instead).
+
+You may first have to install `nc`, e.g. with `sudo yum install -y nc` or 
`sudo apt-get install netcat`.
+
+### Cloud load balancers
+For some use-cases, it is good practice to use the load balancer service offered by the cloud provider
+(e.g. [ELB in AWS](http://aws.amazon.com/elasticloadbalancing/) or the
+[CloudStack Load Balancer](http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/latest/network_setup.html#management-server-load-balancing)).
+
+The VMs can all be isolated within a private network, with access only through 
the load balancer service.
+
+Debugging techniques here include ensuring connectivity from another jump 
server within the private
+network, and careful checking of the load-balancer configuration from the 
Cloud Provider's web-console.
+
+### DNAT
+Use of DNAT is appropriate for some use-cases, where a particular port on a 
particular VM is to be
+made available.
+
+Debugging connectivity issues here is similar to the steps for a cloud load 
balancer. Ensure
+connectivity from another jump server within the private network. Carefully 
check the NAT rules from
+the Cloud Provider's web-console.
+
+### Guest wifi
+It is common for guest wifi to restrict access to only specific ports (e.g. 80 
and 443, restricting
+ssh over port 22 etc).
+
+Normally your best bet is then to abandon the guest wifi (e.g. to tether to a 
mobile phone instead).
+
+There are some unconventional workarounds such as [configuring sshd to listen 
on port 80 so you can
+use an ssh 
tunnel](http://askubuntu.com/questions/107173/is-it-possible-to-ssh-through-port-80).
+However, the firewall may well inspect traffic so sending non-http traffic 
over port 80 may still fail.
+
+  
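
The telnet-style port check described in the connectivity guide above can also be scripted. The following is a minimal sketch in Python and is not part of this commit; the function name `check_tcp_port` and its result labels are hypothetical, chosen only for illustration. It separates the "slow" case (a timeout, often a firewall silently dropping packets) from the "fast" case (a refused connection, where nothing is listening on the port).

```python
import socket


def check_tcp_port(host, port, timeout=3.0):
    """Classify a TCP connection attempt, similar to `telnet <host> <port>`.

    Returns "open", "refused", "timeout" or "error: <details>".
    """
    try:
        # create_connection resolves the hostname and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout: often a firewall dropping packets.
        return "timeout"
    except ConnectionRefusedError:
        # Fast rejection: the host is reachable but nothing listens on the port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. DNS resolution failure or unreachable network.
        return "error: %s" % exc
```

For example, `check_tcp_port("localhost", 8080)` run on the VM itself mirrors the `curl http://localhost:8080` check, while running it from a remote machine exercises the firewall and routing as well.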

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-deployment.md
----------------------------------------------------------------------
diff --git a/docs/guide/ops/troubleshooting/troubleshooting-deployment.md b/docs/guide/ops/troubleshooting/troubleshooting-deployment.md
new file mode 100644
index 0000000..c343762
--- /dev/null
+++ b/docs/guide/ops/troubleshooting/troubleshooting-deployment.md
@@ -0,0 +1,88 @@
+---
+layout: website-normal
+title: Troubleshooting Deployment
+toc: /guide/toc.json
+---
+
+This guide describes common problems encountered when deploying applications.
+
+
+## YAML deployment errors
+
+The error `Invalid YAML: Plan not in acceptable format: Cannot convert ...` 
means that the text is not 
+valid YAML. Common reasons include that the indentation is incorrect, or that 
there are non-matching
+brackets.
+
+The error `Unrecognized application blueprint format: no services defined` 
means that the `services:`
+section is missing.
+
+An error like `Deployment plan item io.brooklyn.camp.spi.pdp.Service@23c159e2[name=<null>,description=<null>,serviceType=com.acme.Foo,characteristics=[],customAttributes={}] cannot be matched` means that the given entity type (in this case com.acme.Foo) is not in the catalog or on the classpath.
+
+An error like `Illegal parameter for 'location' (aws-ec3); not resolvable: 
java.util.NoSuchElementException: Unknown location 'aws-ec3': either this 
location is not recognised or there is a problem with location resolver 
configuration` means that the given location (in this case aws-ec3) 
+was unknown. This means it does not match any of the named locations in 
brooklyn.properties, nor any of the
+clouds enabled in the jclouds support, nor any of the locations added 
dynamically through the catalog API.
+
+
+## VM Provisioning Failures
+
+There are many stages at which VM provisioning can fail! An error `Failure 
running task provisioning` 
+means there was some problem obtaining or connecting to the machine.
+
+An error like `... Not authorized to access cloud ...` usually means the wrong 
identity/credential was used.
+
+An error like `Unable to match required VM template constraints` means that a 
matching image (e.g. AMI in AWS terminology) could not be found. This 
+could be because an incorrect explicit image id was supplied, or because the 
match-criteria could not
+be satisfied using the given images available in the given cloud. The first 
time this error is 
+encountered, a listing of all images in that cloud/region will be written to 
the debug log.
+
+Failure to form an ssh connection to the newly provisioned VM can be reported 
in several different ways, 
+depending on the nature of the error. This breaks down into failures at 
different points:
+
+* Failure to reach the ssh port (e.g. `... could not connect to any ip address 
port 22 on node ...`).
+* Failure to do the very initial ssh login (e.g. `... Exhausted available 
authentication methods ...`).
+* Failure to ssh using the newly created user.
+
+There are many possible reasons for this ssh failure, which include:
+
+* The VM was "dead on arrival" (DOA) - sometimes a cloud will return an 
unusable VM. One can work around
+  this using the `machineCreateAttempts` configuration option, to 
automatically retry with a new VM.
+* Local network restrictions. On some guest wifis, external access to port 22 is forbidden.
+  Check by manually trying to reach port 22 on a different machine that you have access to.
+* NAT rules not set up correctly. On some clouds that have only private IPs, 
Brooklyn can automatically
+  create NAT rules to provide access to port 22. If this NAT rule creation 
fails for some reason,
+  then Brooklyn will not be able to reach the VM. If NAT rules are being 
created for your cloud, then
+  check the logs for warnings or errors about the NAT rule creation.
+* ssh credentials incorrectly configured. The Brooklyn configuration is very 
flexible in how ssh
+  credentials can be configured. However, if a more advanced configuration is 
used incorrectly (e.g. 
+  the wrong login user, or invalid ssh keys) then this will fail.
+* Wrong login user. The initial login user to use when first logging into the 
new VM is inferred from 
+  the metadata provided by the cloud provider about that image. This can 
sometimes be incomplete, so
+  the wrong user may be used. This can be explicitly set using the `loginUser` 
configuration option.
+  An example of this is with some Ubuntu VMs, where the "ubuntu" user should 
be used. However, on some clouds
+  it defaults to trying to ssh as "root".
+* Bad choice of user. By default, Brooklyn will create a user with the same 
name as the user running the
+  Brooklyn process; the choice of user name is configurable. If this user 
already exists on the machine, 
+  then the user setup will not behave as expected. Subsequent attempts to ssh 
using this user could then fail.
+* Custom credentials on the VM. Most clouds will automatically set the ssh 
login details (e.g. in AWS using  
+  the key-pair, or in CloudStack by auto-generating a password). However, with 
some custom images the VM
+  will have hard-coded credentials that must be used. If Brooklyn's 
configuration does not match that,
+  then it will fail.
+* Guest customisation by the cloud. On some clouds (e.g. vCloud Air), the VM 
can be configured to do
+  guest customisation immediately after the VM starts. This can include 
changing the root password.
+  If Brooklyn is not configured with the expected changed password, then the VM provisioning may fail
+  (depending on whether Brooklyn connects before or after the password is changed!).
+ 
+A very useful debug configuration is to set `destroyOnFailure` to false. This 
will allow ssh failures to
+be more easily investigated.
+
+
+## Timeout Waiting For Service-Up
+
+A common generic error message is that there was a timeout waiting for 
service-up.
+
+This just means that the entity did not get to service-up in the pre-defined 
time period (the default is 
+two minutes, and can be configured using the `start.timeout` config key; the 
timer begins after the 
+start tasks are completed).
+
+See the guide on [runtime errors](troubleshooting-runtime-errors.html) for 
where to find additional information, especially the section on
+"Entity's Error Status".
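
The error messages listed in the deployment guide above can be collected into a small lookup when triaging many failures. The following is a minimal sketch in Python and is not part of this commit; `KNOWN_ERRORS` and `diagnose` are hypothetical names, and the hints only paraphrase the causes given in the guide rather than anything Brooklyn itself reports.

```python
# Message fragments quoted in the guide, paired with the likely cause it gives.
KNOWN_ERRORS = [
    ("Invalid YAML", "text is not valid YAML; check indentation and brackets"),
    ("no services defined", "the `services:` section is missing"),
    ("cannot be matched", "entity type is not in the catalog or on the classpath"),
    ("Unknown location", "location is not named, not an enabled cloud, and not in the catalog"),
    ("Not authorized to access cloud", "wrong identity/credential"),
    ("Unable to match required VM template constraints", "no matching image found"),
    ("could not connect to any ip address port 22", "ssh port is unreachable"),
    ("Exhausted available authentication methods", "initial ssh login failed"),
]


def diagnose(message):
    """Return the hints whose fragment occurs in the given error message."""
    return [hint for fragment, hint in KNOWN_ERRORS if fragment in message]
```

Plain substring matching is used because the guide quotes only fragments of each message; an unrecognised message simply yields an empty list.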

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md
----------------------------------------------------------------------
diff --git a/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md b/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md
new file mode 100644
index 0000000..8b657fc
--- /dev/null
+++ b/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md
@@ -0,0 +1,116 @@
+---
+layout: website-normal
+title: Troubleshooting Runtime Errors
+toc: /guide/toc.json
+---
+
+This guide describes sources of information for runtime errors.
+
+Whether you're customizing out-of-the-box blueprints, or developing your own 
custom blueprints, you will
+inevitably have to deal with entity failure. Thankfully Brooklyn provides 
plenty of information to help 
+you locate and resolve any issues you may encounter.
+
+
+## Web-console Runtime Error Information
+ 
+### Entity Hierarchy
+
+The Brooklyn web-console includes a tree view of the entities within an 
application. Errors within the
+application are represented visually, showing a "fire" image on the entity.
+
+When an error causes an entire application to be unexpectedly down, the error 
is generally propagated to the
+top-level entity - i.e. marking it as "on fire". To find the underlying error, 
one should expand the entity
+hierarchy tree to find the specific entities that have actually failed.
+
+
+### Entity's Error Status
+
+Many entities have some common sensors (i.e. attributes) that give details of 
the error status:
+
+* `service.isUp` (often referred to as "service up") is a boolean, saying 
whether the service is up. For many 
+  software processes, this is inferred from whether the 
"service.notUp.indicators" is empty. It is also
+  possible for some entities to set this attribute directly.
+* `service.notUp.indicators` is a map of errors. This often gives much more 
information than the single 
+  `service.isUp` attribute. For example, there may be many health-check 
indicators for a component: 
+  is the root URL reachable, is the management API reporting healthy, is the process running, etc.
+* `service.problems` is a map of namespaced indicators of problems with a 
service.
+* `service.state` is the actual state of the service - e.g. CREATED, STARTING, 
RUNNING, STOPPING, STOPPED, 
+  DESTROYED and ON_FIRE.
+* `service.state.expected` indicates the state the service is expected to be 
in (and when it transitioned to that).
+  For example, is the service expected to be starting, running, stopping, etc.
+
+These sensor values are shown in the "sensors" tab - see below.
+
+
+### Sensors View
+
+The "Sensors" tab in the Brooklyn web-console shows the attribute values of a 
particular entity.
+This gives lots of runtime information, including about the health of the 
entity - the 
+set of attributes will vary between different entity types.
+
+[![Sensors view in the Brooklyn debug console.](images/jmx-sensors.png)](images/jmx-sensors-large.png)
+
+Note that null (or not set) sensors are hidden by default. You can click on 
the `Show/hide empty records` 
+icon (highlighted in yellow above) to see these sensors as well.
+
+The sensors view is also paginated. You can configure the number of sensors shown per page
+(at the bottom). There is also a search bar (at the top) to filter the sensors shown.
+
+
+### Activity View
+
+The activity view shows the tasks executed by a given entity. The top-level 
tasks are the effectors
+(i.e. operations) invoked on that entity. This view allows one to drill into 
the task, to 
+see details of errors.
+
+Select the entity, and then click on the `Activities` tab.
+
+In the table showing the tasks, each row is a link - clicking on the row will 
drill into the details of that task, 
+including sub-tasks:
+
+[![Task failure error in the Brooklyn debug console.](images/failed-task.png)](images/failed-task-large.png)
+
+For ssh tasks, this allows one to drill down to see the env, stdin, stdout and 
stderr. That is, you can see the
+commands executed (stdin) and environment variables (env), and the output from 
executing that (stdout and stderr). 
+
+For tasks that did not fail, one can still drill into the tasks to see what 
was done.
+
+It's always worth looking at the Detailed Status section as sometimes that 
will give you the information you need.
+For example, it can show the exception stack trace in the thread that was 
executing the task that failed.
+
+
+## Log Files
+
+Brooklyn's logging is configurable, for the files created, the logging levels, 
etc. 
+See [Logging docs](/guide/ops/logging.html).
+
+With out-of-the-box logging, `brooklyn.info.log` and `brooklyn.debug.log` 
files are created. These are by default 
+rolling log files: when the log reaches a given size, it is compressed and a 
new log file is started.
+Therefore check the timestamps of the log files to ensure you are looking in 
the correct file for the 
+time of your error.
+
+With out-of-the-box logging, info, warnings and errors are written to the 
`brooklyn.info.log` file. This gives
+a summary of the important actions and errors. However, it does not contain 
full stacktraces for errors.
+
+To find the exception, we'll need to look in Brooklyn's debug log file. By 
default, the debug log file
+is named `brooklyn.debug.log`. You can use your favourite tools for viewing 
large text files. 
+
+One possible tool is `less`, e.g. `less brooklyn.debug.log`. You can quickly find the last exception
+by navigating to the end of the log file (using `Shift-G`), then searching backwards by typing `?Exception`
+and pressing `Enter`. Sometimes an error results in multiple exceptions being logged (e.g. first for the
+entity, then for the cluster, then for the app). If you know the text of the error message (e.g. copy-pasted
+from the Activities view of the web-console) then you can search explicitly for that text.
+
+The `grep` command is also extremely helpful. Useful things to grep for 
include:
+
+* The entity id (see the "summary" tab of the entity in the web-console for 
the id).
+* The entity type name (if there are only a small number of entities of that 
type). 
+* The VM IP address.
+* A particular error message (e.g. copy-pasted from the Activities view of the 
web-console).
+* A log level such as WARN or ERROR, e.g. `grep -E "WARN|ERROR" brooklyn.info.log`.
+
+Grep'ing for particular log messages is also useful. Some examples are shown 
below:
+
+* INFO: "Started application", "Stopping application" and "Stopped application"
+* INFO: "Creating VM "
+* DEBUG: "Finished VM "
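
The `grep` and `less` searches described in the runtime-errors guide above can be reproduced in a script when the log needs further processing. The following is a minimal sketch in Python and is not part of this commit; `grep_lines`, `last_match` and the sample lines are hypothetical, and the sample lines are made up, so real Brooklyn log lines will look different.

```python
import re


def grep_lines(lines, pattern):
    """Return (line_number, line) pairs matching the regex, like `grep -nE`."""
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]


def last_match(lines, pattern):
    """Return the last matching (line_number, line), or None if nothing matches.

    Comparable to jumping to the end in `less` and searching backwards.
    """
    matches = grep_lines(lines, pattern)
    return matches[-1] if matches else None


# Made-up sample lines, only to show the calls.
sample = [
    "INFO  Started application demo",
    "WARN  something looks wrong",
    "DEBUG java.lang.IllegalStateException: first",
    "DEBUG java.lang.RuntimeException: second",
]
```

To search a real file, read it first, e.g. `last_match(open("brooklyn.debug.log").read().splitlines(), "Exception")`. Rolled-over log files are compressed, so they would need decompressing before being read this way.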

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md
----------------------------------------------------------------------
diff --git a/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md b/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md
new file mode 100644
index 0000000..a09f902
--- /dev/null
+++ b/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md
@@ -0,0 +1,50 @@
+---
+layout: website-normal
+title: Troubleshooting SoftwareProcess Entities
+toc: /guide/toc.json
+---
+
+The [guide for troubleshooting runtime errors](troubleshooting-runtime-errors.html) in Brooklyn
+describes how to find more information about errors.
+
+If that doesn't give enough information to diagnose, fix or work around the problem, then it may be necessary
+to log in to the machine to investigate further. This guide applies to entities that are types
+of "SoftwareProcess" in Brooklyn, or that follow those conventions.
+
+
+## VM connection details
+
+The ssh connection details for an entity are published to the sensor `host.sshAddress`. The login
+credentials will depend on the Brooklyn configuration. The default is to use the `~/.ssh/id_rsa`
+or `~/.ssh/id_dsa` on the Brooklyn host (uploading the associated public key, e.g. `~/.ssh/id_rsa.pub`,
+to the machine's `authorized_keys`). However, this can be overridden (e.g. with specific passwords etc)
+in the location's configuration.
+
+For Windows, there is a similar sensor with the name `host.winrmAddress`. 
(TODO sensor for password?) 
+
+
+## Install and Run Directories
+
+For ssh-based software processes, the install directory and the run directory 
are published as sensors
+`install.dir` and `run.dir` respectively.
+
+For some entities, files are unpacked into the install dir; configuration 
files are written to the
+run dir along with log files. For some other entities, these directories may 
be mostly empty - 
+e.g. if installing RPMs, and that software writes its logs to a different 
standard location.
+
+Most entities have a sensor `log.location`. It is generally worth checking 
this, along with other files
+in the run directory (such as console output).
+
+
+## Process and OS Health
+
+It is worth checking that the process is running, e.g. using `ps aux` to look 
for the desired process.
+Some entities also write the pid of the process to `pid.txt` in the run 
directory.
+
+It is also worth checking if the required port is accessible. This is 
discussed in the guide 
+"Troubleshooting Server Connectivity Issues in the Cloud", including listing 
the ports in use:
+execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports 
in use (or use
+`-anup` for UDP).
+
+It is also worth checking the disk space on the server, e.g. using `df -m`, to 
check that there
+is sufficient space on each of the required partitions.
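
The process and disk checks described in the SoftwareProcess guide above can be combined into a quick health probe run on the VM. The following is a minimal sketch in Python and is not part of this commit; the function names are hypothetical, it assumes a POSIX host, and it assumes `pid.txt` contains a single integer, which the guide does not state.

```python
import os
import shutil


def read_pid(run_dir, filename="pid.txt"):
    """Read the pid from <run_dir>/pid.txt (assumes the file holds one integer)."""
    with open(os.path.join(run_dir, filename)) as f:
        return int(f.read().strip())


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def free_megabytes(path):
    """Free space in MB on the partition holding `path`, roughly what `df -m` reports."""
    return shutil.disk_usage(path).free // (1024 * 1024)
```

For example, `pid_is_running(read_pid(run_dir))` with `run_dir` taken from the entity's `run.dir` sensor checks the process, and `free_megabytes(run_dir)` checks the partition that holds the run directory.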

[1/4] incubator-brooklyn git commit: Adds docs ops/troubleshooting
