We use a StatsD metrics reporter feeding a Graphite cluster, and have built out extensive graphs in Grafana. On top of that we use Seyren for alerting. Right now we have alerts on the following:
- Spout lag greater than our defined SLAs.
- Null reported spout lag, i.e. if the topology stops reporting metrics (or just isn't deployed) for a period of time.
- Failed tuple percentage, if it exceeds a threshold.
- Throughput / number of executes. Our topologies should always be doing something; they're never completely idle. If we see throughput drop below a threshold, we'll be alerted.

Hope this helps! Curious what others monitor/alert on.

Stephen

On Thu, Oct 27, 2016 at 2:49 AM, Chen Junfeng <k-2f...@hotmail.com> wrote:
> What specifications will you use to measure it?
>
> Regard
> Junfeng Chen
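
The four alert conditions listed in this message can be sketched roughly as follows. This is only an illustration in Python, not the Seyren configuration described above: the `Window` shape, the `check_alerts` name, and all thresholds are hypothetical, and it assumes the spout lag check looks at the worst value reported in the window.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Window:
    """Metric samples for one topology over an alerting window (hypothetical shape)."""
    spout_lag: List[Optional[float]]  # None means no datapoint was reported
    acked: int                        # tuples acked in the window
    failed: int                       # tuples failed in the window
    executes: int                     # bolt executes in the window


def check_alerts(w: Window, max_lag: float, max_failed_pct: float,
                 min_executes: int) -> List[str]:
    """Return the names of the alert conditions that fire for this window."""
    alerts = []

    # Null lag: the topology reported nothing at all (stopped or not deployed).
    reported = [v for v in w.spout_lag if v is not None]
    if not reported:
        alerts.append("null spout lag")
    # Lag over SLA: assumption here is that the worst sample in the window counts.
    elif max(reported) > max_lag:
        alerts.append("spout lag over SLA")

    # Failed tuple percentage, guarded against an empty window.
    total = w.acked + w.failed
    if total > 0 and 100.0 * w.failed / total > max_failed_pct:
        alerts.append("failed tuple percentage")

    # Throughput floor: a topology that is never idle should always execute something.
    if w.executes < min_executes:
        alerts.append("low throughput")

    return alerts
```

A silent topology trips both the null-lag and the throughput checks, which is why the two are worth keeping as separate alerts: the first says metrics stopped arriving, the second says work stopped happening.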