Hi Peeps

I'm trying to learn to use Kapacitor and hitting a few snags in my 
understanding. Trying to solve this simple problem has surfaced all sorts 
of questions, and I'm hoping to get some of my misunderstandings sorted out.

I've got a measurement called process_count that records how many instances 
of a given process are running on each host, and there's a 'system' table 
which comes from Telegraf and is essentially the output of `uptime`.

If that process stops running (process_count goes to 0), I want to be 
alerted. But when a new box comes up, I want to allow some breathing space 
before we get alerts.
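
To pin down the rule I'm after, here's a minimal sketch in Python rather 
than TICKscript. The function name and the 120-second grace period are just 
placeholders I've made up for illustration; the real inputs would be the 
process_count value and the uptime field from the system table.

```python
GRACE_PERIOD_S = 120  # hypothetical breathing space after boot, in seconds


def should_alert(process_count: int, uptime_s: int,
                 grace_s: int = GRACE_PERIOD_S) -> bool:
    """Alert only when the process is gone AND the box is past its grace period."""
    return process_count == 0 and uptime_s >= grace_s


# Long-running box whose process has stopped.
print(should_alert(0, 55399))
# Freshly booted box: the process is not up yet, but we are still in the grace period.
print(should_alert(0, 30))
# Process running normally.
print(should_alert(6, 55399))
```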

There are obviously a few ways to solve this (I've even tried some!) and 
I'm keen to learn better ways, but I'm running with this as a sample for 
asking questions.

Samples are sent to InfluxDB at about 30-second intervals (+/- jitter).

I'm trying to join process_count records onto the system record (1 system 
record for many process_count records), so that there's an uptime field 
available when I evaluate my critical alert.


Here's a sample from my process_count table and from my system table:

```
> select count, host, name from process_count where instance_id='i-0xxxxx3e078a04f20' group by instance_id order by time desc limit 10;
name: process_count
tags: instance_id=i-0xxxxx3e078a04f20
time                           count host                                       name
----                           ----- ----                                       ----
2017-02-10T03:25:56.004751872Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
2017-02-10T03:25:55.984448256Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
2017-02-10T03:25:25.92088576Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
2017-02-10T03:25:25.900282368Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
2017-02-10T03:24:55.834618368Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
2017-02-10T03:24:55.814406144Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
2017-02-10T03:24:25.751718144Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
2017-02-10T03:24:25.7313984Z   6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
2017-02-10T03:23:55.66639104Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
2017-02-10T03:23:55.64570112Z  6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes

> select uptime, host from system where host='someapp-production-web-i-0xxxxx3e078a04f20' order by time desc limit 5;
name: system
time                 uptime host
----                 ------ ----
2017-02-10T03:26:04Z 55399  someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:25:30Z 55365  someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:25:01Z 55336  someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:24:35Z 55309  someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:24:01Z 55276  someapp-production-web-i-0xxxxx3e078a04f20
```

And here's the TICKscript. The problem I seem to be having is that nothing 
is coming out of the join. I'm not getting any logging out of the .log on 
the join or the subsequent stream. I'm hoping that each process_count point 
gets zipped to the system point nearest in time (I have a tolerance of 14s) 
based on host. Also, the annotations in the DOT output seem to suggest 
nothing is processed through these streams.
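
In case it helps, this is the matching behaviour I'm expecting from the 
join, sketched in Python. It is only my mental model and may well not be 
how Kapacitor actually applies .tolerance(); the function name and 
dictionary keys are made up for illustration, and the timestamps are 
rounded to the second from the samples above.

```python
from datetime import datetime, timedelta

TOLERANCE = timedelta(seconds=14)  # same value I pass to .tolerance(14s)


def ts(text):
    """Parse a second-resolution UTC timestamp."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def join_nearest(process_points, system_points, tolerance=TOLERANCE):
    """Attach to each process_count point the uptime from the same host's
    system point that is closest in time, if it is within the tolerance."""
    joined = []
    for p in process_points:
        same_host = [s for s in system_points if s["host"] == p["host"]]
        if not same_host:
            continue  # no system record for this host at all
        nearest = min(same_host, key=lambda s: abs(s["time"] - p["time"]))
        if abs(nearest["time"] - p["time"]) <= tolerance:
            joined.append({
                "time": p["time"],
                "host": p["host"],
                "name": p["name"],
                "count": p["count"],
                "sys.uptime": nearest["uptime"],
            })
    return joined


HOST = "someapp-production-web-i-0xxxxx3e078a04f20"
process_points = [
    {"time": ts("2017-02-10T03:25:56Z"), "host": HOST,
     "name": "Unicorn Master Processes", "count": 1},
    {"time": ts("2017-02-10T03:25:55Z"), "host": HOST,
     "name": "Unicorn Worker Processes", "count": 6},
]
system_points = [
    {"time": ts("2017-02-10T03:26:04Z"), "host": HOST, "uptime": 55399},
    {"time": ts("2017-02-10T03:25:30Z"), "host": HOST, "uptime": 55365},
]

for row in join_nearest(process_points, system_points):
    print(row["name"], row["count"], row["sys.uptime"])
```

So for every process_count point I'd expect one joined point per host 
carrying both the count and the uptime, as long as a system point landed 
within 14 seconds of it.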


```
ID: someapp_production_process_not_running
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 09 Feb 17 12:20 UTC
Modified: 10 Feb 17 03:35 UTC
LastEnabled: 10 Feb 17 03:35 UTC
Databases Retention Policies: ["someapp_production"."default"]
TICKscript:


var process_counts = stream
    |from()
        .measurement('process_count')
    |window()
        .period(10m)
        .every(30s)
    |groupBy('host')
    |log()

var box = stream
    |from()
        .measurement('system')
    |window()
        .period(10m)
        .every(30s)
    |log()
    |groupBy('host')

var process_with_uptime = process_counts
    |join(box)
        .as('process', 'sys')
        .tolerance(14s)
        .streamName('process_with_uptime')
        .on('host')
    |log()
        .prefix('** JOIN')
    |eval(lambda: "process.time" - "sys.uptime")
        .as('boot_time')

process_with_uptime
    |log()
        .level('DEBUG')
        .prefix('** PROCSESS WITH UPTIME')
    |alert()
        .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
        .message('{{ index .Tags "name" }} has {{ index .Fields "count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{ index .Fields "boot_time" }}.')
        .info(lambda: "count" >= 0)
        .warn(lambda: "count" == 0)
        .crit(lambda: ("count" == 0) AND ("sys.uptime" < 120)) // possible alternative
        .victorOps()

DOT:
digraph someapp_production_process_not_running {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0" ];
stream0 -> from5 [processed="23"];
stream0 -> from1 [processed="23"];

from5 [avg_exec_time_ns="157ns" ];
from5 -> window6 [processed="6"];

window6 [avg_exec_time_ns="565ns" ];
window6 -> log7 [processed="0"];

log7 [avg_exec_time_ns="0" ];
log7 -> groupby8 [processed="0"];

groupby8 [avg_exec_time_ns="0" ];
groupby8 -> join10 [processed="0"];

from1 [avg_exec_time_ns="452ns" ];
from1 -> window2 [processed="17"];

window2 [avg_exec_time_ns="1.05µs" ];
window2 -> groupby3 [processed="1"];

groupby3 [avg_exec_time_ns="0" ];
groupby3 -> log4 [processed="0"];

log4 [avg_exec_time_ns="0" ];
log4 -> join10 [processed="0"];

join10 [avg_exec_time_ns="0" ];
join10 -> log11 [processed="0"];

log11 [avg_exec_time_ns="0" ];
log11 -> eval12 [processed="0"];

eval12 [avg_exec_time_ns="0" eval_errors="0" ];
eval12 -> log13 [processed="0"];

log13 [avg_exec_time_ns="0" ];
log13 -> alert14 [processed="0"];

alert14 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" 
infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
}
```

My questions are:
1) Am I right that this is failing at the join? Or are there 
fundamentally bigger problems?
2) What have I done wrong for this join to be failing? Am I completely 
misunderstanding the join (or something even more general), or is there 
just a small implementation issue?
3) In order to use the result of the join, am I wrong to name it with a var 
for reuse below? I thought .streamName('..') might do this without setting 
a var, but I simply get an error that 'process_with_uptime' isn't something 
that's in scope.
4) Is my overall approach just fundamentally wrong?
5) Apart from using joins, what's the correct way to take the result of one 
stream (or batch) and use it in another?
6) Should I have approached this in some totally different way?
