Hi Peeps,

I'm trying to learn Kapacitor and hitting a few snags in my understanding. Trying to solve this simple problem has surfaced all sorts of questions, and I'm hoping to get some of my misunderstandings sorted out.
I've got a measurement called process_count that shows how many instances of a given process are running, by host, and there's a 'system' measurement which comes from Telegraf and is essentially the output of `uptime`. If that process stops running (process_count goes to 0), I want to be alerted. But when a new box comes up, I want to allow some breathing space before we get alerts. There are obviously a few ways to solve this (I've even tried some!) and I'm keen to learn better ways, but I'm running with this as a sample for asking questions.

Samples are sent to Influx at about 30 second intervals (+/- jitter).

I'm trying to join process_count records onto the system record (1 system record for many process_count records), so that there's an uptime field available when I determine my critical alert.

Here's a sample from my process_count measurement and from my system measurement:

```
> select count, host, name from process_count where instance_id='i-0xxxxx3e078a04f20' group by instance_id order by time desc limit 10;
name: process_count
tags: instance_id=i-0xxxxx3e078a04f20
time                            count  host                                        name
----                            -----  ----                                        ----
2017-02-10T03:25:56.004751872Z  1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
2017-02-10T03:25:55.984448256Z  6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
2017-02-10T03:25:25.92088576Z   1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
2017-02-10T03:25:25.900282368Z  6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
2017-02-10T03:24:55.834618368Z  1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
2017-02-10T03:24:55.814406144Z  6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
2017-02-10T03:24:25.751718144Z  1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
2017-02-10T03:24:25.7313984Z    6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
2017-02-10T03:23:55.66639104Z   1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
2017-02-10T03:23:55.64570112Z   6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes

> select uptime, host from system where host='someapp-production-web-i-0xxxxx3e078a04f20' order by time desc limit 5 ;
name: system
time                  uptime  host
----                  ------  ----
2017-02-10T03:26:04Z  55399   someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:25:30Z  55365   someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:25:01Z  55336   someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:24:35Z  55309   someapp-production-web-i-0xxxxx3e078a04f20
2017-02-10T03:24:01Z  55276   someapp-production-web-i-0xxxxx3e078a04f20
```

The TICKscript is below, together with the task's stats. The problem I seem to be having is that nothing is coming out of the join. I'm not getting any logging out of the `|log()` on the join or the subsequent stream. I'm hoping that the process_counts zip to the nearest system record in time (I have a tolerance of 14s), based on host. The annotations in the DOT output also seem to suggest nothing is processed through these streams.
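To make the "zip to the nearest time" expectation concrete, here is a rough Python model of the matching I have in mind. It is only a sketch of my mental model and not how Kapacitor implements `|join()`; the `join_nearest` helper, the `ts` parser and the dict layout are made up for illustration, and the data points are taken from the samples above.

```python
from datetime import datetime, timedelta

HOST = "someapp-production-web-i-0xxxxx3e078a04f20"


def ts(text):
    # Sketch only: keep whole seconds, drop fractional seconds and the trailing Z.
    return datetime.strptime(text[:19], "%Y-%m-%dT%H:%M:%S")


def join_nearest(process_points, system_points, tolerance):
    """Pair each process_count point with the closest system point for the
    same host, keeping the pair only if the two are within `tolerance`."""
    joined = []
    for p in process_points:
        candidates = [s for s in system_points if s["host"] == p["host"]]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda s: abs(s["time"] - p["time"]))
        if abs(nearest["time"] - p["time"]) <= tolerance:
            # Hypothetical field name, mirroring the 'sys' prefix used in the join.
            joined.append({**p, "sys.uptime": nearest["uptime"]})
    return joined


process_points = [
    {"time": ts("2017-02-10T03:25:56.004751872Z"), "host": HOST,
     "name": "Unicorn Master Processes", "count": 1},
    {"time": ts("2017-02-10T03:25:25.92088576Z"), "host": HOST,
     "name": "Unicorn Master Processes", "count": 1},
]
system_points = [
    {"time": ts("2017-02-10T03:26:04Z"), "host": HOST, "uptime": 55399},
    {"time": ts("2017-02-10T03:25:30Z"), "host": HOST, "uptime": 55365},
]

joined = join_nearest(process_points, system_points, timedelta(seconds=14))
for row in joined:
    print(row["time"], row["name"], row["count"], row["sys.uptime"])
```

In this model the 03:25:56 process sample is 8 seconds away from the 03:26:04 system sample, so a 14s tolerance pairs them, while a 5s tolerance would leave that sample without a partner.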
```
ID: someapp_production_process_not_running
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 09 Feb 17 12:20 UTC
Modified: 10 Feb 17 03:35 UTC
LastEnabled: 10 Feb 17 03:35 UTC
Databases Retention Policies: ["someapp_production"."default"]
TICKscript:
var process_counts = stream
    |from()
        .measurement('process_count')
    |window()
        .period(10m)
        .every(30s)
    |groupBy('host')
    |log()

var box = stream
    |from()
        .measurement('system')
    |window()
        .period(10m)
        .every(30s)
    |log()
    |groupBy('host')

var process_with_uptime = process_counts
    |join(box)
        .as('process', 'sys')
        .tolerance(14s)
        .streamName('process_with_uptime')
        .on('host')
    |log()
        .prefix('** JOIN')
    |eval(lambda: "process.time" - "sys.uptime")
        .as('boot_time')

process_with_uptime
    |log()
        .level('DEBUG')
        .prefix('** PROCSESS WITH UPTIME')
    |alert()
        .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
        .message('{{ index .Tags "name" }} has {{index .Fields "count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
        .info(lambda: "count" >= 0)
        .warn(lambda: "count" == 0)
        // possible alternative
        .crit(lambda: ("count" == 0) AND ("sys.uptime" < 120))
        .victorOps()

DOT:
digraph someapp_production_process_not_running {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0" ];
stream0 -> from5 [processed="23"];
stream0 -> from1 [processed="23"];

from5 [avg_exec_time_ns="157ns" ];
from5 -> window6 [processed="6"];

window6 [avg_exec_time_ns="565ns" ];
window6 -> log7 [processed="0"];

log7 [avg_exec_time_ns="0" ];
log7 -> groupby8 [processed="0"];

groupby8 [avg_exec_time_ns="0" ];
groupby8 -> join10 [processed="0"];

from1 [avg_exec_time_ns="452ns" ];
from1 -> window2 [processed="17"];

window2 [avg_exec_time_ns="1.05µs" ];
window2 -> groupby3 [processed="1"];

groupby3 [avg_exec_time_ns="0" ];
groupby3 -> log4 [processed="0"];

log4 [avg_exec_time_ns="0" ];
log4 -> join10 [processed="0"];

join10 [avg_exec_time_ns="0" ];
join10 -> log11 [processed="0"];

log11 [avg_exec_time_ns="0" ];
log11 -> eval12 [processed="0"];

eval12 [avg_exec_time_ns="0" eval_errors="0" ];
eval12 -> log13 [processed="0"];

log13 [avg_exec_time_ns="0" ];
log13 -> alert14 [processed="0"];

alert14 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
}
```

My questions are:

1) Am I right that this is failing at the join? Or are there fundamentally bigger problems?
2) What have I done wrong for this join to be failing? Am I completely misunderstanding the join (or something even more general), or is there just a small implementation issue?
3) In order to use the result of the join, am I wrong to name it with a var for reuse below? I thought .streamName('..') might do this without setting a var, but I simply get an error that 'process_with_uptime' isn't something that's in scope.
4) Is my overall approach just fundamentally wrong?
5) Apart from using joins, what's the correct way to take the result of one stream (or batch) and use it in another?
6) Should I have approached this in some totally different way?