[influxdb] Re: [Kapacitor] Questions about my tick script (joins and other things)

Glenn Davy Fri, 10 Feb 2017 17:11:17 -0800

On Saturday, 11 February 2017 07:05:14 UTC+13, nath...@influxdb.com wrote:

> Thanks for a detailed question! 
>


Welcome! Thanks for a detailed answer :)


> Not quite, the join node has two parent nodes log4 and groupBy8. Neither 
> parent has sent any points on to the join node, so the join node has not 
> had an opportunity to do anything yet. If you follow the trail back up, the 
> window6 node has not emitted any values either. Meaning that not enough 
> data has arrived for it to trigger emitting a window. The other window node 
> did get enough data to trigger one emit but that was it. 
>
>
I don't really understand this: what counts as enough data to trigger an 
emit?
 

> Looks like you are windowing the data so that you can have the grace 
> period you were talking about for new hosts. In that case you will want to 
> configure the alert node with `.all()` so that all points in the window 
> have to meet the conditions in order to trigger an alert.
> If you are not using the window for that purpose then just remove it as 
> it's not doing anything otherwise.
>
Nope, that wasn't the purpose; it was really just to give me the illusion 
of understanding what was happening :D

So then, what is the purpose of the window? Is it just a way of saying 
"confine your processing to what's in this group"? So that, for example, if 
I'd done a first()/last()/sum()/count()/max()/min()/other(), it would have 
applied only to what was in the window? Or does it have some other use?

 

>
> 2) What have I done wrong for this join to be failing? Am I completely 
> misunderstanding the join (or even more general), or is there just a small 
> implementation issue?
>
> Understanding the join `.on` property here is the key. The way the `.on` 
> property works is that it expects one of the parents to be grouped by a set 
> of specific tags and one of the other parents to be grouped by less specific 
> tags.
> For example in your case the process data should be grouped by name and 
> host while the uptime data is only grouped by host. The resulting data is 
> grouped by the more specific set of tags (i.e. name and host). I'll show 
> an example below.
>
OK, great, thanks! That makes sense, and seems to work now!
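
For anyone reading along, this is roughly how I now picture the `.on('host')` match. It is only a Python sketch of my understanding, not how Kapacitor implements the join: `join_on` and the dict layout are names I made up, the points are samples from my original post (quoted below) with the fractional seconds dropped, and the tolerance check is a plain time difference.

```python
from datetime import datetime, timedelta

HOST = "someapp-production-web-i-0xxxxx3e078a04f20"

# More specific stream: grouped by name and host.
process_points = [
    {"time": datetime(2017, 2, 10, 3, 25, 56),
     "tags": {"host": HOST, "name": "Unicorn Master Processes"},
     "fields": {"count": 1}},
    {"time": datetime(2017, 2, 10, 3, 25, 55),
     "tags": {"host": HOST, "name": "Unicorn Worker Processes"},
     "fields": {"count": 6}},
]

# Less specific stream: grouped by host only.
system_points = [
    {"time": datetime(2017, 2, 10, 3, 26, 4),
     "tags": {"host": HOST},
     "fields": {"uptime": 55399}},
]


def join_on(specific, general, on, tolerance, names=("process", "sys")):
    """Pair points that agree on the `on` tags and are close enough in time.

    The joined point keeps the tags of the more specific side, and every
    field is prefixed with the name given to its parent stream.
    """
    joined = []
    for left in specific:
        for right in general:
            same_dims = all(left["tags"].get(t) == right["tags"].get(t) for t in on)
            close = abs(left["time"] - right["time"]) <= tolerance
            if same_dims and close:
                fields = {f"{names[0]}.{k}": v for k, v in left["fields"].items()}
                fields.update({f"{names[1]}.{k}": v for k, v in right["fields"].items()})
                joined.append({"time": left["time"],
                               "tags": dict(left["tags"]),
                               "fields": fields})
    return joined


joined = join_on(process_points, system_points,
                 on=["host"], tolerance=timedelta(seconds=15))
for point in joined:
    print(point["tags"]["name"], point["fields"])
```

Both process points pair with the single system point for the host, and each result still carries its own `name` tag, which is how I read "grouped by the more specific set of tags".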

> Other than that your eval looks correct.
>

The eval gives me this error in the logs:

eval9] 2017/02/11 00:23:45 E! no field or tag exists for process.time

When I look at the data sent to VictorOps I see this snippet listing the 
columns:

["time","process.count","sys.load1","sys.load15","sys.load5","sys.n_cpus","sys.n_users","sys.uptime","sys.uptime_format"]

If I change from "process.time" to "time" (which seems to be the correct 
thing to do) I get:

 E! invalid math operator - for type time

I'm guessing this is because I see these values associated with the above 
columns when I peek into the VictorOps message:

[["2017-02-11T00:51:00Z",0,0.05,0.05,0.09,2,1,201," 0:03"]]}]

I'm guessing the time maths is choking on that? What's the handling so 
that times are processable inside Kapacitor, but get sent out in a readable 
format?
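
To be clear about the arithmetic I'm after, here it is as a small Python sketch using the values from the payload above. It converts the timestamp to Unix seconds first, so it is the same kind of number as the uptime, and only formats the result at the end. This is plain Python for illustration, not TICKscript; whether the lambda language has a comparable conversion is exactly what I'm asking.

```python
from datetime import datetime, timezone

# Values copied from the payload above.
point_time = "2017-02-11T00:51:00Z"
uptime_seconds = 201

# Parse the timestamp into a timezone-aware datetime (UTC).
timestamp = datetime.strptime(point_time, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# Do the subtraction on plain integers (seconds since the Unix epoch).
epoch_seconds = int(timestamp.timestamp())
boot_epoch = epoch_seconds - uptime_seconds

# Convert back to a readable timestamp only for output.
boot_time = datetime.fromtimestamp(boot_epoch, tz=timezone.utc)
boot_time_text = boot_time.strftime("%Y-%m-%dT%H:%M:%SZ")

print(boot_time_text)  # 2017-02-11T00:47:39Z
```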

All that aside, I've commented out the eval for now, as it seemed to 
stop data flowing through. Apart from this eval, everything is now 
working as hoped!(tm). 

Thanks for all your explanations Nathan.



>
> var process_counts = stream
>     |from()
>         .measurement('process_count')
>         // I am assuming that you want tag name and fully_qualified_role
>         // as well since you referenced it below in the alert.
>         .groupBy('name', 'fully_qualified_role', 'host')
>     |log()
>
> var box = stream
>     |from()
>         .measurement('system')
>         // Only group by host here, since that is all the tag info we have.
>         .groupBy('host')
>     |log()
>
> var process_with_uptime = process_counts
>     |join(box)
>         .as('process', 'sys')
>         .tolerance(15s)
>         .on('host')
>     |log()
>         .prefix('** JOIN')
>     |eval(lambda: "process.time" - "sys.uptime")
>         .as('boot_time')
>
> process_with_uptime
>     |log()
>         .level('DEBUG')
>         .prefix('** PROCSESS WITH UPTIME')
>     |alert()
>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>         .message('{{ index .Tags "name" }} has {{index .Fields "count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
>         .info(lambda: "process.count" >= 0)
>         .warn(lambda: "process.count" == 0)
>         .crit(lambda: ("process.count" == 0) AND ("sys.uptime" > 120)) // possible alternative
>         .victorOps()
>
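
My reading of the three alert conditions in the script above, as a small Python sketch. `alert_level` and `grace_seconds` are names I made up, and it assumes Kapacitor reports the most severe level whose condition is true, which is how I understand the alert node.

```python
def alert_level(process_count, uptime_seconds, grace_seconds=120):
    """Return the most severe level whose condition holds for one joined point."""
    # crit: the process is gone and the box is past its grace period.
    if process_count == 0 and uptime_seconds > grace_seconds:
        return "CRITICAL"
    # warn: the process is gone, but the box may still be starting up.
    if process_count == 0:
        return "WARNING"
    # info: any non-negative count matches.
    if process_count >= 0:
        return "INFO"
    return "OK"


print(alert_level(6, 55399))  # processes running on a long-lived box
print(alert_level(0, 60))     # no processes, box only just booted
print(alert_level(0, 201))    # no processes, grace period over
```

So a freshly booted box with no processes only warns, and it goes critical once the uptime passes 120 seconds.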
>
> On Friday, February 10, 2017 at 3:29:56 AM UTC-7, Glenn Davy wrote:
>>
>> Hi Peeps
>>
>> I'm trying to learn to use Kapacitor and hitting a few snags in my 
>> understanding, trying to solve this simple problem has surfaced all sorts 
>> of questions, and I'm hoping to get some of my misunderstandings sorted out.
>>
>> I've got a measurement called process_count that shows a count of the 
>> number of a given process running by host, and there's a 'system' table 
>> which comes from Telegraf and is essentially the output of `uptime`.
>>
>> If that process stops running (process_count goes to 0), I want to be 
>> alerted. But when a new box comes up, I want to allow some breathing space 
>> before we get alerts.
>>
>> There are obviously a few ways to solve this (I've even tried some!) and 
>> I'm keen to learn better ways, but I'm running with this as a sample for 
>> asking questions.
>>
>> Samples are sent to Influx at about 30-second intervals (+/- jitter).
>>
>> I'm trying to join process_count records onto the system record (1 system 
>> record for many process_count records), so that there's an uptime field 
>> available when I determine my critical alert.
>>
>>
>> Here's a sample from my process_count table and from my system
>>
>> ```
>> > select count, host, name from process_count where instance_id='i-0xxxxx3e078a04f20' group by instance_id order by time desc limit 10;
>> name: process_count
>> tags: instance_id=i-0xxxxx3e078a04f20
>> time                           count host                                       name
>> ----                           ----- ----                                       ----
>> 2017-02-10T03:25:56.004751872Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>> 2017-02-10T03:25:55.984448256Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>> 2017-02-10T03:25:25.92088576Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>> 2017-02-10T03:25:25.900282368Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>> 2017-02-10T03:24:55.834618368Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>> 2017-02-10T03:24:55.814406144Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>> 2017-02-10T03:24:25.751718144Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>> 2017-02-10T03:24:25.7313984Z   6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>> 2017-02-10T03:23:55.66639104Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>> 2017-02-10T03:23:55.64570112Z  6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>
>> > select uptime, host from system where host='someapp-production-web-i-0xxxxx3e078a04f20' order by time desc limit 5;
>> name: system
>> time                 uptime host
>> ----                 ------ ----
>> 2017-02-10T03:26:04Z 55399  someapp-production-web-i-0xxxxx3e078a04f20
>> 2017-02-10T03:25:30Z 55365  someapp-production-web-i-0xxxxx3e078a04f20
>> 2017-02-10T03:25:01Z 55336  someapp-production-web-i-0xxxxx3e078a04f20
>> 2017-02-10T03:24:35Z 55309  someapp-production-web-i-0xxxxx3e078a04f20
>> 2017-02-10T03:24:01Z 55276  someapp-production-web-i-0xxxxx3e078a04f20
>> ```
>>
>> And here's the TICKscript. The problem I seem to be having is that nothing 
>> is coming out of the join. I'm not getting any logging out of the .log on 
>> the join or the subsequent stream. I'm hoping that the process_counts zip 
>> to the nearest time (I have a tolerance of 14s) based on host. Also, the 
>> annotations in the DOT script seem to suggest nothing is processed through 
>> these streams.
>>
>>
>> ```
>> ID: someapp_production_process_not_running
>> Error: 
>> Template: 
>> Type: stream
>> Status: enabled
>> Executing: true
>> Created: 09 Feb 17 12:20 UTC
>> Modified: 10 Feb 17 03:35 UTC
>> LastEnabled: 10 Feb 17 03:35 UTC
>> Databases Retention Policies: ["someapp_production"."default"]
>> TICKscript:
>>
>>
>> var process_counts = stream
>>     |from()
>>         .measurement('process_count')
>>     |window()
>>         .period(10m)
>>         .every(30s)
>>     |groupBy('host')
>>     |log()
>>
>> var box = stream
>>     |from()
>>         .measurement('system')
>>     |window()
>>         .period(10m)
>>         .every(30s)
>>     |log()
>>     |groupBy('host')
>>
>> var process_with_uptime = process_counts
>>     |join(box)
>>         .as('process', 'sys')
>>         .tolerance(14s)
>>         .streamName('process_with_uptime')
>>         .on('host')
>>     |log()
>>         .prefix('** JOIN')
>>     |eval(lambda: "process.time" - "sys.uptime")
>>         .as('boot_time')
>>
>> process_with_uptime
>>     |log()
>>         .level('DEBUG')
>>         .prefix('** PROCSESS WITH UPTIME')
>>     |alert()
>>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>>         .message('{{ index .Tags "name" }} has {{index .Fields "count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
>>         .info(lambda: "count" >= 0)
>>         .warn(lambda: "count" == 0)
>>         .crit(lambda: ("count" == 0) AND ("sys.uptime" < 120)) // possible alternative
>>         .victorOps()
>>
>> DOT:
>> digraph someapp_production_process_not_running {
>> graph [throughput="0.00 points/s"];
>>
>> stream0 [avg_exec_time_ns="0" ];
>> stream0 -> from5 [processed="23"];
>> stream0 -> from1 [processed="23"];
>>
>> from5 [avg_exec_time_ns="157ns" ];
>> from5 -> window6 [processed="6"];
>>
>> window6 [avg_exec_time_ns="565ns" ];
>> window6 -> log7 [processed="0"];
>>
>> log7 [avg_exec_time_ns="0" ];
>> log7 -> groupby8 [processed="0"];
>>
>> groupby8 [avg_exec_time_ns="0" ];
>> groupby8 -> join10 [processed="0"];
>>
>> from1 [avg_exec_time_ns="452ns" ];
>> from1 -> window2 [processed="17"];
>>
>> window2 [avg_exec_time_ns="1.05µs" ];
>> window2 -> groupby3 [processed="1"];
>>
>> groupby3 [avg_exec_time_ns="0" ];
>> groupby3 -> log4 [processed="0"];
>>
>> log4 [avg_exec_time_ns="0" ];
>> log4 -> join10 [processed="0"];
>>
>> join10 [avg_exec_time_ns="0" ];
>> join10 -> log11 [processed="0"];
>>
>> log11 [avg_exec_time_ns="0" ];
>> log11 -> eval12 [processed="0"];
>>
>> eval12 [avg_exec_time_ns="0" eval_errors="0" ];
>> eval12 -> log13 [processed="0"];
>>
>> log13 [avg_exec_time_ns="0" ];
>> log13 -> alert14 [processed="0"];
>>
>> alert14 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" 
>> infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
>> }
>> ```
>>
>> My questions are:
>> 1) Am I right that this is failing at the join? Or are there 
>> fundamentally bigger problems?
>> 2) What have I done wrong for this join to be failing? Am I completely 
>> misunderstanding the join (or even more generally), or is there just a small 
>> implementation issue?
>> 3) In order to use the result of the join, am I wrong to name it with a 
>> var for reuse below? I thought .streamName('..') might do this without 
>> setting a var, but I simply get an error that 'process_with_uptime' isn't 
>> something that's in scope.
>> 4) Is my overall approach just fundamentally wrong? 
>> 5) Apart from using joins, what's the correct way to take the result of 1 
>> stream (or batch) and use it in another?
>> 6) Should I have approached this some totally different way?
>>
>>
