Thanks Elijah for the explanations! rrdtool etc. won't cut it unfortunately - at least that's what I think.
I would like to trace activity on a whole server, for about 2 hours, and for all/most processes it has running. Nagios being involved means tracking what a few 100k processes did during their short lifetime. I.e., afaik:

- RRDtool will be run 1400x30 times during that period _with_ rrdcached enabled.
- rrdcached will update / flush its journal many, many times during that time.
- Check_MK will do a lot of different things even under the same name (inventory updates, precompiling / linking, and being called by Nagios).

The end result I'm after would show (example):

  2% of IO fell to Check_MK cleaning up the auto checks
 56% were updates of the RRD journal
  5% were actual RRD updates

-> indicating misconfiguration, and that it's better to fix rrdcached & tune for sequential IO than worry about the RRD IO grinder.

I suppose I should split this into two problems:

1. Normalize / better-format the output, so that I get a CSV-like file, with tricks like full and split path (rough sketches of what I mean are at the bottom of this mail).
2. Hire someone who knows R to do reporting / graphs… (OK, that can be done using RRD, but you have no flexibility querying it.)

The way I understand it, it would still be reusable for others that way.

On 11.03.2014, at 16:11, Elijah Wright <[email protected]> wrote:

> Hi Florian,
>
> Pretty sure you're going to need more data; Brendan's scripts are
> expecting *stack traces*, not just the latency numbers and the source
> process - the stacks are where the 'layers' in the flame graph
> visualizations come from.
>
> If you're just collecting the disk latency numbers, and not the
> function hierarchy of the process, you might want to just use graphite
> or rrdtool or something to plot that data - it should be pretty
> understandable, but I don't think you'll have something as useful as
> the flame graph output might be. You really want to know *which part*
> of some process is jamming away at the disk - not just that it is
> happening, the process name, and when.
>
> [This sort of correlation - "what program feature on my system is
> making disk latency blow chunks" - is extremely useful and just beyond
> the edge of what most people's monitoring tools and approaches can
> deal with. There's a really good reason that the flame graph pages
> are littered with DTrace code... ;-) ]
>
> best,
>
> --e
>
>
> On Mon, Mar 10, 2014 at 9:33 PM, Florian Heigl <[email protected]> wrote:
>> Hi,
>>
>> I'm trying to get some dependable data on Nagios IO.
>> Nagios does a lot of disk IO, which is known, but there are no hard
>> numbers on it.
>> It gets especially interesting for systems that _have_ best practices
>> applied:
>> - rrdcached is running, volatile data is written to a RAM disk, etc.
>>
>> My current approach is using systemtap and collecting only write
>> accesses and their latencies.
>>
>> This I have, using the syscall-to-IO probe here:
>> https://sourceware.org/systemtap/examples/keyword-index.html#FILE
>> ...and grep, since I don't really understand all of it.
>>
>> To turn it into something more worthwhile that can be used by more
>> people and show results easily, I want to use the flame graph thing
>> as described at
>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
>>
>> The whole toolkit seems to be able to work with systemtap.
>>
>> The problem:
>> I'm apparently just too stupid. I don't know how to get started.
>> I do not remotely grasp how to take the flamegraph git repo and the
>> script I have and make them do "something"
>> (something being, a sort on IO time spent per path element of the
>> files written to).
>>
>> Did any of you try something similar?
>> Did any of you work with flame graphs and can give some advice?
>>
>>
>> Florian
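
PS - to make problem 1 concrete, here is a rough, untested sketch of the direction I have in mind. It is modelled on iotime.stp from the systemtap examples page linked in my first mail; the probe names and the CSV layout are my own guesses and may need adjusting for your tapset version:

#!/usr/bin/stap
# Sketch only: per-process, per-file write() latency, dumped as CSV on Ctrl-C.
# Files opened before the script starts show up with a placeholder path.

global fdpath          # (pid, fd) -> filename recorded at open time
global wr_us, wr_n     # accumulated write latency (us) and call count

probe syscall.open.return, syscall.openat.return
{
    if ($return >= 0)
        fdpath[pid(), $return] = user_string_quoted(@entry($filename))
}

probe syscall.write.return
{
    us = gettimeofday_us() - @entry(gettimeofday_us())
    path = fdpath[pid(), @entry($fd)]
    if (path == "") path = "<fd-opened-before-trace>"
    wr_us[execname(), path] += us
    wr_n[execname(), path]++
}

probe syscall.close
{
    delete fdpath[pid(), $fd]
}

probe end
{
    printf("execname,path,write_calls,write_us\n")
    # busiest (process, file) pairs first
    foreach ([name, path] in wr_us- limit 1000)
        printf("%s,\"%s\",%d,%d\n", name, path,
               wr_n[name, path], wr_us[name, path])
}

Run it for the two hours with something like "stap -o writes.csv write-lat.stp" (both file names made up); with a few 100k short-lived processes you will probably have to raise -DMAXMAPENTRIES and -DMAXACTION.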
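
For the "split path" trick: flamegraph.pl from the FlameGraph repo does not strictly need stack traces, only lines of "frame1;frame2;... value", so path components can be abused as frames to get the per-path-element breakdown I described. Hypothetical glue, assuming the CSV above sits in writes.csv and the repo is checked out alongside (it breaks on paths containing commas):

tail -n +2 writes.csv | awk -F, '{
    p = $2
    gsub(/"/, "", p)      # drop the CSV quoting
    sub(/^\//, "", p)     # avoid an empty leading frame
    gsub(/\//, ";", p)    # path elements become "stack frames"
    print $1 ";" p, $4    # execname;dir;dir;file  total_us
}' | ./FlameGraph/flamegraph.pl --countname us > write-io-by-path.svg

The width of each box is then microseconds of write() time under that path element, which is exactly the "56% were updates of the RRD journal" kind of statement I am after.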
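
And to Elijah's point about real stack traces: Brendan's CPU flame graph page linked above shows a systemtap recipe built on backtrace()/print_stack plus stackcollapse-stap.pl from the same repo. The same shape should carry over to write latency if the user stack is captured at syscall entry and microseconds are summed at return - an untested adaptation on my part, and ubacktrace() only yields readable frames for binaries with debug info (the -d targets below are just examples):

stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 \
     -D MAXMAPENTRIES=10240 -D MAXACTION=10000 \
     --ldd -d /usr/sbin/nagios -d /usr/bin/rrdcached \
     -e 'global bt, lat
         # grab the user stack when the write starts ...
         probe syscall.write { bt[tid()] = ubacktrace() }
         # ... and charge the elapsed time to that stack on return
         probe syscall.write.return {
             if (!(tid() in bt)) next
             lat[bt[tid()]] <<< gettimeofday_us() - @entry(gettimeofday_us())
             delete bt[tid()]
         }
         probe end {
             foreach (t in lat+) {
                 print_ustack(t)
                 printf("\t%d\n", @sum(lat[t]))
             }
         }' > out.stap-stacks
./FlameGraph/stackcollapse-stap.pl out.stap-stacks > out.folded
./FlameGraph/flamegraph.pl --countname us < out.folded > write-latency.svg

If that holds up, the graph answers Elijah's "which part of the process is jamming away at the disk" directly, instead of only naming the process.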
