Thanks Elijah for the explanations! rrdtool etc. won't cut it unfortunately - at least that's what I think.
I would like to trace activity on a whole server, for about 2 hours, and for all/most processes it has running. Nagios being involved means tracking what a few 100k processes did during their short lifetime. I.e., afaik:

- RRDtool will be run 1400x30 times during that period _with_ rrdcached enabled.
- rrdcached will update / flush its journal many, many times during that time.
- Check_MK will do a lot of different things even under the same name (inventory updates, precompiling / linking, and being called by Nagios).

The end result I'm after would show (example):

  2% of IO fell to Check_MK cleaning up the auto checks
 56% were updates of the RRD journal
  5% were actual RRD updates

-> indicating misconfiguration, and that it's better to fix rrdcached & tune for sequential IO than worry about the RRD IO grinder.

I suppose I should split this into two problems:

1. Normalize / better-format the output, so that I get a CSV-like file, with tricks like full and split path (rough sketches of what I mean are at the bottom of this mail).
2. Hire someone who knows R to do reporting / graphs… (OK, that can be done using RRD, but you have no flexibility querying it.)

The way I understand it, it would still be reusable for others that way.

On 11.03.2014, at 16:11, Elijah Wright <[email protected]> wrote:

> Hi Florian,
>
> Pretty sure you're going to need more data; Brendan's scripts are
> expecting *stack traces*, not just the latency numbers and the source
> process - the stacks are where the 'layers' in the flame graph
> visualizations come from.
>
> If you're just collecting the disk latency numbers, and not the
> function hierarchy of the process, you might want to just use graphite
> or rrdtool or something to plot that data - it should be pretty
> understandable, but I don't think you'll have something as useful as
> the flame graph output might be. You really want to know *which part*
> of some process is jamming away at the disk - not just that it is
> happening, the process name, and when.
>
> [This sort of correlation - "what program feature on my system is
> making disk latency blow chunks" - is extremely useful and just beyond
> the edge of what most people's monitoring tools and approaches can
> deal with. There's a really good reason that the flame graph pages
> are littered with DTrace code... ;-) ]
>
> best,
>
> --e
>
>
> On Mon, Mar 10, 2014 at 9:33 PM, Florian Heigl <[email protected]> wrote:
>> Hi,
>>
>> I'm trying to get some dependable data on Nagios IO.
>> Nagios does a lot of disk IO, which is known, but there are no hard
>> numbers on it.
>> It gets especially interesting for systems that _have_ best practices
>> applied:
>> - rrdcached is running, volatile data is written to a RAM disk, etc.
>>
>> My current approach is using systemtap and collecting only write
>> accesses and their latencies.
>>
>> This I have, using the syscall-to-IO probe here:
>> https://sourceware.org/systemtap/examples/keyword-index.html#FILE
>> ...and grep, since I don't really understand all of it.
>>
>> To turn it into something more worthwhile that can be used by more
>> people and show results easily, I want to use the flame graph thing
>> as described at
>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
>>
>> The whole toolkit seems to be able to work with systemtap.
>>
>> The problem:
>> I'm apparently just too stupid. I don't know how to get started.
>> I do not remotely grasp how to take the flamegraph git repo and the
>> script I have and make them do "something"
>> (something being, a sort on IO time spent per path element of the
>> files written to).
>>
>> Did any of you try something similar?
>> Did any of you work with flame graphs and can give some advice?
>>
>>
>> Florian
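
PS - to make problem 1 concrete, here is a rough, untested sketch of the direction I have in mind. It is modelled on iotime.stp from the systemtap examples page linked in my first mail; the probe names and the CSV layout are my own guesses and may need adjusting for your tapset version:

#!/usr/bin/stap
# Sketch only: per-process, per-file write() latency, dumped as CSV on Ctrl-C.
# Files opened before the script starts show up with a placeholder path.

global fdpath          # (pid, fd) -> filename recorded at open time
global wr_us, wr_n     # accumulated write latency (us) and call count

probe syscall.open.return, syscall.openat.return
{
    if ($return >= 0)
        fdpath[pid(), $return] = user_string_quoted(@entry($filename))
}

probe syscall.write.return
{
    us = gettimeofday_us() - @entry(gettimeofday_us())
    path = fdpath[pid(), @entry($fd)]
    if (path == "") path = "<fd-opened-before-trace>"
    wr_us[execname(), path] += us
    wr_n[execname(), path]++
}

probe syscall.close
{
    delete fdpath[pid(), $fd]
}

probe end
{
    printf("execname,path,write_calls,write_us\n")
    # busiest (process, file) pairs first
    foreach ([name, path] in wr_us- limit 1000)
        printf("%s,\"%s\",%d,%d\n", name, path,
               wr_n[name, path], wr_us[name, path])
}

Run it for the two hours with something like "stap -o writes.csv write-lat.stp" (both file names made up); with a few 100k short-lived processes you will probably have to raise -DMAXMAPENTRIES and -DMAXACTION.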
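
For the "split path" trick: flamegraph.pl from the FlameGraph repo does not strictly need stack traces, only lines of "frame1;frame2;... value", so path components can be abused as frames to get the per-path-element breakdown I described. Hypothetical glue, assuming the CSV above sits in writes.csv and the repo is checked out alongside (it breaks on paths containing commas):

tail -n +2 writes.csv | awk -F, '{
    p = $2
    gsub(/"/, "", p)      # drop the CSV quoting
    sub(/^\//, "", p)     # avoid an empty leading frame
    gsub(/\//, ";", p)    # path elements become "stack frames"
    print $1 ";" p, $4    # execname;dir;dir;file  total_us
}' | ./FlameGraph/flamegraph.pl --countname us > write-io-by-path.svg

The width of each box is then microseconds of write() time under that path element, which is exactly the "56% were updates of the RRD journal" kind of statement I am after.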
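
And to Elijah's point about real stack traces: Brendan's CPU flame graph page linked above shows a systemtap recipe built on backtrace()/print_stack plus stackcollapse-stap.pl from the same repo. The same shape should carry over to write latency if the user stack is captured at syscall entry and microseconds are summed at return - an untested adaptation on my part, and ubacktrace() only yields readable frames for binaries with debug info (the -d targets below are just examples):

stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 \
     -D MAXMAPENTRIES=10240 -D MAXACTION=10000 \
     --ldd -d /usr/sbin/nagios -d /usr/bin/rrdcached \
     -e 'global bt, lat
         # grab the user stack when the write starts ...
         probe syscall.write { bt[tid()] = ubacktrace() }
         # ... and charge the elapsed time to that stack on return
         probe syscall.write.return {
             if (!(tid() in bt)) next
             lat[bt[tid()]] <<< gettimeofday_us() - @entry(gettimeofday_us())
             delete bt[tid()]
         }
         probe end {
             foreach (t in lat+) {
                 print_ustack(t)
                 printf("\t%d\n", @sum(lat[t]))
             }
         }' > out.stap-stacks
./FlameGraph/stackcollapse-stap.pl out.stap-stacks > out.folded
./FlameGraph/flamegraph.pl --countname us < out.folded > write-latency.svg

If that holds up, the graph answers Elijah's "which part of the process is jamming away at the disk" directly, instead of only naming the process.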
