On Wed, Jun 9, 2010 at 10:09 PM, Willy Tarreau <w...@1wt.eu> wrote:
> Hi David,
>
> On Wed, Jun 09, 2010 at 04:37:28PM -0700, David Birdsong wrote:
>> I'm pretty excited to start using halog, but dumping out the usage is
>> about the only documentation I can turn up - which is not explaining
>> anything to me. Is there anything more substantial on how to use
>> halog?
>
> you're right. At the beginning, it was just a tool to help me spot
> production issues, then I added features and explained to a few
> people how to use it. But obviously some doc is missing.
>
> I'll be quick here but I hope it will help you to start. First, you
> should see it as a haproxy-specific grep with a few enhanced filters
> and outputs. It can only do one output format at a time, but you can
> combine several input filters.
>
> Input filters :
>   -e : only consider lines which don't report an error (timeout,
>        connect, 5xx, ...)
>
>   -E : only consider lines which do report an error (timeout, connect,
>        5xx, ...)
>
>   -rt XXX : only consider lines with server response times higher than
>             XXX ms
>
>   -RT XXX : only consider lines with server response times lower than
>             XXX ms
>
>   -ad XXX : only consider lines which indicate an accept time after a
>             silence of XXX ms
>
>   -ac XXX : to be used with -ad, only consider those lines if at least
>             XXX lines are grouped after the silence.
>
>   -v : invert the selection
>
> Some filters are incompatible. You can have only one of -e and -E, and
> you can have only one of -rt and -RT.
>
> Since some syslogs add a field for the sender's host and others don't,
> you can adjust the field offsets with -s. By default, "-s 1" is assumed,
> to skip one field for the origin host. You can use -s 0 if your syslog
> does not add it (or if you use netcat to log). Or you can use -s 2 if
> your syslog adds other fields.
> Negative values are also permitted if that helps.
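A rough illustration of how the input filters above combine, written as a Python sketch and not taken from halog's source. It assumes the usual HTTP log timer field "Tq/Tw/Tc/Tr/Tt" in milliseconds, treats "error" as a -1 in any timer or a 5xx status (a guess based on the descriptions above), and uses strict comparisons for the response-time thresholds. The function and parameter names are made up for the example.

```python
from typing import List, Optional


def parse_timers(field: str) -> List[int]:
    # Assumes a timer field shaped like "Tq/Tw/Tc/Tr/Tt", values in ms.
    return [int(part) for part in field.split("/")]


def reports_error(timers: List[int], status: int) -> bool:
    # Assumption: an error is a -1 in any timer or a 5xx status code.
    return any(t < 0 for t in timers) or 500 <= status <= 599


def keep_line(
    timers_field: str,
    status: int,
    errors: Optional[bool] = None,    # False ~ -e, True ~ -E, None = no filter
    rt_above: Optional[int] = None,   # ~ -rt XXX
    rt_below: Optional[int] = None,   # ~ -RT XXX
    invert: bool = False,             # ~ -v
) -> bool:
    if rt_above is not None and rt_below is not None:
        # The mail says only one of -rt and -RT may be used.
        raise ValueError("use only one of rt_above and rt_below")
    timers = parse_timers(timers_field)
    response_time = timers[3]  # Tr: server response time
    keep = True
    if errors is not None:
        keep = keep and (reports_error(timers, status) == errors)
    if rt_above is not None:
        keep = keep and response_time > rt_above
    if rt_below is not None:
        keep = keep and response_time < rt_below
    # Inverting the selection flips the final decision.
    return keep != invert
```

Several filters are simply ANDed together, and the inversion is applied last, which is one plausible reading of "you can combine several input filters".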
>
> The output format can be selected with the following flags :
>   -q : don't show a warning for unparsable lines (eg: "server XXX is UP")
>
>   -c : only report the number of lines which match
>
>   -gt : outputs a list of x,y values to be used with gnuplot to visually
>         check if everything's OK. It was its first use, but it's not used
>         anymore, as it was not very convenient to export values.
>
>   -pct: report a percentile table of request time, connect time, response
>         time, data time. The output contains the percent and absolute
>         number of requests served in less than XXX ms for each field.
>         It's very helpful to quickly spot TCP retransmits because you can
>         see if you have large 3-second steps. Also, it is convenient to
>         use on prod when you suspect a site is slow. Just a quick check
>         and you can tell if your timers are slower than other days.
>
>   -st : report the distribution of the status codes (200, 302, ...).
>         Again, this is meant as a quick help. You run that when you
>         suspect an issue and you immediately see if some files are
>         missing (404) or some errors are reported.
>
>   -srv: enumerate all servers found in the logs with their respective
>         status codes distribution (2xx, 3xx, 4xx, 5xx), the number of
>         errors (-1 anywhere in a timer), the error ratio, the average
>         response time (without data) and the average connect time.
>
>   -ad and -ac provide a special output. I don't remember the format,
>         they were developed to track an issue with huge packet losses.
>         I seem to remember they only report the time of the accept of
>         requests matching the criteria, the length of the silence as
>         well as the number of requests accepted at once. The goal was
>         to find abnormally long silences. For instance, if you have a
>         load between 500 and 2000 hits/s 24h a day, you're almost
>         certain that a one second silence indicates an issue. Being
>         able to spot the end of silences and compare them on several
>         machines helps find the origin of the trouble (switch, machine
>         swapping, etc...)
>
wow, thanks for the run-down. there's a lot here; plenty to get me started.
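For anyone trying to picture the -ad/-ac idea, here is a small Python sketch of silence detection over accept timestamps. It is not halog's implementation and does not reproduce its output format, which Willy says he does not remember. It assumes accept times are already available as integer milliseconds, and it interprets "grouped after the silence" as "accepted within a short window after the silence ends"; the burst_window_ms parameter is hypothetical.

```python
from typing import List, Tuple


def find_silences(
    accept_ms: List[int],
    min_silence_ms: int,       # plays the role of -ad XXX
    min_burst: int = 1,        # plays the role of -ac XXX
    burst_window_ms: int = 0,  # hypothetical: how close "at once" is
) -> List[Tuple[int, int, int]]:
    """Return (accept time, silence length, burst size) for each silence."""
    times = sorted(accept_ms)
    found = []
    for i in range(1, len(times)):
        gap = times[i] - times[i - 1]
        if gap < min_silence_ms:
            continue
        # Count the requests accepted right when the silence ends.
        burst = 0
        j = i
        while j < len(times) and times[j] - times[i] <= burst_window_ms:
            burst += 1
            j += 1
        if burst >= min_burst:
            found.append((times[i], gap, burst))
    return found
```

With a steady load, any tuple returned for a one-second threshold marks the end of a suspicious gap, and the accept times can then be compared across machines as described above.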
> In practice, you generally just want to run -st when you think you may
> be running into trouble. If you see an abnormal error distribution, then
> you'll rerun with -srv to find which server is the culprit (if any). I
> know some people who run that continuously coupled with a tail -5000.
> That way they get a realtime stats distribution for their servers.
>
thanks, -srv i think is what i've been hoping for to track down bad
backends in a backend section that has roughly 400 servers.

> The percentile output is more to be used on full day logs, it helps check
> how heavy days compare with calm ones in terms of response times. But it
> can be used by prod people to quickly check if there are any errors. At
> least from what I have observed, sometimes people are not sure about the
> fields, but they're quite sure that two outputs don't look similar and
> that one of them indicates a problem. That's already a good thing because
> they can say in one second "everything looks OK to me".
>
> Last point, I found that -rt/-RT can be used for debugging, as they help
> spot abnormally long requests. In this case, you'll end up running the
> tool several times in a row. I found it very convenient to first do a
> "halog -e < file > /dev/shm/file" then run all searches from
> /dev/shm/file to ensure there's no disk activity anymore. It requires
> that your file fits in /dev/shm though, which is not often the case.
>
actually, we've configured syslog-ng to log to /dev/shm already ;)

> Hoping this helps,
> Willy
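To make the -srv description concrete, here is a Python sketch of the kind of per-server summary it describes: status class counts, number of errors (a -1 in any timer), error ratio, and average connect and response times. This is an illustration only, not halog's code or output layout. It assumes records were already parsed into (server, status, timers) tuples with timers ordered Tq, Tw, Tc, Tr, Tt, and it averages only non-negative timers, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (server name, HTTP status, [Tq, Tw, Tc, Tr, Tt] in ms); hypothetical shape
Record = Tuple[str, int, List[int]]

CLASSES = ("2xx", "3xx", "4xx", "5xx")


def summarize_servers(records: Iterable[Record]) -> Dict[str, dict]:
    raw = defaultdict(lambda: {
        "2xx": 0, "3xx": 0, "4xx": 0, "5xx": 0, "other": 0,
        "total": 0, "errors": 0,
        "tc_sum": 0, "tc_n": 0, "tr_sum": 0, "tr_n": 0,
    })
    for server, status, timers in records:
        s = raw[server]
        s["total"] += 1
        status_class = f"{status // 100}xx"
        s[status_class if status_class in CLASSES else "other"] += 1
        if any(t < 0 for t in timers):
            s["errors"] += 1
        connect, response = timers[2], timers[3]
        if connect >= 0:  # skip -1 so failed connects do not skew the mean
            s["tc_sum"] += connect
            s["tc_n"] += 1
        if response >= 0:
            s["tr_sum"] += response
            s["tr_n"] += 1

    summary = {}
    for server, s in raw.items():
        summary[server] = {
            **{k: s[k] for k in CLASSES + ("other", "total", "errors")},
            "error_ratio": s["errors"] / s["total"],
            "avg_connect_ms": s["tc_sum"] / s["tc_n"] if s["tc_n"] else None,
            "avg_response_ms": s["tr_sum"] / s["tr_n"] if s["tr_n"] else None,
        }
    return summary
```

Sorting the result by error_ratio would be one way to surface a bad backend among several hundred servers, in the spirit of the "-st first, then -srv" workflow described above.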