I've done a lot of log-munging in Perl, and I must report that for any
significant amount of logs, regexps just aren't fast enough. In some cases
I've found a solution using index() and rindex() that was adequate. But once
you get to that level of optimisation, Perl becomes as ugly as C, and the C
solution is generally more flexible (because it doesn't have to be
hand-optimised to death to achieve acceptable speed).
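
To give an idea of what the index()/rindex() approach looks like, here is a minimal sketch rather than anything I'd call finished. It assumes Common or Combined Log Format with the request in the first pair of double quotes, and the function name extract_path is made up for illustration. Requests with no protocol after the path are simply skipped.

```shell
# Sketch: pull the request path out of each log line on stdin using
# index()/rindex()/substr() instead of a regexp.
# Assumption: the request looks like "GET /path HTTP/1.x".
extract_path() {
    perl -ne '
        my $start = index($_, q{"GET });       # find the opening of the request
        next if $start < 0;
        $start += 5;                            # skip the quote, GET and the space
        my $close = index($_, q{"}, $start);   # closing quote of the request
        next if $close < 0;
        my $end = rindex($_, q{ }, $close);    # last space before that quote
        next if $end < $start;                  # no protocol field, skip the line
        print substr($_, $start, $end - $start), "\n";
    '
}
```

You'd use it as one stage of a pipe, e.g. zcat logfile.gz | extract_path | sort | uniq -c. Whether it actually beats your regexp is something to measure on your own logs.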

If you have to use regexps, it's worth tinkering with them. Often with
careful use of character classes, you can save Perl from having to do
backtracking: [^"]* instead of .* to match up to the next quote, for
example. Try to avoid anchoring from the end of the string. It looks
like it should be fast, but in my experience it isn't. Anchor to the start
of the string where it makes sense, but not if it makes the regexp more
complicated. Complex regular expressions are really slow, so try breaking
them down into several smaller ones.
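
Here is a rough sketch of both ideas together: a negated character class in place of a greedy .*, and two small patterns in place of one big one. The function name path_and_size is hypothetical, and the patterns assume the usual "GET /path HTTP/1.x" status bytes layout. Treat it as a starting point for your own timing tests, not as a guaranteed win.

```shell
# Sketch: print "path<TAB>bytes" for each GET request read from stdin.
# Assumption: the status code and byte count follow the quoted request.
path_and_size() {
    perl -ne '
        # Negated class: stops at the first space or quote, nothing to backtrack.
        next unless /"GET ([^ "]+)/;
        my $path = $1;
        # Second small pattern picks up the byte count after the status code.
        next unless /" \d+ (\d+)/;
        print "$path\t$1\n";
    '
}
```

Lines where the byte count is "-" fail the second pattern and are dropped, which is usually what you want when you're looking at sizes.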

On the other hand, for doing ad-hoc queries against server logs, Perl is
usually the language of choice. Cute tip: since the grep variants are way
faster than Perl, use them to narrow the field before Perl does the grunt
work. Say you want a list of JPEG files larger than 200k, together with how
often they were served:

zgrep -F ".jpg" logfile.gz | grep -E ' [0-9]{6,} ' |
perl -ne 'print "$1\t$2\n" if / "GET (\/[^\s"]+)[^"]*" \d+ (\d+) / && $2 > 200*1024' |
sort | uniq -c

Always test on a subset of your logs first! Where I work, a command like
this will take an hour on a full month's logs, and you'll be very annoyed if
you wait that long to discover you made a typo.
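
One way to do that, sketched as a small helper. The name sample_log and the default of 10000 lines are my own choices, not anything standard.

```shell
# Sketch: emit only the first N lines of a gzipped log, so the rest of the
# pipe can be tried out quickly before running it on the full file.
# Usage: sample_log FILE.gz [N]     (N defaults to 10000, an arbitrary pick)
sample_log() {
    gzip -dc "$1" | head -n "${2:-10000}"
}
```

Then run something like sample_log logfile.gz 5000 | grep -F ".jpg" | perl -ne '...' | sort | uniq -c, check the output looks sane, and only then swap the sample for the real thing. Bear in mind the first few thousand lines of a log may not be representative of the whole month.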

Tip 3: "top" is good for getting an immediate idea of how efficient your
command is. Ideally you want the "gzip" process using 80% or more of the
CPU. If it's only pulling 20%, the later stages are holding it back and
the job will take about four times as long as it would at 80%. With a
multi-stage pipe like this, you can easily see which stage is the bottleneck.


