On Sun, 3 Apr 2005 13:58:56 -0400 (Eastern Daylight Time), Joshua Slive <[EMAIL PROTECTED]> wrote:

Does someone with a high-traffic, general-interest web site want to take a look through their logs for these user-agent strings? I don't mind keeping them if they make up even 1/100 of a percent of the traffic, but it seems silly to keep these extra regexes on every single request if these clients don't exist anymore in the wild.



Regexes are pretty cheap for a 'normal' apache setup.

In the initial testing of a production server (2x 3.2 GHz Xeon, 6 GB RAM) we found that, serving static pages, the overhead of processing regexes didn't become noticeable until we had >1000 rewriting rules. Even then, at least 30% of the hits on this server are CGI scripts, so the overhead of regexes is really nothing compared to the other ways we abuse our machine.

In doing this testing I did notice that Apache's handling of regexes is pretty simplistic. Much of the time you can consolidate a large stack of regexes into a single state machine, which could give vast improvements (factors of hundreds or thousands) in performance for large rule sets. On the other hand, it doesn't really matter in practice.
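To illustrate the consolidation idea, here is a minimal sketch in Python; the patterns are hypothetical stand-ins, not the actual rules discussed above. Note that Python's `re` is a backtracking engine, so this only reduces per-request bookkeeping; the factor-of-hundreds win described above assumes an engine that compiles the alternation into a single DFA-style state machine.

```python
import re

# Hypothetical user-agent substrings to block (illustrative names only).
BAD_AGENTS = ["ExampleMirrorBot", "ExampleHarvester", "ExampleScraper"]

# Naive approach: one compiled regex per rule, tried one after another.
separate = [re.compile(re.escape(p)) for p in BAD_AGENTS]

def blocked_separate(ua: str) -> bool:
    """Scan the UA once per rule -- cost grows with the rule count."""
    return any(r.search(ua) for r in separate)

# Consolidated approach: a single alternation compiled once. A DFA-based
# engine would scan the input once regardless of how many alternatives
# the combined pattern contains.
combined = re.compile("|".join(re.escape(p) for p in BAD_AGENTS))

def blocked_combined(ua: str) -> bool:
    """Single scan against the merged pattern."""
    return combined.search(ua) is not None
```

Both functions agree on every input; the difference is purely how many passes over the string the engine has to make.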

The people we've inherited this server from left us several very large regexes, each with a few hundred pipe symbols, that match the UAs of non-browser clients we don't want using our service. The trouble is that this kind of regex inevitably starts mutating into malignant forms as people start using parens, and we have no documentation for the rules. On slow days I think about breaking these up into 500-1000 rules, which we could in principle comment one-by-one. This wouldn't really impact the performance of our machine under 'real' circumstances, but we could measure the impact under specialized testing.
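A sketch of what the broken-up, commented form might look like, using mod_setenvif rather than mod_rewrite; the agent names and dates are made up for illustration, not taken from the actual rule set:

```apache
# One rule per client, each with a note on why it was added.

# Blocked 2005-04: aggressive site-mirroring tool (hypothetical example)
SetEnvIfNoCase User-Agent "ExampleMirrorBot" bad_ua
# Blocked 2005-04: address harvester (hypothetical example)
SetEnvIfNoCase User-Agent "ExampleHarvester" bad_ua

<Location />
    Order Allow,Deny
    Allow from all
    Deny from env=bad_ua
</Location>
```

Each rule costs one extra regex per request, which is the measurable-but-irrelevant overhead described above; in exchange every entry can carry its own history.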
