Justin Mason wrote:
John D. Hardin writes:
On Tue, 22 Jan 2008, George Georgalis wrote:

On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:

Neither am I. Another thing to consider is that the fraction of defined
rules that actually hit and affect the score is rather small. The
greatest optimization would be to not test REs you know will fail; but
how do you do *that*?
Thanks for all the followups on my inquiry. I'm glad the topic is/was
considered, and it looks like there is some room for development, but
I now realize it is not as simple as I thought it might have been.

In answer to the above question, maybe the tests need their own
scoring? E.g., fast tests with big spam scores would get a higher test
score than slow tests with low spam scores.
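
A minimal sketch of that cost/benefit idea, assuming per-rule timing
data were available; the rule names, scores, and timings below are all
invented for illustration. Rules are ranked by spam-score contribution
per millisecond of matching cost, so cheap, high-impact rules run
first:

from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    spam_score: float    # score the rule adds when it hits
    avg_cost_ms: float   # measured average time to run its RE

def priority(rule):
    # Higher is better: large score contribution, low matching cost.
    return abs(rule.spam_score) / max(rule.avg_cost_ms, 0.001)

rules = [
    Rule("FAST_BIG_SCORE", 3.5, 0.05),   # hypothetical rules
    Rule("SLOW_LOW_SCORE", 0.1, 4.00),
    Rule("FAST_LOW_SCORE", 0.2, 0.04),
]

for r in sorted(rules, key=priority, reverse=True):
    print(f"{r.name}: priority {priority(r):.1f}")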

Maybe there is some way to establish a hierarchy at startup which
groups rule processing into nodes: some nodes finish quickly, some
have dependencies, some are negative, etc.
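
One way to read that: build a dependency graph of rule groups and walk
it in topological order, so independent, cheap nodes can run first. A
rough sketch using Python's standard graphlib; the node names and
dependencies here are invented for illustration, not SpamAssassin's
actual rule structure:

from graphlib import TopologicalSorter

# node -> the nodes it depends on (hypothetical grouping)
graph = {
    "header_fast": set(),
    "body_fast":   set(),
    "body_slow":   {"body_fast"},    # run only after the cheap body node
    "meta_rules":  {"header_fast", "body_fast", "body_slow"},
}

# static_order() yields each node only after all its dependencies
for node in TopologicalSorter(graph).static_order():
    print("evaluate node:", node)
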
Loren mentioned to me in a private email: "common subexpressions".

It would be theoretically possible to analyze all the rules in a given
set (e.g. body rules) to extract common subexpressions and develop a
processing/pruning tree based on that. You'd probably gain some
performance when scanning messages, but at the cost of how much
startup/compile time?
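
As a toy illustration of that pruning idea (not how sa-compile
actually works): pull a required literal substring out of each RE and
only run the full expression when the literal appears in the message.
The rules and literals below are made up:

import re

# rule name -> (full regex, a literal the regex requires)
rules = {
    "VIAGRA_OFFER": (r"v[i1]agra\s+offer", "agra"),
    "FREE_MONEY":   (r"free\s+money!{2,}", "money"),
}

def scan(body):
    hits = []
    lowered = body.lower()
    for name, (pattern, literal) in rules.items():
        if literal not in lowered:   # cheap prefilter skips the slow RE
            continue
        if re.search(pattern, lowered):
            hits.append(name)
    return hits

print(scan("Limited FREE MONEY!! and v1agra offer inside"))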

I experimented with this concept in my sa-compile work, but I could
achieve any speedup on real-world mixed spam/ham datasets.

Feel free to give it a try though ;)

--j.



You do mean *couldn't* achieve any speedup, correct?

-Jim
