Summary:
*Should/could any consideration be given to having ASSP scan the entire
message at the time it is received for Bombs (only), while still using
MaxBytes for Bayesian/HMM?*

We've had some cleverly crafted messages slip through all of our
filters.  They would be easy to catch with Bombs if only the catchable
content came before the MaxBytes cutoff.  These messages are 20 KB or
larger, and the scam phone number sits at the very end, well past
MaxBytes.  I want to use Bombs to catch those scam phone numbers.

With MaxBytes set to 3000, which is useful for a faster RebuildSpamDB, these
BombDataRE matches just aren't being caught.  If I increase MaxBytes,
BombDataRE catches them, but then RebuildSpamDB (probably? see below)
takes longer than it needs to.

So, is there any value in considering a *MaxBytesAdditionalForBombs* variable
that would be *added to MaxBytes* and used only when scanning for bombs as
messages arrive?  Would that kill performance?  Are there other downsides?

Bayesian/HMM could still look at only the first MaxBytes of each message,
since that is all that is used when building those databases.
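
To make the idea concrete, here is a minimal sketch in Python (not ASSP's
actual Perl code) of the two scan windows I have in mind.  The names
EXTRA_BYTES_FOR_BOMBS, scan_windows and bomb_hit are hypothetical, and the
phone-number pattern is only an illustration, not a real BombDataRE entry.

```python
import re

MAX_BYTES = 3000                # stands in for the MaxBytes setting
EXTRA_BYTES_FOR_BOMBS = 30000   # hypothetical MaxBytesAdditionalForBombs

# Illustrative pattern only: a toll-free style number such as 1-800-555-0199.
BOMB_RE = re.compile(rb"\b1[-. ]?8\d\d[-. ]?\d{3}[-. ]?\d{4}\b")


def scan_windows(message: bytes):
    """Return (bayes_window, bomb_window) for one raw message."""
    # Bayesian/HMM keeps seeing only the first MAX_BYTES.
    bayes_window = message[:MAX_BYTES]
    # The bomb check sees MAX_BYTES plus the additional allowance.
    bomb_window = message[:MAX_BYTES + EXTRA_BYTES_FOR_BOMBS]
    return bayes_window, bomb_window


def bomb_hit(message: bytes) -> bool:
    """True if the bomb pattern matches anywhere in the larger window."""
    _, bomb_window = scan_windows(message)
    return BOMB_RE.search(bomb_window) is not None
```

In this sketch the extra cost is one regex pass over the additional bytes
per arriving message; the data used for the rebuild does not change.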

What do you think?

And while we're talking about MaxBytes:
I've asked this before: is 3 KB for MaxBytes, once there's a mature corpus,
still a valid recommendation?  With unlimited horsepower and RAM, sure, why
not use 30 KB or 100 KB.  That's not my reality, so I want to see where
best to allocate resources.  If 3 KB is still the guidance, even though the
spam files I'm seeing have a median size of around 20 KB, so be it.  I feel
like HTML wasn't used as prolifically in spam when that guidance was
written.  The median size of notspam in my corpus is about 40 KB.  That's
determined unscientifically, by sorting by size and scrolling approximately
halfway down.
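
A less hand-wavy way to get those medians would be something like the
following Python sketch.  The function name median_size is my own, and the
folder you pass in is whatever directory holds your corpus files; adjust it
to your installation.

```python
import statistics
from pathlib import Path


def median_size(folder: str) -> float:
    """Median size in bytes of the regular files directly inside folder."""
    sizes = [p.stat().st_size for p in Path(folder).iterdir() if p.is_file()]
    if not sizes:
        raise ValueError(f"no files found in {folder}")
    return statistics.median(sizes)
```

Running it once against the spam folder and once against the notspam folder
would give the two numbers to compare with MaxBytes.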

Thanks.  Have a good weekend.
Ken
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test