[Bug 7022] normalize_charset

bugzilla-daemon Wed, 12 Mar 2014 15:48:07 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7022


--- Comment #7 from John Hardin <[email protected]> ---
(In reply to Ivo Truxa from comment #6)
> There is also the possibility to append the ASCII normalization after the
> Unicode version (or the original). That would satisfy both needs, but would
> increase the memory needs and the database growth.

I'd recommend against doing that. A pattern could then match once in the
original text and again in the appended copy, which could have serious
negative effects on "tflags multiple" rules.

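To illustrate the concern, here is a minimal Python sketch (not SpamAssassin
code) that counts pattern hits the way a multiple-hit rule might. The helpers
to_ascii and count_hits and the sample text are assumptions made up for this
example.

```python
import re
import unicodedata


def to_ascii(text):
    # Hypothetical ASCII normalization: decompose, then drop non-ASCII marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def count_hits(pattern, text):
    # Count every non-overlapping match, as a multiple-hit rule would.
    return len(re.findall(pattern, text))


body = "viagra zdarma, viagra levn\u011b"
# The proposal under discussion: append the ASCII copy after the original.
appended = body + "\n" + to_ascii(body)

original_hits = count_hits(r"viagra", body)
appended_hits = count_hits(r"viagra", appended)
```

Under these assumptions the appended text yields twice as many hits as the
original, so any score or threshold tuned to hit counts would shift.
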
I think a better approach would be to keep the unnormalized and normalized
versions separate in memory, and to run rules against the normalized version
by default unless they have a tflag specifying that they should run against
the unnormalized version.

That way the relatively few rules that look for accent obfuscation can detect
it, while the majority of rules (and bayes) get the normalized version and
yield better overall results. There would be memory impact, but little
additional impact on the scan time, and bayes wouldn't double-token.

As an efficiency hack, the unnormalized version could be discarded after
normalization if no active rule has the tflag requesting the unnormalized
message text. That would minimize the memory impact.
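
The two-buffer idea, including the discard step, could be sketched roughly as
follows. This is Python for illustration only, not SpamAssassin's Perl
internals. The flag name "unnormalized", the Rule and Message classes, and the
sample rules are all hypothetical.

```python
import re
import unicodedata
from dataclasses import dataclass, field

# Hypothetical flag name; not an existing SpamAssassin tflag.
RAW_FLAG = "unnormalized"


@dataclass
class Rule:
    name: str
    pattern: str
    tflags: set = field(default_factory=set)


def to_ascii(text):
    # Hypothetical ASCII normalization: decompose, then drop non-ASCII marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


class Message:
    def __init__(self, raw_body, rules):
        self.normalized = to_ascii(raw_body)
        # Keep the raw text only if some active rule asks for it.
        needs_raw = any(RAW_FLAG in rule.tflags for rule in rules)
        self.raw = raw_body if needs_raw else None

    def text_for(self, rule):
        # Rules see the normalized text unless they carry the flag.
        if RAW_FLAG in rule.tflags and self.raw is not None:
            return self.raw
        return self.normalized


def run_rules(message, rules):
    # Return the names of the rules whose pattern matches their chosen text.
    return [r.name for r in rules if re.search(r.pattern, message.text_for(r))]


rules = [
    Rule("PLAIN_WORD", r"viagra"),
    Rule("ACCENT_OBFU", "v[\u00ec\u00ed\u00ee]agra", {RAW_FLAG}),
]
message = Message("levn\u00e1 v\u00edagra zdarma", rules)
hits = run_rules(message, rules)
```

Here the plain rule matches only because it sees the normalized text, and the
obfuscation rule matches only because it sees the raw text. When no rule
carries the flag, Message never retains the raw copy.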

> However, the normalizing is optional, and the administrator can choose what
> is better for his case. In my case (the vast majority of email on the server
> is Czech, German or French with a big multitude of diverse charsets), I know
> I want the plain ASCII normalizing, already because writing the rules is a
> nightmare otherwise. But I am sure that many other administrators will opt
> for Unicode, or no normalizing at all.

Right. So the tflag for "run against unnormalized message body" would have no
effect on a rule if normalizing were disabled.

An admin might also disable it to avoid the memory pressure and/or performance
hit from normalization.
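
As a rough sketch of that interaction (Python for illustration, with the
function text_for_rule and its parameters being assumptions rather than
anything in SpamAssassin): when normalization is off, only the original text
exists, so the flag cannot change what a rule sees.

```python
import unicodedata


def to_ascii(text):
    # Hypothetical ASCII normalization: decompose, then drop non-ASCII marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def text_for_rule(raw_body, wants_unnormalized, normalize_enabled):
    # With normalization disabled no second copy is ever built,
    # so every rule gets the raw text and the flag is a no-op.
    if not normalize_enabled:
        return raw_body
    # With normalization enabled the flag selects the raw text.
    if wants_unnormalized:
        return raw_body
    return to_ascii(raw_body)


body = "p\u0159\u00edli\u0161 levn\u00e9"
flagged_off = text_for_rule(body, True, False)
unflagged_off = text_for_rule(body, False, False)
unflagged_on = text_for_rule(body, False, True)
```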

This sounds like it might be a big improvement.

-- 
You are receiving this mail because:
You are the assignee for the bug.
