https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7022
--- Comment #7 from John Hardin <[email protected]> ---
(In reply to Ivo Truxa from comment #6)
> There is also the possibility to append the ASCII normalization after the
> Unicode version (or the original). That would satisfy both needs, but would
> increase the memory needs and the database growth.

I'd recommend against doing that. That could have serious negative effects on
"tflags multiple" rules.

I think a better approach would be to keep both unnormalized and normalized
versions separate in memory, and rules would be run against the normalized
version by default unless they had a tflag specifying they should run against
the unnormalized version.

That way the relatively few rules that look for accent obfuscation can detect
it, while the majority of rules (and bayes) get the normalized version and
yield better overall results.

There would be memory impact, but little additional impact on the scan time,
and bayes wouldn't double-token.

As an efficiency hack, the unnormalized version could be discarded after
normalization if there were no active rules that had the tflag specifying to
run against the unnormalized message text. That would minimize the memory
impact.

> However, the normalizing is optional, and the administrator can choose what
> is better for his case. In my case (the vast majority of email on the server
> is Czech, German or French with a big multitude of diverse charsets), I know
> I want the plain ASCII normalizing, already because writing the rules is a
> nightmare otherwise. But I am sure that many other administrators will opt
> for Unicode, or no normalizing at all.

Right. So the tflag for "run against unnormalized message body" would not have
any effect on the rule if normalizing was disabled. An admin might also
disable it to avoid the memory pressure and/or performance hit from
normalization.

This sounds like it might be a big improvement.

-- 
You are receiving this mail because:
You are the assignee for the bug.
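
To make the proposal above concrete, here is a rough sketch of the two-copy idea. It is in Python rather than Perl and is not SpamAssassin code. The tflag name "unnormalized", the Rule class and the helpers strip_accents, select_bodies and run_rules are all hypothetical names invented for illustration, and stripping combining marks after NFKD decomposition is only a stand-in for whatever ASCII normalization would actually be used.

```python
import re
import unicodedata
from dataclasses import dataclass

# Hypothetical tflag name; the comment above does not propose a concrete one.
RAW_FLAG = "unnormalized"


@dataclass(frozen=True)
class Rule:
    """Illustrative stand-in for a body rule: a name, a regex, and its tflags."""
    name: str
    pattern: str
    tflags: frozenset = frozenset()


def strip_accents(text):
    """Assumed normalization: decompose (NFKD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def select_bodies(text, rules):
    """Return (normalized, raw). raw is None when no active rule asks for it,
    which is the 'discard the unnormalized copy' efficiency hack."""
    normalized = strip_accents(text)
    needs_raw = any(RAW_FLAG in rule.tflags for rule in rules)
    return normalized, (text if needs_raw else None)


def run_rules(text, rules, normalize=True):
    """Return the names of the rules that hit, in rule order."""
    if not normalize:
        # Normalization disabled: only one copy exists, so the flag is a no-op.
        return [r.name for r in rules if re.search(r.pattern, text)]
    normalized, raw = select_bodies(text, rules)
    hits = []
    for rule in rules:
        # Default target is the normalized copy; flagged rules see the original.
        target = raw if RAW_FLAG in rule.tflags else normalized
        if re.search(rule.pattern, target):
            hits.append(rule.name)
    return hits


rules = [
    Rule("FREE_WORD", "free"),
    Rule("ACCENT_OBFU", "fr[\u00e8\u00e9\u00ea\u00eb]e", frozenset({RAW_FLAG})),
]
message = "Get your fr\u00e9e gift"
print(run_rules(message, rules))
print(run_rules(message, rules, normalize=False))
```

With normalization on, the plain rule sees "free" in the normalized copy while the flagged rule still sees the accented original. With normalization off, both rules run against the single unmodified copy and the flag changes nothing.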
