[PR] Improve A-Z replace_tag definitions [spamassassin]

via GitHub Sat, 18 Oct 2025 03:02:01 -0700


fkoyer opened a new pull request, #19:
URL: https://github.com/apache/spamassassin/pull/19


   Problems with old definitions:
   
   * They try to match UTF-8 and Latin-1 characters in the same expression, e.g. \<A\> 
includes the byte sequences for "ã" in both Latin-1 (\xE3) and UTF-8 (\xC3\xA3). This 
seems like a good thing at first, but it can cause false positives: if the text 
is UTF-8, a byte meant to match a Latin-1 character can instead match part of an 
unrelated multi-byte sequence.
   * They contain redundant characters, e.g. \xE3 appears multiple times in \<A\>.
   * They contain unnecessary characters, e.g. \xE3 also appears in \<V\> and \<Y\>.
   * The patterns are case-insensitive, e.g. \<I\> is meant to match lowercase L, 
but because it is case-insensitive it also matches uppercase L.
   * Some look-alike characters aren't matched, e.g. \xEA\x93\xAE = LISU LETTER 
A (U+A4EE).
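
To illustrate the first problem, here is a minimal Python sketch (not the actual SpamAssassin rules, which are Perl regexes). `MIXED_A` and `UTF8_A` are hypothetical, heavily simplified stand-ins for an old-style and a UTF-8-only \<A\> definition. The Latin-1 byte \xE3 is also the lead byte of many three-byte UTF-8 sequences, so the mixed pattern fires on unrelated UTF-8 text:

```python
import re

# Hypothetical, simplified stand-in for an old-style <A> definition:
# ASCII "A"/"a", the Latin-1 byte for "ã" (\xE3), or the UTF-8 bytes
# for "ã" (\xC3\xA3).
MIXED_A = re.compile(rb"(?:[Aa\xE3]|\xC3\xA3)")

# Hypothetical UTF-8-only variant: the lone Latin-1 byte is gone.
UTF8_A = re.compile(rb"(?:[Aa]|\xC3\xA3)")

# HIRAGANA LETTER A (U+3042) encodes to b"\xe3\x81\x82" in UTF-8,
# so its first byte is the same as Latin-1 "ã".
hiragana = "\u3042".encode("utf-8")

mixed_hit = MIXED_A.search(hiragana) is not None  # matches the lead byte
utf8_hit = UTF8_A.search(hiragana) is not None    # no match

# Both variants still match a genuine UTF-8 "ã".
a_tilde = "\u00e3".encode("utf-8")
mixed_real = MIXED_A.search(a_tilde) is not None
utf8_real = UTF8_A.search(a_tilde) is not None
```

The trade-off of the UTF-8-only variant is that Latin-1 encoded text containing "ã" is no longer matched by this alternative.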
   
   Changes:
   
   * All byte sequences are UTF-8 only (no Latin-1)
   * All patterns are case-sensitive 
   * Removed redundant and unnecessary characters
   * Added additional look-alike characters
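
As a rough illustration of the case-sensitivity and look-alike changes, the following Python sketch uses hypothetical, simplified patterns (`I_CLASS` and `A_PATTERN` are assumptions made for this example, not the definitions from this PR):

```python
import re

# Hypothetical simplified <I>: uppercase I, digit 1, lowercase l, pipe.
I_CLASS = rb"[I1l|]"

loose = re.compile(I_CLASS, re.IGNORECASE)  # old behaviour: case-insensitive
strict = re.compile(I_CLASS)                # new behaviour: case-sensitive

# Under IGNORECASE, the "l" in the class also accepts "L".
loose_matches_upper_l = loose.fullmatch(b"L") is not None
strict_matches_upper_l = strict.fullmatch(b"L") is not None
strict_matches_lower_l = strict.fullmatch(b"l") is not None

# U+A4EE encodes to the three bytes \xEA\x93\xAE in UTF-8.
LISU_A = "\uA4EE".encode("utf-8")

# Hypothetical simplified <A> extended with that look-alike.
A_PATTERN = re.compile(rb"(?:A|\xEA\x93\xAE)")
lisu_matches = A_PATTERN.fullmatch(LISU_A) is not None
lower_a_matches = A_PATTERN.fullmatch(b"a") is not None
```

With case-sensitive patterns, each tag has to list the upper- and lowercase forms it intends to match explicitly.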
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

