[Bug 6042] Malformed UTF-8 character

bugzilla-daemon Sat, 25 Sep 2010 11:15:58 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6042


John Hardin <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #3 from John Hardin <[email protected]> 2010-09-25 14:15:30 UTC ---
(In reply to comment #2)
> Help, I get warn: Malformed UTF-8 character (unexpected continuation byte
> 0xac, with no preceding start byte) in pattern match (m//) at
> /home/jidanni/.spamassassin/user_prefs, rule J_BODY_US_BIG5, line 1.
> And that rule is not meant to be UTF-8 at all. That rule is
> body J_BODY_US_BIG5
> /\xBFn\xA5\xFD\xA5\xCD|\xA4\xA6(\xA5\xA7|\xA5\xFD\xA5\xCD)|\xAC\xD5\xA5\xC9|\xA4G\xAB\xD7\xA4\xC0\xB1a|\xAA\xEA\xA4l\xA4s|\xBD\xBA\xB6\xE9/

Perl can sometimes misread REs like that as UTF-8, and it doesn't do so
consistently either.

The safest thing to do when coding a run of 8-bit characters like that is to
enclose each 8-bit byte in its own square brackets, making it a one-member
character class. This prevents Perl from trying to interpret adjacent bytes as
a single UTF-8 character. For example:


body J_BODY_US_BIG5
/[\xBF]n[\xA5][\xFD][\xA5][\xCD]|[\xA4][\xA6](?:[\xA5][\xA7]|[\xA5][\xFD][\xA5][\xCD])|[\xAC][\xD5][\xA5][\xC9]|[\xA4]G[\xAB][\xD7][\xA4][\xC0][\xB1]a|[\xAA][\xEA][\xA4]l[\xA4]s|[\xBD][\xBA][\xB6][\xE9]/
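
Bracketing a long rule by hand is error-prone, so here is a rough Python sketch
of doing it mechanically. It is not part of SpamAssassin; the names
bracket_high_bytes and TOKEN are made up for illustration. It assumes the rule
uses \xHH escapes for every 8-bit byte, and its handling of existing character
classes is simplified (it does not understand POSIX classes such as
[[:alpha:]]).

```python
import re

# Tokens we care about, tried in this order at each position:
TOKEN = re.compile(r"""
    (?P<cls>\[(?:\\.|[^\]\\])*\])        # an existing character class: copy as-is
  | (?P<high>\\x[89A-Fa-f][0-9A-Fa-f])   # a bare escape for a byte in 0x80-0xFF
  | (?P<esc>\\.)                         # any other backslash escape: copy as-is
""", re.VERBOSE)


def bracket_high_bytes(pattern: str) -> str:
    """Wrap each bare high-bit \\xHH escape in its own character class.

    ASCII literals, low-byte escapes and existing classes are left alone,
    so running the function on its own output changes nothing.
    """
    def wrap(match: re.Match) -> str:
        if match.group("high"):
            return "[" + match.group("high") + "]"
        return match.group(0)

    return TOKEN.sub(wrap, pattern)


if __name__ == "__main__":
    # The ASCII letter stays outside the brackets.
    print(bracket_high_bytes(r"\xBFn\xA5\xFD"))  # prints [\xBF]n[\xA5][\xFD]
```

Paste the RE between the slashes of the body rule into the function and copy
the result back. Groups and alternation pass through untouched, so whether to
use ( or (?: is still up to you.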

-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
