Re: UTF8 character in [] doesn't match

Matus UHLAR - fantomas Mon, 24 Dec 2018 04:26:43 -0800

> On Sun, 23 Dec 2018 20:04:28 +0100
> Matus UHLAR - fantomas wrote:
> > I have tried to create a rule that will match the names "ján" and
> > "jano" (John and Johnny in Slovak).
> >
> > I have created this rule:
> >
> > body     LOCAL_JANO      /\bJ[aá]no\b/i
> >
> > However, it does not match.
> >
> > Apparently the [á] does not match even when normalize_charset is set
> > to '1'.
> >
> > Any idea what can cause this?

On Sun, Dec 23, 2018 at 11:11:39PM +0000, RW wrote:
> normalize_charset converts to UTF-8 but the tests are still done on
> bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> of [aá].


Yes, luckily a variation of this works as expected...
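
RW's point about bytes can be reproduced outside SpamAssassin. The following is
a minimal Python sketch, not SpamAssassin's own Perl matching code. It assumes
that both the rule text and the message body are UTF-8 byte strings, and it
leaves out \b and /i for brevity. The helper name hits is made up for the
illustration.

```python
import re

def hits(rule_text: str, body_text: str) -> bool:
    """Match a UTF-8 encoded pattern against a UTF-8 encoded body, byte by byte."""
    pattern = re.compile(rule_text.encode("utf-8"))
    return pattern.search(body_text.encode("utf-8")) is not None

# "á" is the two bytes C3 A1 in UTF-8, so "[aá]" becomes a class of three
# single bytes (a, 0xC3, 0xA1) and can only consume one of them.
class_hit = hits("J[aá]no", "Jáno")

# "(?:a|á)" keeps C3 A1 together as one two-byte alternative.
alt_hit = hits("J(?:a|á)no", "Jáno")

print(class_hit, alt_hit)  # False True
```

Under these assumptions the bracket form fails on "Jáno" while the alternation
matches, which is the behaviour described above.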

On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:

Nope, that makes no difference.  If the message is UTF-8, neither of them will
match anyway.

The only correct solution is (?:[aá]|\xc3\xa1), which works for any message in
latin1 or UTF-8, regardless of the normalize_charset setting.  Config contents
are never converted to anything, so you need to make sure the regex contains
the raw bytes for both cases.

# \xe1 is á (U+00E1), written as an escape so the terminal encoding does not matter
perl -MEncode -e 'print unpack("H*", encode("UTF-8", "\xe1")), "\n"'
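
As a rough cross-check of the combined pattern, here is a small Python sketch
(again, not SpamAssassin itself). It assumes the message body reaches the rule
as raw bytes in either latin1 or UTF-8, and it writes the latin1 á as the
escape \xe1 so that the pattern stays pure ASCII whatever encoding the file is
saved in. The names RULE and body_matches are invented for the example.

```python
import re

# Byte pattern: [a\xe1] covers "a" and latin1 "á"; \xc3\xa1 is UTF-8 "á".
RULE = re.compile(rb"\bJ(?:[a\xe1]|\xc3\xa1)no\b", re.IGNORECASE)

def body_matches(text: str, encoding: str) -> bool:
    """Encode the text as a message in the given charset and test the rule."""
    return RULE.search(text.encode(encoding)) is not None

# Hex of the UTF-8 bytes of á, the value the rule needs as \xc3\xa1.
utf8_hex = "á".encode("utf-8").hex()

for sample, charset in [("Ahoj Jáno!", "latin-1"), ("Ahoj Jáno!", "utf-8"),
                        ("ahoj jano", "utf-8"), ("Ahoj Juno", "utf-8")]:
    print(charset, repr(sample), body_matches(sample, charset))
```

In this sketch the first three samples match and "Ahoj Juno" does not, so one
pattern covers both encodings of the message.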

No one should use normalize_charset until things are possibly simplified for
4.0.0; it's a terrible mess.  It was also broken for HTML, as per bug 7656.


On 24.12.18 10:18, Henrik K wrote:

Oh, and SpamAssassin assumes config files are latin1/bytes, so do not use
UTF-8 encoding for the config files themselves; it will only create more
confusion.


I assumed normalize_charset expects rules in UTF-8; otherwise it would be a
little strange...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!
