Re: UTF8 character in [] doesn't match
On Sun, 23 Dec 2018, Matus UHLAR - fantomas wrote:

>> While it doesn't directly answer your question about normalize-charset,
>> this might work a little better:
>>
>> ifplugin Mail::SpamAssassin::Plugin::ReplaceTags
>>   body          LOCAL_JANO  /\bj<A>no?\b/i
>>   replace_rules LOCAL_JANO
>> endif
>
> body          LOCAL_JANO  /\bj<A>no?\b/i
> replace_rules LOCAL_JANO
>
> unfortunately, this doesn't work for Ján, only for Jano.

I find that odd.

Okay, a review of the ReplaceTags replacements shows that a lot of basic
UTF-8 codepoints are missing - I wasn't thorough enough the last time I
added sequences. Fixing (most of) the omissions...

That's better:

  Dec 24 12:21:58.354 [11460] dbg: rules-all: running body rule LOCAL_JANO
  Dec 24 12:21:58.355 [11460] dbg: rules: ran body rule LOCAL_JANO ==> got hit: "Ján"
  Dec 24 12:21:58.355 [11460] dbg: rules: ran body rule LOCAL_JANO ==> got hit: "Jano"

--
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
"Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
does quite what I want. I wish Christopher Robin was here."
  -- Peter da Silva in a.s.r
---
Tomorrow: Christmas
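The effect described in this message can be illustrated with a toy model of tag expansion. This is only a sketch, written in Python rather than Perl, and it is not the plugin's code: the two tag tables and the helper names (`expand_tags`, `hits`) are hypothetical, and the real replacement lists are much longer. It assumes a placeholder of the form <A> is swapped for a byte-level expression before the rule is compiled.

```python
import re

NAME_ACUTE = "J\u00e1n"  # "Ján" with a precomposed á (U+00E1)

# Hypothetical, heavily shortened tag tables (illustration only). The "before"
# table knows only single-byte variants of "a"; the "after" table also lists
# the two-byte UTF-8 sequence for "á".
TAGS_BEFORE = {b"A": rb"[a\xe1]"}
TAGS_AFTER = {b"A": rb"(?:[a\xe1]|\xc3\xa1)"}


def expand_tags(pattern: bytes, tags: dict) -> bytes:
    """Replace each <NAME> placeholder with its expression from the tag table."""
    def substitute(match):
        # Unknown tags are left untouched.
        return tags.get(match.group(1), match.group(0))
    return re.sub(rb"<([A-Z0-9]+)>", substitute, pattern)


def hits(tags: dict, text: str) -> bool:
    """Compile the expanded rule and test it against UTF-8 encoded text."""
    rule = re.compile(expand_tags(rb"\bj<A>no?\b", tags), re.IGNORECASE)
    return rule.search(text.encode("utf-8")) is not None


before = (hits(TAGS_BEFORE, "Jano"), hits(TAGS_BEFORE, NAME_ACUTE))
after = (hits(TAGS_AFTER, "Jano"), hits(TAGS_AFTER, NAME_ACUTE))
print(before, after)
```

With the shorter table only "Jano" is found in UTF-8 text; once the UTF-8 byte sequence is listed, both names are found, which mirrors the debug output quoted above.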
Re: UTF8 character in [] doesn't match
On Mon, Dec 24, 2018 at 06:48:51PM +, RW wrote:
> On Mon, 24 Dec 2018 10:16:58 +0200
> Henrik K wrote:
>
> > On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > > On Sun, 23 Dec 2018 20:04:28 +0100
> > > Matus UHLAR - fantomas wrote:
> > >
> > > > Hello,
> > > >
> > > > I have tried to create rule that will match names "ján" and
> > > > "jano" (john and johnny in slovak languages).
> > > >
> > > > I have created rule:
> > > >
> > > > body LOCAL_JANO /\bJ[aá]no\b/i
> > > >
> > > > however, it does not match.
> > > >
> > > > Apparently the [á] does not match even when normalize_charset is
> > > > set to '1'.
> > > >
> > > > any idea what can cause this?
> > >
> > > normalize_charset converts to UTF-8 but the tests are still done on
> > > bytes, so á isn't a character, it's a string. You need (?:a|á)
> > > instead of [aá].
> >
> > Nope, that makes no difference. If message is UTF-8, neither of them
> > will match anyway.
>
> I don't see why it wouldn't if the rules are edited in UTF-8, and the
> text is in or converted to UTF-8.

Might or might not. I'd still advocate using portable, sure-to-work methods.
Re: UTF8 character in [] doesn't match
On Mon, 24 Dec 2018 10:16:58 +0200
Henrik K wrote:

> On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > On Sun, 23 Dec 2018 20:04:28 +0100
> > Matus UHLAR - fantomas wrote:
> >
> > > Hello,
> > >
> > > I have tried to create rule that will match names "ján" and
> > > "jano" (john and johnny in slovak languages).
> > >
> > > I have created rule:
> > >
> > > body LOCAL_JANO /\bJ[aá]no\b/i
> > >
> > > however, it does not match.
> > >
> > > Apparently the [á] does not match even when normalize_charset is
> > > set to '1'.
> > >
> > > any idea what can cause this?
> >
> > normalize_charset converts to UTF-8 but the tests are still done on
> > bytes, so á isn't a character, it's a string. You need (?:a|á)
> > instead of [aá].
>
> Nope, that makes no difference. If message is UTF-8, neither of them
> will match anyway.

I don't see why it wouldn't if the rules are edited in UTF-8, and the
text is in or converted to UTF-8.
Re: UTF8 character in [] doesn't match
> On Sun, 23 Dec 2018 20:04:28 +0100
> Matus UHLAR - fantomas wrote:
> > I have tried to create rule that will match names "ján" and
> > "jano" (john and johnny in slovak languages).
> >
> > I have created rule:
> >
> > body LOCAL_JANO /\bJ[aá]no\b/i
> >
> > however, it does not match.
> >
> > Apparently the [á] does not match even when normalize_charset is set
> > to '1'.
> >
> > any idea what can cause this?

On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> normalize_charset converts to UTF-8 but the tests are still done on
> bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> of [aá].

Yes, luckily a variation of this works as expected...

On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> Nope, that makes no difference. If message is UTF-8, neither of them
> will match anyway.
>
> The only correct solution: (?:[aá]|\xc3\xa1) works for any message in
> latin1 or utf8, regardless of normalize_charset setting. Config contents
> are never converted to anything, you need to make sure regex contains
> raw bytes for both cases.
>
> perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
>
> No one should use normalize_charset until things are possibly simplified
> for 4.0.0, it's a terrible mess. Also it was broken for html as per bug
> 7656.

On 24.12.18 10:18, Henrik K wrote:
> Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
> UTF-8 encoding on config files itself, it will only create more
> confusion.

I assumed normalize_charset expects rules in UTF-8; otherwise it would be
a little strange ...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!
Re: UTF8 character in [] doesn't match
On Monday, December 24, 2018, 9:49:11 AM GMT+1, Henrik K wrote:

> ... so for general file portability this would be even better:
>
> (?:[a\xe1]|\xc3\xa1)

I fully agree with Henrik, but would add a small detail: in some cases I
have found problems using body rules to locate special chars (most likely,
to my understanding, due to how the HTML parser handles words). Using
rawbody wherever possible gives me better results.

> Merry Christmas all. ;-)

Thanks Henrik... the same to you and everybody...

PedroD
Re: UTF8 character in [] doesn't match
On Mon, Dec 24, 2018 at 10:18:31AM +0200, Henrik K wrote:
> On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> > On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > > On Sun, 23 Dec 2018 20:04:28 +0100
> > > Matus UHLAR - fantomas wrote:
> > >
> > > > Hello,
> > > >
> > > > I have tried to create rule that will match names "ján" and
> > > > "jano" (john and johnny in slovak languages).
> > > >
> > > > I have created rule:
> > > >
> > > > body LOCAL_JANO /\bJ[aá]no\b/i
> > > >
> > > > however, it does not match.
> > > >
> > > > Apparently the [á] does not match even when normalize_charset is set
> > > > to '1'.
> > > >
> > > > any idea what can cause this?
> > >
> > > normalize_charset converts to UTF-8 but the tests are still done on
> > > bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> > > of [aá].
> >
> > Nope, that makes no difference. If message is UTF-8, neither of them will
> > match anyway.
> >
> > The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
> > or utf8, regardless of normalize_charset setting. Config contents are never
> > converted to anything, you need to make sure regex contains raw bytes for
> > both cases.
> >
> > perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
> >
> > No one should use normalize_charset until things are possibly simplified for
> > 4.0.0, it's a terrible mess. Also it was broken for html as per bug 7656.
>
> Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
> UTF-8 encoding on config files itself, it will only create more confusion.

... so for general file portability this would be even better:

(?:[a\xe1]|\xc3\xa1)

Merry Christmas all. ;-)
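As a rough check of the escaped form, the sketch below tries it against both encodings of a few names. It is written in Python with byte patterns as a stand-in for byte-level matching in Perl. The surrounding `\bJ...no?\b` shape and the case-insensitive flag are assumptions carried over from the rule discussed in the thread, and the test names are made up for illustration.

```python
import re

NAME_ACUTE = "J\u00e1n"  # "Ján" with a precomposed á (U+00E1)

# The escaped alternative from the message, embedded in the rule shape used
# elsewhere in the thread (an assumption, not a tested SpamAssassin rule).
PATTERN = rb"\bJ(?:[a\xe1]|\xc3\xa1)no?\b"
rule = re.compile(PATTERN, re.IGNORECASE)

# The rule source is pure ASCII, so it is the same bytes whether the rule
# file is saved as Latin-1 or as UTF-8.
source_is_ascii = PATTERN.isascii()

# Try every name in both encodings a message might arrive in.
results = {
    (name, encoding): rule.search(name.encode(encoding)) is not None
    for name in ("Jan", NAME_ACUTE, "Jano", "Jon")
    for encoding in ("latin-1", "utf-8")
}
for key, hit in results.items():
    print(key, hit)
```

Because nothing in the pattern depends on how the editor saved the file, the same rule text behaves identically on either kind of message.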
Re: UTF8 character in [] doesn't match
On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > On Sun, 23 Dec 2018 20:04:28 +0100
> > Matus UHLAR - fantomas wrote:
> >
> > > Hello,
> > >
> > > I have tried to create rule that will match names "ján" and
> > > "jano" (john and johnny in slovak languages).
> > >
> > > I have created rule:
> > >
> > > body LOCAL_JANO /\bJ[aá]no\b/i
> > >
> > > however, it does not match.
> > >
> > > Apparently the [á] does not match even when normalize_charset is set
> > > to '1'.
> > >
> > > any idea what can cause this?
> >
> > normalize_charset converts to UTF-8 but the tests are still done on
> > bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> > of [aá].
>
> Nope, that makes no difference. If message is UTF-8, neither of them will
> match anyway.
>
> The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
> or utf8, regardless of normalize_charset setting. Config contents are never
> converted to anything, you need to make sure regex contains raw bytes for
> both cases.
>
> perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
>
> No one should use normalize_charset until things are possibly simplified for
> 4.0.0, it's a terrible mess. Also it was broken for html as per bug 7656.

Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
UTF-8 encoding on config files itself, it will only create more confusion.
Re: UTF8 character in [] doesn't match
On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> On Sun, 23 Dec 2018 20:04:28 +0100
> Matus UHLAR - fantomas wrote:
>
> > Hello,
> >
> > I have tried to create rule that will match names "ján" and
> > "jano" (john and johnny in slovak languages).
> >
> > I have created rule:
> >
> > body LOCAL_JANO /\bJ[aá]no\b/i
> >
> > however, it does not match.
> >
> > Apparently the [á] does not match even when normalize_charset is set
> > to '1'.
> >
> > any idea what can cause this?
>
> normalize_charset converts to UTF-8 but the tests are still done on
> bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> of [aá].

Nope, that makes no difference. If message is UTF-8, neither of them will
match anyway.

The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
or utf8, regardless of normalize_charset setting. Config contents are never
converted to anything, you need to make sure regex contains raw bytes for
both cases.

perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'

No one should use normalize_charset until things are possibly simplified for
4.0.0, it's a terrible mess. Also it was broken for html as per bug 7656.
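For readers who want to look up the raw bytes for other characters, here is a small Python counterpart of the perl one-liner in this message. It is only a sketch: the helper names `hex_bytes` and `regex_escape` are made up for illustration, and the character is written as a code point escape so the result does not depend on how this text itself was saved.

```python
A_ACUTE = "\u00e1"  # "á" as a single precomposed code point


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as a hex string."""
    return text.encode(encoding).hex()


def regex_escape(text: str, encoding: str) -> str:
    """Spell the encoded bytes as \\xNN escapes, ready to paste into a rule."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode(encoding))


utf8_hex = hex_bytes(A_ACUTE, "utf-8")        # two bytes in UTF-8
latin1_hex = hex_bytes(A_ACUTE, "latin-1")    # one byte in Latin-1
utf8_escape = regex_escape(A_ACUTE, "utf-8")  # escape text for the UTF-8 branch
print(utf8_hex, latin1_hex, utf8_escape)
```

The UTF-8 result is the `\xc3\xa1` sequence used in the suggested pattern; the Latin-1 result is the single byte that a literal á inside the character class stands for when the rule file is saved as Latin-1.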
Re: UTF8 character in [] doesn't match
On Sun, 23 Dec 2018 20:04:28 +0100
Matus UHLAR - fantomas wrote:

> Hello,
>
> I have tried to create rule that will match names "ján" and
> "jano" (john and johnny in slovak languages).
>
> I have created rule:
>
> body LOCAL_JANO /\bJ[aá]no\b/i
>
> however, it does not match.
>
> Apparently the [á] does not match even when normalize_charset is set
> to '1'.
>
> any idea what can cause this?

normalize_charset converts to UTF-8 but the tests are still done on
bytes, so á isn't a character, it's a string. You need (?:a|á) instead
of [aá].

I think this may be changing in 4.
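The difference between a character class and an alternation under byte-wise matching can be sketched as follows. This uses Python byte patterns as a stand-in for the Perl behaviour, and it assumes that both the rule text and the message are UTF-8 bytes, which may not hold for a given SpamAssassin setup.

```python
import re

A_ACUTE = "\u00e1"  # "á" as a single precomposed code point

# Assumption: rule text and message are both UTF-8 and compared as bytes.
message = f"J{A_ACUTE}n".encode("utf-8")  # b'J\xc3\xa1n'

# A character class consumes exactly ONE byte, so the two bytes of "á"
# become two separate class members and the class cannot cover both.
class_rule = re.compile(f"J[a{A_ACUTE}]n".encode("utf-8"))

# An alternation keeps the two bytes of "á" together as a sequence.
alt_rule = re.compile(f"J(?:a|{A_ACUTE})n".encode("utf-8"))

class_hit = class_rule.search(message) is not None
alt_hit = alt_rule.search(message) is not None
print(class_hit, alt_hit)
```

Under these assumptions the class fails because it stops after the first byte of the two-byte sequence, while the alternation succeeds.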
Re: UTF8 character in [] doesn't match
On Sun, 23 Dec 2018, Matus UHLAR - fantomas wrote:
> I have tried to create rule that will match names "ján" and "jano"
> (john and johnny in slovak languages).
>
> I have created rule:
>
> body LOCAL_JANO /\bJ[aá]no\b/i

fixed:

  body LOCAL_JANO /\bJ[aá]no?\b/i

> however, it does not match.

On 23.12.18 11:49, John Hardin wrote:
> The "o" is not optional in that RE, so it would never match "ján".

Sorry! I pasted it in the middle of editing, while trying to make it work
or narrow down the problem. The fixed version above didn't match either.

Even J[á]n did not match "Ján" - is there a problem with perl and utf8?

  # echo Ján | perl -ne 'if (/J[á]n/) {print "OK\n"} else { print "KO\n"}'
  KO

After consulting Google:

https://stackoverflow.com/questions/21092427/perl-regex-replace-with-utf-8-characters#21092679

I found this combination to work:

  # echo Ján | perl -Mutf8 -mopen=':std',':encoding(UTF-8)' -ne 'if (/J[á]n/) {print "OK\n"} else { print "KO\n"}'
  OK

Now, does SA use these "use"s above?

> While it doesn't directly answer your question about normalize-charset,
> this might work a little better:
>
> ifplugin Mail::SpamAssassin::Plugin::ReplaceTags
>   body          LOCAL_JANO  /\bj<A>no?\b/i
>   replace_rules LOCAL_JANO
> endif

  body          LOCAL_JANO  /\bj<A>no?\b/i
  replace_rules LOCAL_JANO

Unfortunately, this doesn't work for Ján, only for Jano.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"Where do you want to go to die?" [Microsoft]
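The two perl invocations in this message differ in whether the pattern and the input are handled as bytes or as decoded characters. A rough Python analogue of that contrast is sketched below; it is not a statement about what SpamAssassin does internally, and it assumes the input arrives as UTF-8.

```python
import re

A_ACUTE = "\u00e1"  # "á" as a single precomposed code point
raw = f"J{A_ACUTE}n\n".encode("utf-8")  # what arrives on stdin: b'J\xc3\xa1n\n'
source = f"J[{A_ACUTE}]n"               # the pattern as typed: J[á]n

# Like the first perl call: pattern and input are both left as UTF-8 bytes,
# so the class holds two separate bytes and can only consume one of them.
bytes_ok = re.search(source.encode("utf-8"), raw) is not None

# Like the second perl call: pattern and input are treated as characters,
# so the class holds the single character "á".
chars_ok = re.search(source, raw.decode("utf-8")) is not None

print("OK" if bytes_ok else "KO")
print("OK" if chars_ok else "KO")
```

The first check reports KO and the second OK, matching the two results shown in the message.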
Re: UTF8 character in [] doesn't match
On Sun, 23 Dec 2018, Matus UHLAR - fantomas wrote:

> Hello,
>
> I have tried to create rule that will match names "ján" and "jano"
> (john and johnny in slovak languages).
>
> I have created rule:
>
> body LOCAL_JANO /\bJ[aá]no\b/i
>
> however, it does not match.

The "o" is not optional in that RE, so it would never match "ján".

While it doesn't directly answer your question about normalize-charset,
this might work a little better:

  ifplugin Mail::SpamAssassin::Plugin::ReplaceTags
    body          LOCAL_JANO  /\bj<A>no?\b/i
    replace_rules LOCAL_JANO
  endif

--
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
jhar...@impsec.org  FALaholic #11174  pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
"Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
does quite what I want. I wish Christopher Robin was here."
  -- Peter da Silva in a.s.r
---
2 days until Christmas