Re: UTF8 character in [] doesn't match

2018-12-24 Thread John Hardin


On Sun, 23 Dec 2018, Matus UHLAR - fantomas wrote:

While it doesn't directly answer your question about normalize-charset, 
this might work a little better:


 ifplugin Mail::SpamAssassin::Plugin::ReplaceTags
   body   LOCAL_JANO   /\bj<A>no?\b/i
   replace_rules  LOCAL_JANO
 endif


body            LOCAL_JANO  /\bj<A>no?\b/i
replace_rules   LOCAL_JANO

unfortunately, this doesn't work for Ján, only for Jano.


I find that odd.

Okay, review of the replacetags replacements shows that a lot of basic 
UTF8 codepoints are missing - I wasn't thorough enough the last time I 
added sequences.


Fixing (most of) the omissions...

That's better:

Dec 24 12:21:58.354 [11460] dbg: rules-all: running body rule LOCAL_JANO
Dec 24 12:21:58.355 [11460] dbg: rules: ran body rule LOCAL_JANO ==> got hit: "Ján"
Dec 24 12:21:58.355 [11460] dbg: rules: ran body rule LOCAL_JANO ==> got hit: "Jano"


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
  does quite what I want. I wish Christopher Robin was here."
   -- Peter da Silva in a.s.r
---
 Tomorrow: Christmas

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K

On Mon, Dec 24, 2018 at 06:48:51PM +, RW wrote:
> On Mon, 24 Dec 2018 10:16:58 +0200
> Henrik K wrote:
> 
> > On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > > On Sun, 23 Dec 2018 20:04:28 +0100
> > > Matus UHLAR - fantomas wrote:
> > >   
> > > > Hello,
> > > > 
> > > > I have tried to create rule that will match names "ján" and
> > > > "jano" (john and johnny in slovak languages).
> > > > 
> > > > I have created rule:
> > > > 
> > > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > > 
> > > > however, it does not match.
> > > > 
> > > > Apparently the [á] does not match even when normalize_charset is
> > > > set to '1'.
> > > > 
> > > > any idea what can cause this?  
> > > 
> > > normalize_charset converts to UTF-8 but the tests are still done on
> > > bytes, so á isn't a character, it's a string. You need (?:a|á)
> > > instead of [aá].  
> > 
> > Nope, that makes no difference.  If message is UTF-8, neither of them
> > will match anyway.
> 
> I don't see why it wouldn't if the rules are edited in UTF-8, and the
> text is in or converted to UTF-8.

It might or might not. I'd still advocate using portable methods that are
sure to work.

Re: UTF8 character in [] doesn't match

2018-12-24 Thread RW

On Mon, 24 Dec 2018 10:16:58 +0200
Henrik K wrote:

> On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > On Sun, 23 Dec 2018 20:04:28 +0100
> > Matus UHLAR - fantomas wrote:
> >   
> > > Hello,
> > > 
> > > I have tried to create rule that will match names "ján" and
> > > "jano" (john and johnny in slovak languages).
> > > 
> > > I have created rule:
> > > 
> > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > 
> > > however, it does not match.
> > > 
> > > Apparently the [á] does not match even when normalize_charset is
> > > set to '1'.
> > > 
> > > any idea what can cause this?  
> > 
> > normalize_charset converts to UTF-8 but the tests are still done on
> > bytes, so á isn't a character, it's a string. You need (?:a|á)
> > instead of [aá].  
> 
> Nope, that makes no difference.  If message is UTF-8, neither of them
> will match anyway.

I don't see why it wouldn't if the rules are edited in UTF-8, and the
text is in or converted to UTF-8.

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Matus UHLAR - fantomas

> On Sun, 23 Dec 2018 20:04:28 +0100
> Matus UHLAR - fantomas wrote:
> > I have tried to create rule that will match names "ján" and
> > "jano" (john and johnny in slovak languages).
> >
> > I have created rule:
> >
> > body LOCAL_JANO  /\bJ[aá]no\b/i
> >
> > however, it does not match.
> >
> > Apparently the [á] does not match even when normalize_charset is set
> > to '1'.
> >
> > any idea what can cause this?

On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> normalize_charset converts to UTF-8 but the tests are still done on
> bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> of [aá].

Yes, luckily a variation of this works as expected...

On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:

Nope, that makes no difference.  If message is UTF-8, neither of them will
match anyway.

The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
or utf8, regardless of normalize_charset setting.  Config contents are never
converted to anything, you need to make sure regex contains raw bytes for
both cases.

perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'

No one should use normalize_charset until things are possibly simplified for
4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.

On 24.12.18 10:18, Henrik K wrote:

Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
UTF-8 encoding on config files itself, it will only create more confusion.

I assumed normalize_charset would expect rules in UTF-8; otherwise it would
be a little strange...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Pedro David Marco

 On Monday, December 24, 2018, 9:49:11 AM GMT+1, Henrik K  wrote:
 
 
>... so for general file portability this would be even better:
>
>(?:[a\xe1]|\xc3\xa1)

I fully agree with Henrik, but would add a small detail: in some cases I have
found problems using "body" rules to locate special characters (most likely,
to my understanding, because of how the HTML parser handles words). Using
"rawbody" wherever possible gives me better results.

>Merry Christmas all. ;-)

Thanks Henrik, and the same to you and everybody.

PedroD

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K

On Mon, Dec 24, 2018 at 10:18:31AM +0200, Henrik K wrote:
> On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> > On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > > On Sun, 23 Dec 2018 20:04:28 +0100
> > > Matus UHLAR - fantomas wrote:
> > > 
> > > > Hello,
> > > > 
> > > > I have tried to create rule that will match names "ján" and
> > > > "jano" (john and johnny in slovak languages).
> > > > 
> > > > I have created rule:
> > > > 
> > > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > > 
> > > > however, it does not match.
> > > > 
> > > > Apparently the [á] does not match even when normalize_charset is set
> > > > to '1'.
> > > > 
> > > > any idea what can cause this?
> > > 
> > > normalize_charset converts to UTF-8 but the tests are still done on
> > > bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> > > of [aá].
> > 
> > Nope, that makes no difference.  If message is UTF-8, neither of them will
> > match anyway.
> > 
> > The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
> > or utf8, regardless of normalize_charset setting.  Config contents are never
> > converted to anything, you need to make sure regex contains raw bytes for
> > both cases.
> > 
> > perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
> > 
> > No one should use normalize_charset until things are possibly simplified for
> > 4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.
> 
> Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
> UTF-8 encoding on config files itself, it will only create more confusion.

... so for general file portability this would be even better:

(?:[a\xe1]|\xc3\xa1)

Merry Christmas all. ;-)
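
A quick way to convince yourself that this alternation covers both encodings
is to run it against raw bytes outside SpamAssassin. The following is only a
Python sketch of the byte-wise matching, not SA itself; wrapping the fragment
in \bJ...no?\b is an assumption taken from the rule earlier in the thread,
and the sample strings are made up for illustration.

```python
import re

# Byte-wise pattern: [a\xe1] covers ASCII "a" and latin1 "á",
# \xc3\xa1 covers "á" encoded as UTF-8. The surrounding \bJ...no?\b
# is assumed from the LOCAL_JANO rule discussed in the thread.
rule = re.compile(rb"\bJ(?:[a\xe1]|\xc3\xa1)no?\b", re.IGNORECASE)

samples = {
    "Ján (latin1)": "J\u00e1n".encode("latin-1"),
    "Ján (utf-8)": "J\u00e1n".encode("utf-8"),
    "Jano (ascii)": b"Jano",
    "Jon (ascii)": b"Jon",
}

# True where the pattern hits, False otherwise
results = {name: bool(rule.search(data)) for name, data in samples.items()}

for name, hit in results.items():
    print(f"{name}: {'hit' if hit else 'no hit'}")
```

The first three samples hit and "Jon" does not. Note that with a byte
pattern the case-insensitive flag only folds ASCII letters in Python, so this
sketch says nothing about an upper-case "Á".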

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K

On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > On Sun, 23 Dec 2018 20:04:28 +0100
> > Matus UHLAR - fantomas wrote:
> > 
> > > Hello,
> > > 
> > > I have tried to create rule that will match names "ján" and
> > > "jano" (john and johnny in slovak languages).
> > > 
> > > I have created rule:
> > > 
> > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > 
> > > however, it does not match.
> > > 
> > > Apparently the [á] does not match even when normalize_charset is set
> > > to '1'.
> > > 
> > > any idea what can cause this?
> > 
> > normalize_charset converts to UTF-8 but the tests are still done on
> > bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> > of [aá].
> 
> Nope, that makes no difference.  If message is UTF-8, neither of them will
> match anyway.
> 
> The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
> or utf8, regardless of normalize_charset setting.  Config contents are never
> converted to anything, you need to make sure regex contains raw bytes for
> both cases.
> 
> perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
> 
> No one should use normalize_charset until things are possibly simplified for
> 4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.

Oh, and SpamAssassin assumes config files are latin1/bytes, so do not use
UTF-8 encoding for the config files themselves; it will only create more
confusion.

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K

On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> On Sun, 23 Dec 2018 20:04:28 +0100
> Matus UHLAR - fantomas wrote:
> 
> > Hello,
> > 
> > I have tried to create rule that will match names "ján" and
> > "jano" (john and johnny in slovak languages).
> > 
> > I have created rule:
> > 
> > body LOCAL_JANO  /\bJ[aá]no\b/i
> > 
> > however, it does not match.
> > 
> > Apparently the [á] does not match even when normalize_charset is set
> > to '1'.
> > 
> > any idea what can cause this?
> 
> normalize_charset converts to UTF-8 but the tests are still done on
> bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> of [aá].

Nope, that makes no difference.  If message is UTF-8, neither of them will
match anyway.

The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
or utf8, regardless of normalize_charset setting.  Config contents are never
converted to anything, you need to make sure regex contains raw bytes for
both cases.

perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'

No one should use normalize_charset until things are possibly simplified for
4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.
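
The same byte lookup can be sketched in Python for anyone who prefers it (the
choice of Python and the helper name byte_escapes are assumptions for
illustration, not part of SpamAssassin). Writing the character as a \u escape
keeps the result independent of the terminal or editor encoding.

```python
def byte_escapes(ch: str, encoding: str) -> str:
    """Encode ch and return its bytes written as \\xNN regex escapes.

    Hypothetical helper, named here only for this example.
    """
    return "".join(f"\\x{b:02x}" for b in ch.encode(encoding))


# U+00E1 is "á"; the escape avoids depending on the source encoding.
latin1 = byte_escapes("\u00e1", "latin-1")
utf8 = byte_escapes("\u00e1", "utf-8")

print("latin1:", latin1)   # latin1: \xe1
print("utf-8: ", utf8)     # utf-8:  \xc3\xa1
```

The two results are the raw-byte forms that a rule needs to list if it is
meant to match both latin1 and UTF-8 messages.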

Re: UTF8 character in [] doesn't match

2018-12-23 Thread RW

On Sun, 23 Dec 2018 20:04:28 +0100
Matus UHLAR - fantomas wrote:

> Hello,
> 
> I have tried to create rule that will match names "ján" and
> "jano" (john and johnny in slovak languages).
> 
> I have created rule:
> 
> body LOCAL_JANO  /\bJ[aá]no\b/i
> 
> however, it does not match.
> 
> Apparently the [á] does not match even when normalize_charset is set
> to '1'.
> 
> any idea what can cause this?

normalize_charset converts to UTF-8 but the tests are still done on
bytes, so á isn't a character, it's a string. You need (?:a|á) instead
of [aá].

I think this may be changing in 4.
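
To illustrate the difference between the class and the alternation, here is a
small Python sketch of byte-wise matching. It assumes the rule text is stored
as UTF-8 and the message is UTF-8 too; it only shows the regex mechanics and
does not reproduce how SpamAssassin reads its config files.

```python
import re

A_ACUTE = "\u00e1".encode("utf-8")       # b'\xc3\xa1', two bytes
text = "J\u00e1n".encode("utf-8")        # b'J\xc3\xa1n'

# [aá] on bytes is really [a\xc3\xa1]: it matches ONE byte out of three.
char_class = re.compile(b"J[a" + A_ACUTE + b"]n")

# (?:a|á) on bytes is (?:a|\xc3\xa1): the second branch is a 2-byte string.
alternation = re.compile(b"J(?:a|" + A_ACUTE + b")n")

print("class:      ", "hit" if char_class.search(text) else "no hit")
print("alternation:", "hit" if alternation.search(text) else "no hit")
```

The class consumes only the 0xC3 byte and then fails on 0xA1 where it expects
"n", while the alternation consumes both bytes and matches.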

Re: UTF8 character in [] doesn't match

2018-12-23 Thread Matus UHLAR - fantomas


On Sun, 23 Dec 2018, Matus UHLAR - fantomas wrote:

I have tried to create rule that will match names "ján" and "jano" (john
and johnny in slovak languages).

I have created rule:

body LOCAL_JANO  /\bJ[aá]no\b/i


fixed:

body LOCAL_JANO  /\bJ[aá]no?\b/i


however, it does not match.


On 23.12.18 11:49, John Hardin wrote:

The "o" is not optional in that RE, so it would never match "ján".


Sorry!

I pasted it in the middle of editing, while trying to make it work or narrow
down the problem. The fixed version above didn't match either.

Even J[á]n did not match "Ján". Is there a problem with Perl and UTF-8?


# echo Ján | perl -ne 'if (/J[á]n/) {print "OK\n"} else { print "KO\n"}'
KO

After consulting Google:
https://stackoverflow.com/questions/21092427/perl-regex-replace-with-utf-8-characters#21092679
I found that this combination works:

# echo Ján | perl -Mutf8 -mopen=':std',':encoding(UTF-8)' -ne 'if (/J[á]n/) {print "OK\n"} else { print "KO\n"}'
OK

now, does SA use these "use"s above?
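
The KO/OK difference above comes down to bytes versus decoded characters.
A Python sketch of the same experiment follows (Python is used only as an
illustration; whether SA decodes anything is exactly the open question, so
this shows the mechanism and nothing about SA internals).

```python
import re

# What arrives on stdin when a UTF-8 terminal sends "Ján"
raw = "J\u00e1n\n".encode("utf-8")

# Byte semantics: pattern and input are both undecoded UTF-8 bytes,
# so the class [á] holds two separate bytes and matches only one.
byte_pattern = re.compile("J[\u00e1]n".encode("utf-8"))

# Character semantics: pattern and input are both decoded,
# so [á] is a class with a single character.
char_pattern = re.compile("J[\u00e1]n")

print("bytes:", "OK" if byte_pattern.search(raw) else "KO")
print("chars:", "OK" if char_pattern.search(raw.decode("utf-8")) else "KO")
```

This prints KO for the byte case and OK for the character case, mirroring
the two perl runs.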

While it doesn't directly answer your question about 
normalize-charset, this might work a little better:


 ifplugin Mail::SpamAssassin::Plugin::ReplaceTags
   body   LOCAL_JANO   /\bj<A>no?\b/i
   replace_rules  LOCAL_JANO
 endif


body            LOCAL_JANO  /\bj<A>no?\b/i
replace_rules   LOCAL_JANO

unfortunately, this doesn't work for Ján, only for Jano.



--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"Where do you want to go to die?" [Microsoft]

Re: UTF8 character in [] doesn't match

2018-12-23 Thread John Hardin


On Sun, 23 Dec 2018, Matus UHLAR - fantomas wrote:


Hello,

I have tried to create rule that will match names "ján" and "jano" (john
and johnny in slovak languages).

I have created rule:

body LOCAL_JANO  /\bJ[aá]no\b/i

however, it does not match.


The "o" is not optional in that RE, so it would never match "ján".

While it doesn't directly answer your question about normalize-charset, 
this might work a little better:


  ifplugin Mail::SpamAssassin::Plugin::ReplaceTags
    body   LOCAL_JANO   /\bj<A>no?\b/i
    replace_rules  LOCAL_JANO
  endif

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
  does quite what I want. I wish Christopher Robin was here."
   -- Peter da Silva in a.s.r
---
 2 days until Christmas
