Re: UTF8 character in [] doesn't match

2018-12-24 Thread John Hardin

On Sun, 23 Dec 2018, Matus UHLAR - fantomas wrote:

While it doesn't directly answer your question about normalize-charset, 
this might work a little better:


 ifplugin Mail::SpamAssassin::Plugin::ReplaceTags
   body   LOCAL_JANO   /\bj<A>no?\b/i
   replace_rules  LOCAL_JANO
 endif


body   LOCAL_JANO  /\bj<A>no?\b/i
replace_rules   LOCAL_JANO

unfortunately, this doesn't work for Ján, only for Jano.


I find that odd.

Okay, review of the replacetags replacements shows that a lot of basic 
UTF8 codepoints are missing - I wasn't thorough enough the last time I 
added sequences.


Fixing (most of) the omissions...

That's better:

Dec 24 12:21:58.354 [11460] dbg: rules-all: running body rule LOCAL_JANO
Dec 24 12:21:58.355 [11460] dbg: rules: ran body rule LOCAL_JANO ==> got hit: "Ján"
Dec 24 12:21:58.355 [11460] dbg: rules: ran body rule LOCAL_JANO ==> got hit: "Jano"
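
As a rough illustration of the tag-replacement idea discussed above, here is a minimal Python sketch. It is not the plugin's code: the tag table, the `expand_tags` helper, and the byte sequences chosen for the tag are assumptions made up for this example, and the real replacement definitions shipped with SpamAssassin are far more extensive.

```python
import re

# Hypothetical, heavily shortened tag table. The real replacement
# definitions ship with SpamAssassin and cover far more byte sequences.
TAGS = {
    # "a" or "á" in either case, as Latin-1 bytes or as UTF-8 byte pairs
    "A": r"(?:[aA\xe1\xc1]|\xc3[\xa1\x81])",
}


def expand_tags(pattern: str) -> str:
    """Replace each known <TAG> with its regex fragment; leave unknown tags alone."""
    return re.sub(
        r"<([A-Z0-9]+)>",
        lambda m: TAGS.get(m.group(1), m.group(0)),
        pattern,
    )


# Expand the tag, then compile the result as a byte pattern.
expanded = expand_tags(r"\bJ<A>no?\b")
rule = re.compile(expanded.encode("ascii"), re.IGNORECASE)

# Try both names in both encodings.
for name in ("J\u00e1n", "Jano"):
    for charset in ("latin-1", "utf-8"):
        print(name, charset, bool(rule.search(name.encode(charset))))
```

If a byte sequence is missing from a tag's definition, the expanded rule simply cannot match that spelling, which is consistent with the omissions described above.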


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
  does quite what I want. I wish Christopher Robin was here."
   -- Peter da Silva in a.s.r
---
 Tomorrow: Christmas

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K
On Mon, Dec 24, 2018 at 06:48:51PM +, RW wrote:
> On Mon, 24 Dec 2018 10:16:58 +0200
> Henrik K wrote:
> 
> > On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > > On Sun, 23 Dec 2018 20:04:28 +0100
> > > Matus UHLAR - fantomas wrote:
> > >   
> > > > Hello,
> > > > 
> > > > I have tried to create rule that will match names "ján" and
> > > > "jano" (john and johnny in slovak languages).
> > > > 
> > > > I have created rule:
> > > > 
> > > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > > 
> > > > however, it does not match.
> > > > 
> > > > Apparently the [á] does not match even when normalize_charset is
> > > > set to '1'.
> > > > 
> > > > any idea what can cause this?  
> > > 
> > > normalize_charset converts to UTF-8 but the tests are still done on
> > > bytes, so á isn't a character, it's a string. You need (?:a|á)
> > > instead of [aá].  
> > 
> > Nope, that makes no difference.  If message is UTF-8, neither of them
> > will match anyway.
> 
> I don't see why it wouldn't if the rules are edited in UTF-8, and the
> text is in or converted to UTF-8.

It might or might not. I'd still advocate using portable methods that are
sure to work.



Re: UTF8 character in [] doesn't match

2018-12-24 Thread RW
On Mon, 24 Dec 2018 10:16:58 +0200
Henrik K wrote:

> On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > On Sun, 23 Dec 2018 20:04:28 +0100
> > Matus UHLAR - fantomas wrote:
> >   
> > > Hello,
> > > 
> > > I have tried to create rule that will match names "ján" and
> > > "jano" (john and johnny in slovak languages).
> > > 
> > > I have created rule:
> > > 
> > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > 
> > > however, it does not match.
> > > 
> > > Apparently the [á] does not match even when normalize_charset is
> > > set to '1'.
> > > 
> > > any idea what can cause this?  
> > 
> > normalize_charset converts to UTF-8 but the tests are still done on
> > bytes, so á isn't a character, it's a string. You need (?:a|á)
> > instead of [aá].  
> 
> Nope, that makes no difference.  If message is UTF-8, neither of them
> will match anyway.

I don't see why it wouldn't if the rules are edited in UTF-8, and the
text is in or converted to UTF-8.
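
To make the byte-level argument concrete, here is a small Python sketch. It only illustrates plain byte-regex semantics, under the assumption that the rule file was saved as UTF-8 and the pattern is compiled byte for byte. Whether SpamAssassin behaves this way in a given setup is exactly what is disputed in this thread, so treat it as an illustration rather than a statement about SpamAssassin.

```python
import re

# Assumption: the rule file was saved as UTF-8 and the pattern is compiled
# byte for byte, so "á" in the rule is the two bytes C3 A1.
class_rule = re.compile(rb"J[a\xc3\xa1]no")    # what J[aá]no turns into
alt_rule = re.compile(rb"J(?:a|\xc3\xa1)no")   # what J(?:a|á)no turns into

utf8_body = "J\u00e1no".encode("utf-8")        # b"J\xc3\xa1no"

# The character class consumes exactly one byte, so it cannot span C3 A1.
print(class_rule.search(utf8_body))            # None
# The alternation can match the two-byte sequence as a unit.
print(alt_rule.search(utf8_body) is not None)  # True
# Both forms still match the plain ASCII spelling.
print(class_rule.search(b"Jano") is not None)  # True
```

Under these assumptions the class form also accepts nonsense such as `J\xc3no`, since each byte of the sequence is treated as a separate class member.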


Re: Is the SA Bayes implementation mathematically sound?

2018-12-24 Thread Rick Macdougall

On 2018-12-24 12:39 p.m., Ian Zimmerman wrote:
> On 2018-12-23 17:02, Rick Macdougall wrote:
>
> > I'm just going to jump in here and mention that I train my bayes in SA
> > and in Thunderbird email client.
> >
> > Thunderbird catches 99%+ and SA catches under 60% with the same
> > training data.
>
> Have you also compared the rates of False Positives?

I've seen one FP in Thunderbird in the last 5 years.

I haven't checked Bayes-only FPs in SA, but I will.

Regards,

Rick


Re: Is the SA Bayes implementation mathematically sound?

2018-12-24 Thread Ian Zimmerman
On 2018-12-23 17:02, Rick Macdougall wrote:

> I'm just going to jump in here and mention that I train my bayes in SA
> and in Thunderbird email client.
> 
> Thunderbird catches 99%+ and SA catches under 60% with the same
> training data.

Have you also compared the rates of False Positives?

-- 
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
To reply privately _only_ on Usenet and on broken lists
which rewrite From, fetch the TXT record for no-use.mooo.com.


Re: UTF8 character in [] doesn't match

2018-12-24 Thread Matus UHLAR - fantomas

> On Sun, 23 Dec 2018 20:04:28 +0100
> Matus UHLAR - fantomas wrote:
> > I have tried to create rule that will match names "ján" and
> > "jano" (john and johnny in slovak languages).
> >
> > I have created rule:
> >
> > body LOCAL_JANO  /\bJ[aá]no\b/i
> >
> > however, it does not match.
> >
> > Apparently the [á] does not match even when normalize_charset is set
> > to '1'.
> >
> > any idea what can cause this?



On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> normalize_charset converts to UTF-8 but the tests are still done on
> bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> of [aá].


Yes, luckily a variation of this works as expected...


On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> Nope, that makes no difference.  If message is UTF-8, neither of them will
> match anyway.
>
> The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
> or utf8, regardless of normalize_charset setting.  Config contents are never
> converted to anything, you need to make sure regex contains raw bytes for
> both cases.
>
> perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
>
> No one should use normalize_charset until things are possibly simplified for
> 4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.


On 24.12.18 10:18, Henrik K wrote:
> Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
> UTF-8 encoding on config files itself, it will only create more confusion.


I assumed normalize_charset expects rules to be in UTF-8; otherwise it would
be a little strange...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!


Re: UTF8 character in [] doesn't match

2018-12-24 Thread Pedro David Marco
 On Monday, December 24, 2018, 9:49:11 AM GMT+1, Henrik K  wrote:
 
 
>... so for general file portability this would be even better:
>
>(?:[a\xe1]|\xc3\xa1)

I fully agree with Henrik, but would add a small detail: in some cases I have
found problems using BODY to locate special characters (most likely, as I
understand it, because of how the HTML parser handles words). Using RAWBODY
wherever possible gives me better results.

> Merry Christmas all. ;-)

Thanks Henrik, the same to you and everybody.

PedroD



  

Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K
On Mon, Dec 24, 2018 at 10:18:31AM +0200, Henrik K wrote:
> On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> > On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > > On Sun, 23 Dec 2018 20:04:28 +0100
> > > Matus UHLAR - fantomas wrote:
> > > 
> > > > Hello,
> > > > 
> > > > I have tried to create rule that will match names "ján" and
> > > > "jano" (john and johnny in slovak languages).
> > > > 
> > > > I have created rule:
> > > > 
> > > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > > 
> > > > however, it does not match.
> > > > 
> > > > Apparently the [á] does not match even when normalize_charset is set
> > > > to '1'.
> > > > 
> > > > any idea what can cause this?
> > > 
> > > normalize_charset converts to UTF-8 but the tests are still done on
> > > bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> > > of [aá].
> > 
> > Nope, that makes no difference.  If message is UTF-8, neither of them will
> > match anyway.
> > 
> > The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
> > or utf8, regardless of normalize_charset setting.  Config contents are never
> > converted to anything, you need to make sure regex contains raw bytes for
> > both cases.
> > 
> > perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
> > 
> > No one should use normalize_charset until things are possibly simplified for
> > 4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.
> 
> Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
> UTF-8 encoding on config files itself, it will only create more confusion.

... so for general file portability this would be even better:

(?:[a\xe1]|\xc3\xa1)

Merry Christmas all. ;-)
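
A quick way to sanity-check the escaped pattern outside SpamAssassin is to run it as a byte regex against the same name in both encodings. The sketch below uses Python for that; the `hits` helper and the surrounding `\bJ...no?\b` rule are assumptions added for the example, and only the `(?:[a\xe1]|\xc3\xa1)` fragment comes from the message above.

```python
import re

# Pure-ASCII pattern: Latin-1 "á" is byte E1, UTF-8 "á" is bytes C3 A1.
PORTABLE = re.compile(rb"\bJ(?:[a\xe1]|\xc3\xa1)no?\b", re.IGNORECASE)


def hits(text: str, charset: str) -> bool:
    """Encode the text the way a message body might arrive and test the rule."""
    return PORTABLE.search(text.encode(charset)) is not None


# Exercise the rule on the names from the thread plus one non-match.
for name in ("J\u00e1n", "Jano", "Jon"):
    for charset in ("latin-1", "utf-8"):
        print(f"{name!r:8} {charset:8} {hits(name, charset)}")
```

One limitation of this sketch: case-insensitive matching on bytes only folds ASCII letters, so an uppercase "Á" would need its own byte sequences added to the pattern.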



Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K
On Mon, Dec 24, 2018 at 10:16:58AM +0200, Henrik K wrote:
> On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> > On Sun, 23 Dec 2018 20:04:28 +0100
> > Matus UHLAR - fantomas wrote:
> > 
> > > Hello,
> > > 
> > > I have tried to create rule that will match names "ján" and
> > > "jano" (john and johnny in slovak languages).
> > > 
> > > I have created rule:
> > > 
> > > body LOCAL_JANO  /\bJ[aá]no\b/i
> > > 
> > > however, it does not match.
> > > 
> > > Apparently the [á] does not match even when normalize_charset is set
> > > to '1'.
> > > 
> > > any idea what can cause this?
> > 
> > normalize_charset converts to UTF-8 but the tests are still done on
> > bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> > of [aá].
> 
> Nope, that makes no difference.  If message is UTF-8, neither of them will
> match anyway.
> 
> The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
> or utf8, regardless of normalize_charset setting.  Config contents are never
> converted to anything, you need to make sure regex contains raw bytes for
> both cases.
> 
> perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'
> 
> No one should use normalize_charset until things are possibly simplified for
> 4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.

Oh and SpamAssassin assumes config files are latin1/bytes, so do not use
UTF-8 encoding on config files itself, it will only create more confusion.
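
One way to keep a rule file free of encoding ambiguity is to write every non-ASCII byte as a `\xNN` escape. The following Python sketch shows the idea; `escape_non_ascii` is a hypothetical helper written for this example, not something SpamAssassin provides.

```python
def escape_non_ascii(raw: bytes) -> str:
    """Return the bytes as text, with every byte >= 0x80 written as \\xNN."""
    return "".join(chr(b) if b < 0x80 else f"\\x{b:02x}" for b in raw)


# The rule from the original question, with "á" given as a code point.
rule = "/\\bJ[a\u00e1]no\\b/i"

# The same rule text as it would sit on disk in each encoding.
print(escape_non_ascii(rule.encode("latin-1")))  # /\bJ[a\xe1]no\b/i
print(escape_non_ascii(rule.encode("utf-8")))    # /\bJ[a\xc3\xa1]no\b/i
```

The second line of output shows why a UTF-8 encoded rule file is confusing here: the single character inside the brackets has become two separate bytes.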



Re: UTF8 character in [] doesn't match

2018-12-24 Thread Henrik K
On Sun, Dec 23, 2018 at 11:11:39PM +, RW wrote:
> On Sun, 23 Dec 2018 20:04:28 +0100
> Matus UHLAR - fantomas wrote:
> 
> > Hello,
> > 
> > I have tried to create rule that will match names "ján" and
> > "jano" (john and johnny in slovak languages).
> > 
> > I have created rule:
> > 
> > body LOCAL_JANO  /\bJ[aá]no\b/i
> > 
> > however, it does not match.
> > 
> > Apparently the [á] does not match even when normalize_charset is set
> > to '1'.
> > 
> > any idea what can cause this?
> 
> normalize_charset converts to UTF-8 but the tests are still done on
> bytes, so á isn't a character, it's a string. You need (?:a|á) instead
> of [aá].

Nope, that makes no difference.  If message is UTF-8, neither of them will
match anyway.

The only correct solution: (?:[aá]|\xc3\xa1) works for any message in latin1
or utf8, regardless of normalize_charset setting.  Config contents are never
converted to anything, you need to make sure regex contains raw bytes for
both cases.

perl -MEncode -e 'print unpack("H*", encode("UTF-8", "á"))'

No one should use normalize_charset until things are possibly simplified for
4.0.0, it's a terrible mess.  Also it was broken for html as per bug 7656.
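
For readers who prefer to look up the raw bytes without relying on the terminal's encoding, here is a rough Python equivalent of the one-liner. The `hex_bytes` helper is a name invented for this sketch, and the character is written as a code point escape so that the result does not depend on how the shell or editor encodes "á".

```python
def hex_bytes(char: str, charset: str) -> str:
    """Return the hex form of the bytes that encode char in the given charset."""
    return char.encode(charset).hex()


A_ACUTE = "\u00e1"  # "á", given as a code point to avoid editor/terminal issues

print(hex_bytes(A_ACUTE, "utf-8"))    # c3a1
print(hex_bytes(A_ACUTE, "latin-1"))  # e1
```

These are the values that end up as `\xc3\xa1` and `\xe1` in a byte-oriented regex.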