Bayes in V4 compared to V3

2024-09-12 Thread Grega via users
Hi.

I have SA 4.0.1 configured and all is good, except for Bayes. It IS working and it 
IS learning, but when it classifies mail it is not nearly as decisive as it was 
in V3.
I have:

dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 150, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537
So I have enough spam/ham and really enough tokens...
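For reference, the two epoch values and the corpus counts in that debug output can be turned into something easier to read. A minimal Python sketch; the variable names are only illustrative and the numbers are copied from the debug line:

```python
from datetime import datetime, timezone

# Values copied from the "bayes: corpus size" / "bayes: DB expiry" debug output.
oldest_atime = 1725361640
newest_atime = 1725888528
nspam, nham = 1190, 12441

span_days = (newest_atime - oldest_atime) / 86400  # 86400 seconds per day
ham_per_spam = nham / nspam

print(datetime.fromtimestamp(oldest_atime, tz=timezone.utc).date())
print(datetime.fromtimestamp(newest_atime, tz=timezone.utc).date())
print(f"token atime span: {span_days:.1f} days")
print(f"ham messages per spam message: {ham_per_spam:.1f}")
```

With these figures the token access times span roughly six days, and there are about ten learned ham messages for every learned spam message.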
What I find weird is this:
BAYES_50 and BAYES_40 have about 10,000 hits EACH, which is A LOT

BAYES_80 only 600
BAYES_95 even less: 341
BAYES_99: 284
BAYES_20 only 150
BAYES_60 only 87
I have no BAYES lower than 40 at all. I am training and also use autolearn.
I have also transferred corpus trained on SA v3 where it worked correctly.
Is Spamassassin v4 really so much more conservative or am I doing something 
wrong here?
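To put those hit counts in proportion, here is a small Python tally. It is only a sketch: the two "about 10,000" figures are approximations taken at face value, and the dictionary is typed in by hand from the message rather than read from any log.

```python
# Hit counts as quoted in the message; BAYES_40 and BAYES_50 are rounded figures.
hits = {
    "BAYES_20": 150,
    "BAYES_40": 10000,
    "BAYES_50": 10000,
    "BAYES_60": 87,
    "BAYES_80": 600,
    "BAYES_95": 341,
    "BAYES_99": 284,
}

total = sum(hits.values())
undecided_share = (hits["BAYES_40"] + hits["BAYES_50"]) / total

# Print each bucket with its share of all Bayes hits.
for rule, count in hits.items():
    print(f"{rule:9s} {count:6d} {count / total:6.1%}")
print(f"BAYES_40 + BAYES_50 share: {undecided_share:.1%}")
```

On these rounded figures, about 93% of all Bayes hits land in the two middle buckets.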

Also, one more thing...
Some mails don't even have a BAYES rule in the score list; confirmed on 2 installs.

 1.95 DATE_IN_FUTURE_06_12        Date: is 6 to 12 hours after Received: date
 1.10 DCC_CHECK                   Detected as bulk mail by DCC (dcc-servers.net)
 0.10 DKIM_SIGNED                 Message has a DKIM or DK signature, not necessarily valid
-0.50 DKIM_VALID                  Message has at least one valid DKIM or DK signature
-1.00 DKIM_VALID_AU               Message has a valid DKIM or DK signature from author's domain
-0.10 DKIM_VALID_EF               Message has a valid DKIM or DK signature from envelope-from domain
-0.00 DMARC_PASS                  DMARC pass policy
 0.25 FREEMAIL_ENVFROM_END_DIGIT  Envelope-from freemail username ends in digit
 0.30 FREEMAIL_FROM               Sender email is commonly abused enduser mail provider
 0.00 HTML_MESSAGE                HTML included in message
-0.00 RCVD_IN_DNSWL_NONE          Sender listed at https://www.dnswl.org/, no trust
-0.00 SPF_HELO_PASS               SPF: HELO matches SPF record
 2.50 URIBL_DBL_PHISH             Contains a Phishing URL listed in the Spamhaus DBL blocklist

But a lot of mails do have Bayes scores.
There are no errors in the logs and everything else is working fine...

I also tried emptying the Bayes DB and retraining it, with the same results...

Am I doing something wrong?

Regards, Grega


Re: Bayes "corpus" - how old?

2024-01-31 Thread Bill Cole

On 2024-01-31 at 08:16:13 UTC-0500 (Wed, 31 Jan 2024 14:16:13 +0100)
Matus UHLAR - fantomas 
is rumored to have said:


On 2024-01-30 at 12:08:18 UTC-0500 (Tue, 30 Jan 2024 18:08:18 +0100)
Matus UHLAR - fantomas 
is rumored to have said:

[...]
autolearn may help if your DB is well maintained, although I have 
disabled nearly all rules with negative scores, like


RCVD_IN_DNSWL_*
RCVD_IN_IADB_*
DKIMWL_WL_*
RCVD_IN_MSPIKE_*
RCVD_IN_VALIDITY_*
USER_IN_DEF_*
ALL_TRUSTED

etc, because spammers often abuse these.
I mean, they may have negative score but don't train on them.


On 30.01.24 15:31, Bill Cole wrote:
If spammers can 'abuse' ALL_TRUSTED you have a major problem. Either 
a serious misconfiguration or compromised machines in 
trusted_networks.


Can't ALL_TRUSTED happen if a spammer delivers mail directly to my 
network, or if the last mail server removes Received: headers?

I think this happened to me in the past, but I may be wrong.


I just did a manual test on my personal machine to confirm: mail entered 
manually in a connection to port 25 from an unprivileged network with no 
Received headers did NOT get an ALL_TRUSTED match.


The semantics around the word 'trusted' in SA are subtle and arcane. 
There's an important distinction between trusting that a particular MTA 
writes transparent and honest Received headers and trusting that a 
particular MTA does not relay spam. For example, I have 2 address blocks 
in my trusted_networks that are used by the ASF for forwarding, which I 
needed precisely because those machines sometimes forward spam and I 
need SA to look beyond the immediate clients, which I know tell me the 
truth about where they get the spam they offer me.
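
The distinction can be illustrated with a deliberately simplified model. The Python sketch below is not SpamAssassin's implementation: it assumes the relay chain has already been parsed out of the Received headers into a non-empty list of IP addresses (newest first), and the function name and the documentation-range networks are hypothetical. It walks the chain while relays fall inside trusted_networks and reports the first relay outside them.

```python
import ipaddress

def first_untrusted_relay(relay_ips, trusted_networks):
    """Return the first relay IP outside trusted_networks, or None if all are inside.

    relay_ips: non-empty list of relay addresses from Received headers, newest first.
    """
    nets = [ipaddress.ip_network(n) for n in trusted_networks]
    for ip in relay_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in nets):
            return ip
    return None

# Documentation-range addresses, purely illustrative.
trusted = ["192.0.2.0/24", "198.51.100.0/24"]

# A forwarder listed in trusted_networks relays mail it received from outside.
forwarded = ["192.0.2.10", "203.0.113.7"]
# Mail that only ever passed through trusted hosts.
internal = ["192.0.2.10", "198.51.100.5"]

print(first_untrusted_relay(forwarded, trusted))
print(first_untrusted_relay(internal, trusted))
```

In this model, listing the forwarder as trusted does not vouch for the mail it relays; it only lets the check look past the forwarder to the host that handed it the message.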



--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Re: Bayes "corpus" - how old?

2024-01-31 Thread Matus UHLAR - fantomas

On 2024-01-30 at 12:08:18 UTC-0500 (Tue, 30 Jan 2024 18:08:18 +0100)
Matus UHLAR - fantomas 
is rumored to have said:

[...]
autolearn may help if your DB is well maintained, although I have 
disabled nearly all rules with negative scores, like


RCVD_IN_DNSWL_*
RCVD_IN_IADB_*
DKIMWL_WL_*
RCVD_IN_MSPIKE_*
RCVD_IN_VALIDITY_*
USER_IN_DEF_*
ALL_TRUSTED

etc, because spammers often abuse these.
I mean, they may have negative score but don't train on them.


On 30.01.24 15:31, Bill Cole wrote:
If spammers can 'abuse' ALL_TRUSTED you have a major problem. Either a 
serious misconfiguration or compromised machines in trusted_networks.


Can't ALL_TRUSTED happen if a spammer delivers mail directly to my network,
or if the last mail server removes Received: headers?

I think this happened to me in the past, but I may be wrong.
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors


Re: Bayes "corpus" - how old?

2024-01-30 Thread Bill Cole

On 2024-01-30 at 12:08:18 UTC-0500 (Tue, 30 Jan 2024 18:08:18 +0100)
Matus UHLAR - fantomas 
is rumored to have said:

[...]
autolearn may help if your DB is well maintained, although I have 
disabled nearly all rules with negative scores, like


RCVD_IN_DNSWL_*
RCVD_IN_IADB_*
DKIMWL_WL_*
RCVD_IN_MSPIKE_*
RCVD_IN_VALIDITY_*
USER_IN_DEF_*
ALL_TRUSTED

etc, because spammers often abuse these.
I mean, they may have negative score but don't train on them.


If spammers can 'abuse' ALL_TRUSTED you have a major problem. Either a 
serious misconfiguration or compromised machines in trusted_networks.


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Re: Bayes "corpus" - how old?

2024-01-30 Thread Matus UHLAR - fantomas

On 30.01.24 09:59, joe a wrote:

Advisable to "prune" Bayes data based on age?

While cleaning up recent Ham/Spam, found my "saved SPAM" goes back 
to 2013.


Why that's over . . . wait, I need to take off my socks . . .

So, how old is "too old".  For saved SPAM?



On 1/30/2024 10:58:52, Matus UHLAR - fantomas wrote:

I did retrain on old spam a few times and it was working fine.
Depends on how much mail you have:

0.000  0   7542  0  non-token data: nspam
0.000  0  80869  0  non-token data: nham
0.000  0 996032  0  non-token data: ntokens
0.000  0 1172945918  0  non-token data: oldest atime

So even old spam may be fine. However, you need plenty of ham to train on, 
otherwise everything starts looking like spam.
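
The four metadata lines quoted above (they look like rows from "sa-learn --dump magic") can be read programmatically. The following Python sketch assumes only the column layout visible in those lines, with the value in the third column; the function name is made up for illustration, and real output contains more rows than shown here.

```python
from datetime import datetime, timezone

# The four lines quoted above, in the layout they were posted in.
dump = """\
0.000  0   7542  0  non-token data: nspam
0.000  0  80869  0  non-token data: nham
0.000  0 996032  0  non-token data: ntokens
0.000  0 1172945918  0  non-token data: oldest atime
"""

def parse_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        fields, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # not a metadata line
        values[label.strip()] = int(fields.split()[2])
    return values

magic = parse_magic(dump)
ham_per_spam = magic["nham"] / magic["nspam"]
oldest = datetime.fromtimestamp(magic["oldest atime"], tz=timezone.utc)

print(magic)
print(f"ham per spam: {ham_per_spam:.1f}")
print(f"oldest token atime: {oldest:%Y-%m-%d}")
```

For this database the oldest token access time falls in 2007, and there are roughly ten ham messages per spam message.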


On 30.01.24 11:12, joe a wrote:
Recently missed spam has increased a bit, so I was dropping it into 
"missed spam" and went poking through marked spam and found lots of 
"missed ham", which triggered my pondering.


Training on false positives/false negatives is important to keep it up to 
date.


Full retraining only makes sense if you lose your DB, it gets corrupted, or 
it starts misclassifying too often (whether or not the reason is known).


autolearn may help if your DB is well maintained, although I have disabled 
nearly all rules with negative scores, like


RCVD_IN_DNSWL_*
RCVD_IN_IADB_*
DKIMWL_WL_*
RCVD_IN_MSPIKE_*
RCVD_IN_VALIDITY_*
USER_IN_DEF_*
ALL_TRUSTED

etc, because spammers often abuse these.
I mean, they may have negative score but don't train on them.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
M$ Win's are shit, do not use it !


Re: Bayes "corpus" - how old?

2024-01-30 Thread joe a

On 1/30/2024 10:58:52, Matus UHLAR - fantomas wrote:

On 30.01.24 09:59, joe a wrote:

Advisable to "prune" Bayes data based on age?

While cleaning up recent Ham/Spam, found my "saved SPAM" goes back to 
2013.


Why that's over . . . wait, I need to take off my socks . . .

So, how old is "too old".  For saved SPAM?



I did retrain on old spam a few times and it was working fine.
Depends on how much mail you have:

0.000  0   7542  0  non-token data: nspam
0.000  0  80869  0  non-token data: nham
0.000  0 996032  0  non-token data: ntokens
0.000  0 1172945918  0  non-token data: oldest atime

So even old spam may be fine. However, you need plenty of ham to train on, 
otherwise everything starts looking like spam.




Recently missed spam has increased a bit, so I was dropping it into 
"missed spam" and went poking through marked spam and found lots of 
"missed ham", which triggered my pondering.





Re: Bayes "corpus" - how old?

2024-01-30 Thread Bill Cole

On 2024-01-30 at 09:59:52 UTC-0500 (Tue, 30 Jan 2024 09:59:52 -0500)
joe a 
is rumored to have said:


Advisable to "prune" Bayes data based on age?


Yes. That is why it has an expiration model. Expiration may be de facto 
blocked on some busy systems so you may need to explicitly force it 
occasionally. The command "sa-learn --dump magic" will show you 
expiration and other Bayes metadata.


While cleaning up recent Ham/Spam, found my "saved SPAM" goes back to 
2013.


Why that's over . . . wait, I need to take off my socks . . .


I've still got some almost 3x as old. BUT: I do not use it for training 
SA today.



So, how old is "too old".  For saved SPAM?


I would suggest a year as the outer edge of Bayes usefulness.
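
As an illustration of applying a one-year cutoff to a saved corpus before training, here is a Python sketch. It assumes one message per file and that each file's modification time reflects when the message arrived, which may not hold for every mail store; the function name and file names are hypothetical.

```python
import os
import tempfile
import time

ONE_YEAR = 365 * 24 * 3600  # the suggested outer edge, in seconds

def recent_messages(directory, max_age=ONE_YEAR, now=None):
    """Return sorted paths of files in `directory` modified within max_age seconds."""
    now = time.time() if now is None else now
    keep = []
    for entry in os.scandir(directory):
        if entry.is_file() and now - entry.stat().st_mtime <= max_age:
            keep.append(entry.path)
    return sorted(keep)

# Self-contained demo with two fake messages: one fresh, one about 11 years old.
with tempfile.TemporaryDirectory() as corpus:
    fresh = os.path.join(corpus, "fresh.eml")
    stale = os.path.join(corpus, "stale.eml")
    for path in (fresh, stale):
        with open(path, "w") as fh:
            fh.write("Subject: test\n\nbody\n")
    old = time.time() - 11 * ONE_YEAR
    os.utime(stale, (old, old))  # backdate the stale message
    selected = [os.path.basename(p) for p in recent_messages(corpus)]

print(selected)
```

Only the files that pass the cutoff would then be handed to training; the older archive stays on disk for other uses.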

I find it helpful to keep my decades of garbage because I use them (and 
my ham archive) in developing prospective rules. There are non-obvious 
fingerprints in some spam that imply decades-long spamming operations.



--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Re: Bayes "corpus" - how old?

2024-01-30 Thread Matus UHLAR - fantomas

On 30.01.24 09:59, joe a wrote:

Advisable to "prune" Bayes data based on age?

While cleaning up recent Ham/Spam, found my "saved SPAM" goes back to 
2013.


Why that's over . . . wait, I need to take off my socks . . .

So, how old is "too old".  For saved SPAM?



I did retrain on old spam a few times and it was working fine.
Depends on how much mail you have:

0.000  0   7542  0  non-token data: nspam
0.000  0  80869  0  non-token data: nham
0.000  0 996032  0  non-token data: ntokens
0.000  0 1172945918  0  non-token data: oldest atime

So even old spam may be fine. However, you need plenty of ham to train on, 
otherwise everything starts looking like spam.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux - It's now safe to turn on your computer.
Linux - Teraz mozete pocitac bez obav zapnut.


Bayes "corpus" - how old?

2024-01-30 Thread joe a

Advisable to "prune" Bayes data based on age?

While cleaning up recent Ham/Spam, found my "saved SPAM" goes back to 
2013.


Why that's over . . . wait, I need to take off my socks . . .

So, how old is "too old".  For saved SPAM?





Re: Bayes Stopword

2023-12-29 Thread Jimmy
This is what I believe: the words need to be trimmed or separated, and
careful consideration is required to determine the language in order to
perform accurate cutoffs.

Jimmy

On Fri, Dec 29, 2023 at 5:16 PM  wrote:

> "ทุก" is not considered a word because it's part of the token
> "ทุกวันพุธเล่นชนะรับเพิ่ม".
> Words must be separated by spaces, otherwise we should skip the word
> "theme" just because "the" is in english stopword list.
> No idea if this makes sense for asian languages.
>
>   Giovanni
>
> On 12/29/23 11:04, Jimmy wrote:
> >
> > The sample email and word list should contain at least these words.
> >
> > ถูก
> > เลย
> > ทุก
> >
> > Jimmy
> >
> > On Fri, Dec 29, 2023 at 4:47 PM <giova...@paclan.it> wrote:
> >
> > I do not speak Thai but I cannot see any word in the sample email that should match that list.
> > Which word do you think should match the regexp ?
> > Giovanni
> >
> > On 12/29/23 10:08, Jimmy wrote:
> >  > You can use this word list
> >  >
> >  >
> >  > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> >  >
> >  > Jimmy
> >  >
> >  > On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:
> >  >
> >  > To create the stopwords regexp I used the script I shared in a previous email and a list of words one per line.
> >  > Could you share the list you are using ?
> >  >
> >  > Giovanni
> >  >
> >  > On 12/29/23 09:22, Jimmy wrote:
> >  >  > I use SpamAssassin 4.0.0 (2022-12-14)
> >  >  >
> >  >  > $ spamassassin -D --lint 2>&1 | grep bayes:
> >  >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> >  >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> >  >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> >  >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> >  >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> >  >  > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> >  >  > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> >  >  > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> >  >  > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> >  >  >
> >  >  >
> >  >  > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> >  >  > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword lis

Re: Bayes Stopword

2023-12-29 Thread giovanni

"ทุก" is not considered a word because it's part of the token 
"ทุกวันพุธเล่นชนะรับเพิ่ม".
Words must be separated by spaces; otherwise we would have to skip the word "theme" just because 
"the" is in the English stopword list.
No idea if this makes sense for Asian languages.
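
The behaviour described here can be reproduced with a small Python sketch. It is an illustration of whole-token matching only, not the code in Plugin/Bayes.pm; the helper names are invented, and the short word lists are stand-ins for the real stopword lists.

```python
import re

def build_stopword_re(words):
    """Compile an alternation of the given stopwords."""
    return re.compile("|".join(re.escape(w) for w in words))

def split_tokens(text, stop_re):
    """Split on whitespace; a token is skipped only if the whole token is a stopword."""
    kept, skipped = [], []
    for token in text.split():
        (skipped if stop_re.fullmatch(token) else kept).append(token)
    return kept, skipped

english = build_stopword_re(["the", "a"])
thai = build_stopword_re(["ถูก", "เลย", "ทุก"])

# "the" is skipped, "theme" is kept even though it starts with "the".
print(split_tokens("the theme", english))
# Without spaces the whole run is a single token, so "ทุก" never matches.
print(split_tokens("ทุกวันพุธเล่นชนะรับเพิ่ม", thai))
# With a space inserted by hand, "ทุก" becomes its own token and is skipped.
print(split_tokens("ทุก วันพุธ", thai))
```

This is why a language written without spaces between words would need some form of word segmentation before a stopword list could take effect.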

 Giovanni

On 12/29/23 11:04, Jimmy wrote:


The sample email and word list should contain at least these words.

ถูก
เลย
ทุก

Jimmy

On Fri, Dec 29, 2023 at 4:47 PM <giova...@paclan.it> wrote:

I do not speak Thai but I cannot see any word in the sample email that 
should match that list.
Which word do you think should match the regexp ?
   Giovanni

On 12/29/23 10:08, Jimmy wrote:
 > You can use this word list
 >
 > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
 >
 > Jimmy
 >
 > On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:
 >
 >     To create the stopwords regexp I used the script I shared in a previous email and a list of words one per line.
 >     Could you share the list you are using ?
 >
 >         Giovanni
 >
 >     On 12/29/23 09:22, Jimmy wrote:
 >      > I use SpamAssassin 4.0.0 (2022-12-14)
 >      >
 >      > $ spamassassin -D --lint 2>&1 | grep bayes:
 >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
 >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
 >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
 >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
 >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
 >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
 >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
 >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
 >      > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
 >      >
 >      >
 >      > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
 >      > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
 >      >
 >      > You can use "บาท" that was listed in regexp pattern but somehow I don't know why it not show skipped token in bayes.
 >      >
 >      > Jimmy
 >      >
 >      >
 >      > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
 >      >
 >      >     Config line produces a syntax error for me:
 >      >     config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
 >      >
 >      >     Could you share the word list in utf8 ?
  

Re: Bayes Stopword

2023-12-29 Thread Jimmy
The sample email and word list should contain at least these words.

ถูก
เลย
ทุก

Jimmy

On Fri, Dec 29, 2023 at 4:47 PM  wrote:

> I do not speak Thai but I cannot see any word in the sample email that
> should match that list.
> Which word do you think should match the regexp ?
>   Giovanni
>
> On 12/29/23 10:08, Jimmy wrote:
> > You can use this word list
> >
> >
> > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> >
> > Jimmy
> >
> > On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:
> >
> > To create the stopwords regexp I used the script I shared in a previous email and a list of words one per line.
> > Could you share the list you are using ?
> >
> > Giovanni
> >
> > On 12/29/23 09:22, Jimmy wrote:
> >  > I use SpamAssassin 4.0.0 (2022-12-14)
> >  >
> >  > $ spamassassin -D --lint 2>&1 | grep bayes:
> >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> >  > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> >  > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> >  > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> >  > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> >  > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> >  > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> >  >
> >  >
> >  > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> >  > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
> >  >
> >  > You can use "บาท" that was listed in regexp pattern but somehow I don't know why it not show skipped token in bayes.
> >  >
> >  > Jimmy
> >  >
> >  >
> >  > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
> >  >
> >  > Config line produces a syntax error for me:
> >  > config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
> >  >
> >  > Could you share the word list in utf8 ?
> >  > I tried adding "บาท" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
> >  > Bayes stopwords languages must also be enabled using "bayes_stopword_languages" config keyword, by default only english is enabled.
> >  > Giovanni
> >  >
> >  > On 12/28/23 17:06, Jimmy wrote:
> >  >  > bayes_stopwor

Re: Bayes Stopword

2023-12-29 Thread giovanni

I do not speak Thai but I cannot see any word in the sample email that should 
match that list.
Which word do you think should match the regexp ?
 Giovanni

On 12/29/23 10:08, Jimmy wrote:

You can use this word list

https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt

Jimmy

On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:

To create the stopwords regexp I used the script I shared in a previous 
email and a list of words one per line.
Could you share the list you are using ?

    Giovanni

On 12/29/23 09:22, Jimmy wrote:
 > I use SpamAssassin 4.0.0 (2022-12-14)
 >
 > $ spamassassin -D --lint 2>&1 | grep bayes:
 > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
 > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
 > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
 > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
 > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
 > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
 > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
 > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
 > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
 >
 >
 > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
 > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
 >
 > You can use "บาท" that was listed in regexp pattern but somehow I don't know why it not show skipped token in bayes.
 >
 > Jimmy
 >
 >
 > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
 >
 >     Config line produces a syntax error for me:
 >     config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
 >
 >     Could you share the word list in utf8 ?
 >     I tried adding "บาท" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
 >     Bayes stopwords languages must also be enabled using "bayes_stopword_languages" config keyword, by default only english is enabled.
 >        Giovanni
 >
 >     On 12/28/23 17:06, Jimmy wrote:
 >      > bayes_stopword_th https://pastebin.pl/view/0838138d
 >      > Sample mail https://pastebin.pl/view/e5a2c5b8
 >      >
 >      > Jimmy
 >      >
 >      >
 >      > On Thu, Dec 28, 2023 at 10:59 PM mailto:gio

Re: Bayes Stopword

2023-12-29 Thread Jimmy
You can use this word list

https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt

Jimmy

On Fri, Dec 29, 2023 at 3:59 PM  wrote:

> To create the stopwords regexp I used the script I shared in a previous
> email and a list of words one per line.
> Could you share the list you are using ?
>
>Giovanni
>
> On 12/29/23 09:22, Jimmy wrote:
> > I use SpamAssassin 4.0.0 (2022-12-14)
> >
> > $ spamassassin -D --lint 2>&1 | grep bayes:
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> >
> >
> > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
> >
> > You can use "บาท" that was listed in regexp pattern but somehow I don't know why it not show skipped token in bayes.
> >
> > Jimmy
> >
> >
> > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
> >
> > Config line produces a syntax error for me:
> > config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
> >
> > Could you share the word list in utf8 ?
> > I tried adding "บาท" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
> > Bayes stopwords languages must also be enabled using "bayes_stopword_languages" config keyword, by default only english is enabled.
> >Giovanni
> >
> > On 12/28/23 17:06, Jimmy wrote:
> >  > bayes_stopword_th https://pastebin.pl/view/0838138d
> >  > Sample mail https://pastebin.pl/view/e5a2c5b8
> >  >
> >  > Jimmy
> >  >
> >  >
> >  > On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:
> >  >
> >  > Could you share a config line and a sample you are using ?
> >  > Giovanni
> >  >
> >  > On 12/28/23 16:26, Jimmy wrote:
> >  >  > Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
> >  >  >
> >  >  > Jimmy
> >  >  >
> >  >  > On Thu, Dec 28, 2023 at 10:13 PM  <mailto:giova...@paclan.it> <mailto:giova...@paclan.it  giova...@paclan.it>> <mailto:giova...@

Re: Bayes Stopword

2023-12-29 Thread giovanni

To create the stopwords regexp I used the script I shared in a previous email,
fed with a list of words, one per line.
Could you share the list you are using?

  Giovanni

On 12/29/23 09:22, Jimmy wrote:

I use SpamAssassin 4.0.0 (2022-12-14)

$ spamassassin -D --lint 2>&1 | grep bayes:
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th 
ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi


$ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in 
stopword list for language 'en'

You can use "บาท" that was listed in regexp pattern but somehow I don't know 
why it not show skipped token in bayes.

Jimmy



Re: Bayes Stopword

2023-12-29 Thread Jimmy
I use SpamAssassin 4.0.0 (2022-12-14)

$ spamassassin -D --lint 2>&1 | grep bayes:
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en
th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi


$ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's
in stopword list for language 'en'

You can test with "บาท", which is listed in the regexp pattern, but I don't
know why it is not reported as a skipped token in the bayes debug output.

Jimmy
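
One way to see which stopword lists are actually firing is to tally the "skipped token" debug lines by language. This is a minimal sketch that assumes the debug line format shown above (one log entry per line); the function name count_skipped_by_lang is made up for illustration.

```shell
# Tally "skipped token" debug lines per stopword language.
# Reads spamassassin debug output on stdin and prints lines like "en=2".
count_skipped_by_lang() {
  grep -o "stopword list for language '[a-z]*'" \
    | sed "s/.*'\([a-z]*\)'/\1/" \
    | sort | uniq -c | awk '{print $2 "=" $1}'
}
```

It could be fed with `spamassassin -D bayes,learn < test.msg 2>&1 | count_skipped_by_lang`. If no `th=` line appears, no token matched the Thai pattern for that message.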


On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:

> Config line produces a syntax error for me:
> config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1):
> bayes_stopword_th
>
> Could you share the word list in utf8 ?
> I tried adding "บาท" to
> https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> and it produces a working regexp.
> Bayes stopwords languages must also be enabled using
> "bayes_stopword_languages" config keyword, by default only english is
> enabled.
>   Giovanni

Re: Bayes Stopword

2023-12-28 Thread giovanni

The config line produces a syntax error for me:
config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1):
bayes_stopword_th

Could you share the word list in UTF-8?
I tried adding "บาท" to
https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and
it produces a working regexp.
Bayes stopword languages must also be enabled using the "bayes_stopword_languages"
config keyword; by default only English is enabled.
 Giovanni
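
Whether a language really is enabled can be read back from the debug output. The sketch below assumes the "stopwords for languages enabled" debug line quoted elsewhere in this thread; the function name enabled_stopword_langs is hypothetical.

```shell
# Print the enabled stopword language codes, one per line.
# Reads spamassassin debug output on stdin.
enabled_stopword_langs() {
  sed -n 's/.*bayes: stopwords for languages enabled: //p' | tr -s ' ' '\n'
}
```

For example, `spamassassin -D --lint 2>&1 | enabled_stopword_langs | grep -x th` should print `th` only when Thai is enabled. Piping the list through `sort | uniq -d` would also show any language that is listed more than once.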

On 12/28/23 17:06, Jimmy wrote:

bayes_stopword_th https://pastebin.pl/view/0838138d
Sample mail https://pastebin.pl/view/e5a2c5b8

Jimmy









Re: Bayes Stopword

2023-12-28 Thread Jimmy
bayes_stopword_th https://pastebin.pl/view/0838138d
Sample mail https://pastebin.pl/view/e5a2c5b8

Jimmy


On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:

> Could you share a config line and a sample you are using ?
>   Giovanni


Re: Bayes Stopword

2023-12-28 Thread giovanni

Could you share a config line and a sample you are using ?
 Giovanni

On 12/28/23 16:26, Jimmy wrote:

Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why 
it is not being skipped. I suspect that if words are not separated by spaces, 
longer words may not match those patterns.

Jimmy








Re: Bayes Stopword

2023-12-28 Thread Jimmy
Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate
why it is not being skipped. I suspect that if words are not separated by
spaces, longer words may not match those patterns.

Jimmy

On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:

> "spamassassin -D bayes" will tell you, you should see a line like:
> bayes: skipped token 'from' because it's in stopword list for language 'en'
>
>   Giovanni


Re: Bayes Stopword

2023-12-28 Thread giovanni

"spamassassin -D bayes" will tell you; you should see a line like:
bayes: skipped token 'from' because it's in stopword list for language 'en'

 Giovanni

On 12/28/23 15:45, Jimmy wrote:

The pattern has successfully passed the test script, but it needs to check 
whether Bayes learning will identify and possibly exclude the word from 
matching this pattern.

Thank you.









Re: Bayes Stopword

2023-12-28 Thread Jimmy
The pattern has successfully passed the test script, but I still need to check
whether Bayes learning will actually identify and exclude the words that
match this pattern.

Thank you.


On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:

> On 12/28/23 12:59, Jimmy wrote:
> > Hi,
> >
> > I'm seeking assistance in incorporating a stopword for Asian languages
> in Unicode. Although I possess comprehensive word lists, my attempts to
> generate a regex pattern and test it have been unsuccessful; the pattern
> fails to match or skips tokens in the newly added stopword list.
> >
> > I created the regex pattern using the following code:
> >
> > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >
> > Afterward, I converted it to UTF-8 hex.
> >
> > I'm wondering if there are any tools available to facilitate the
> creation of these regex patterns.
> >
> I have used Regexp::Trie to create Bayes stopwords in the past, code is
> similar to:
>
> ---
> use strict;
> use warnings;
>
> use Encode;
> use Regexp::Trie;
>
> # filehandle lost in the archived copy; STDIN assumed
> my @input = <STDIN>;
> my $rt = Regexp::Trie->new;
> for my $w ( @input ) {
>   chomp($w);
>   $rt->add($w);
> }
> my $regexp = $rt->regexp;
> my @reg = split //, $regexp;
> for my $c ( @reg ) {
>   my $char = $c;
>   my $test;
>   eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
>   if( $@ ) {
>     print 'x' . sprintf("%x", ord($c));
>   } else {
>     print $char;
>   }
> }
>
> ---
>
>   Giovanni
>


Re: Bayes Stopword

2023-12-28 Thread giovanni

On 12/28/23 12:59, Jimmy wrote:

Hi,

I'm seeking assistance in incorporating a stopword for Asian languages in 
Unicode. Although I possess comprehensive word lists, my attempts to generate a 
regex pattern and test it have been unsuccessful; the pattern fails to match or 
skips tokens in the newly added stopword list.

I created the regex pattern using the following code:

Regexp::Assemble->new->add(@words)->reduce(0)->as_string

Afterward, I converted it to UTF-8 hex.

I'm wondering if there are any tools available to facilitate the creation of 
these regex patterns.


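I have used Regexp::Trie to create Bayes stopwords in the past; the code is
similar to:
---
use strict;
use warnings;

use Encode;
use Regexp::Trie;

# Read the word list, one word per line. The filehandle was lost in the
# archived copy of this message; STDIN is assumed here.
my @input = <STDIN>;
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
  chomp($w);
  $rt->add($w);
}
my $regexp = $rt->regexp;
# Walk the pattern one character at a time.
my @reg = split //, $regexp;
for my $c ( @reg ) {
  my $char = $c;   # keep a copy: decode() with a CHECK value may modify $c
  my $test;
  eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
  if( $@ ) {
    # not valid UTF-8 on its own: print its hex value instead
    print 'x' . sprintf("%x", ord($c));
  } else {
    print $char;
  }
}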
---

 Giovanni




Bayes Stopword

2023-12-28 Thread Jimmy
Hi,

I'm seeking assistance in adding a stopword list for Asian languages in
Unicode. Although I have comprehensive word lists, my attempts to
generate a regex pattern and test it have been unsuccessful; the pattern
fails to match, so tokens in the newly added stopword list are not skipped.

I created the regex pattern using the following code:

Regexp::Assemble->new->add(@words)->reduce(0)->as_string

Afterward, I converted it to UTF-8 hex.

I'm wondering if there are any tools available to facilitate the creation
of these regex patterns.

Thank you,
Jimmy
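
The "UTF-8 hex" step can be checked by hand for a single word. This is only a sketch: it assumes the stopword pattern wants every byte written as a \xNN escape, and the function name to_hex_escapes is invented for this example.

```shell
# Turn every byte read from stdin into a \xNN escape.
# Works on bytes, not code points, so a three-byte UTF-8 character
# becomes three escapes.
to_hex_escapes() {
  od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}
```

In a UTF-8 terminal, `printf '%s' 'บาท' | to_hex_escapes` prints `\xe0\xb8\x9a\xe0\xb8\xb2\xe0\xb8\x97`. ASCII characters are escaped too, so this suits checking individual words rather than converting a whole regexp.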


Re: Bayes always reject.

2023-12-13 Thread Jeff Mincy
 > From: Pierluigi Frullani 
 > Date: Wed, 13 Dec 2023 07:49:24 +0100
 > 
 > Hello all,
 >  I'm facing a strange problem.

...
 > tests=BAYES_95,MISSING_DATE,MISSING_HEADERS,NO_RECEIVED,NO_RELAYS,T_SCC_BODY_TEXT_LINE

How did you feed this message into SpamAssassin?
Did you do something to strip off all of the email headers?

For the BAYES_99, as already mentioned you probably need to retrain
bayes, making sure to correct any incorrectly trained email messages.

-jeff


Re: Bayes always reject.

2023-12-13 Thread Bill Cole

On 2023-12-13 at 01:49:24 UTC-0500 (Wed, 13 Dec 2023 07:49:24 +0100)
Pierluigi Frullani 
is rumored to have said:


> Hello all,
>  I'm facing a strange problem.

Not really. MANY people run into this issue...

> I've feed the bayes db for a while and now I would like to put it in use
> but all messages get a BAYES_99 and very high spam point.
> I would like to understand why, and troubleshoot this problem but I can't
> find a way.


The only reasons that can happen are:

1. All of your mail is in fact spam.
2. Your Bayes DB is mis-trained.

The fix (assuming #2) is to recreate the Bayes DB with proper training.

*IN THEORY* one could fix a corrupted DB by 'unlearning' messages which 
learned incorrectly, but as a practical matter that's usually a fantasy.


Most of the scanning and DB details that you included are not useful. 
You cannot fix the bad DB, you need to rebuild it.
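
A rebuild along these lines might look like the sketch below. The corpus directories are hypothetical placeholders, and the run and rebuild_bayes helpers are made up for illustration; only the sa-learn options themselves (--clear, --ham, --spam, --dump magic) are standard.

```shell
# Hypothetical corpus locations; point these at hand-verified mail.
HAM_DIR="${HAM_DIR:-$HOME/corpus/ham}"
SPAM_DIR="${SPAM_DIR:-$HOME/corpus/spam}"

# With DRY_RUN set, print each command instead of running it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

rebuild_bayes() {
  run sa-learn --clear              # wipe the existing Bayes database
  run sa-learn --ham "$HAM_DIR"     # learn verified ham
  run sa-learn --spam "$SPAM_DIR"   # learn verified spam
  run sa-learn --dump magic         # show the resulting nspam/nham counts
}
```

Setting `DRY_RUN=1` before calling `rebuild_bayes` prints the commands without touching the database. For a site-wide database the commands would have to run as the user that owns it, and `sa-learn --backup > bayes-backup.txt` beforehand keeps a copy of the old data.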




--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Bayes always reject.

2023-12-12 Thread Pierluigi Frullani
Hello all,
 I'm facing a strange problem.
I've fed the Bayes DB for a while and now I would like to put it to use,
but all messages get BAYES_99 and a very high spam score.
I would like to understand why and troubleshoot this problem, but I can't
find a way.
Spamassassin version is:
root@puma:~# spamassassin --version
SpamAssassin version 3.4.6
  running on Perl version 5.22.2
This is the sa-learn --dump magic:
root@puma:~# sa-learn --dump magic
0.000  0  3  0  non-token data: bayes db version
0.000  0 130610  0  non-token data: nspam
0.000  0 316040  0  non-token data: nham
0.000  0 136493  0  non-token data: ntokens
0.000  0 1695915149  0  non-token data: oldest atime
0.000  0 1702447561  0  non-token data: newest atime
0.000  0 1702449197  0  non-token data: last journal sync atime
0.000  0 1701476495  0  non-token data: last expiry atime
0.000  0 5529600  0  non-token data: last expire atime delta
0.000  0  34998  0  non-token data: last expire reduction count
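
The counters that matter when judging a Bayes DB can be pulled out of this dump with a small filter. This is a sketch that assumes the column layout shown above (value in the third field, label starting at the fifth); bayes_counts is a made-up name.

```shell
# Extract nspam, nham and ntokens from `sa-learn --dump magic` output.
# Reads the dump on stdin and prints name=value lines.
bayes_counts() {
  awk '$5 == "non-token" && ($7 == "nspam" || $7 == "nham" || $7 == "ntokens") {
         print $7 "=" $3
       }'
}
```

Run as `sa-learn --dump magic | bayes_counts`; for the dump above it would report nspam=130610, nham=316040 and ntokens=136493.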
and this is the spamassassin --lint -D:
root@puma:~# spamassassin -D --lint  2>&1 | grep -i bay
Dec 13 07:39:07.885 [26545] dbg: plugin: loading
Mail::SpamAssassin::Plugin::Bayes from @INC
Dec 13 07:39:08.005 [26545] dbg: config: fixed relative path:
/var/lib/spamassassin/3.004006/updates_spamassassin_org/23_bayes.cf
Dec 13 07:39:08.005 [26545] dbg: config: using
"/var/lib/spamassassin/3.004006/updates_spamassassin_org/23_bayes.cf" for
included file
Dec 13 07:39:08.005 [26545] dbg: config: read file
/var/lib/spamassassin/3.004006/updates_spamassassin_org/23_bayes.cf
Dec 13 07:39:08.047 [26545] dbg: config: fixed relative path:
/var/lib/spamassassin/3.004006/updates_spamassassin_org/
60_bayes_stopwords.cf
Dec 13 07:39:08.047 [26545] dbg: config: using
"/var/lib/spamassassin/3.004006/updates_spamassassin_org/
60_bayes_stopwords.cf" for included file
Dec 13 07:39:08.047 [26545] dbg: config: read file
/var/lib/spamassassin/3.004006/updates_spamassassin_org/
60_bayes_stopwords.cf
Dec 13 07:39:08.292 [26545] dbg: shortcircuit: adding BAYES_99 using
abbreviation spam
Dec 13 07:39:08.292 [26545] dbg: shortcircuit: adding BAYES_00 using
abbreviation ham
Dec 13 07:39:08.586 [26545] dbg: plugin:
Mail::SpamAssassin::Plugin::Bayes=HASH(0x5cca570) implements 'learner_new',
priority 0
Dec 13 07:39:08.586 [26545] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x5cca570),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
Dec 13 07:39:08.594 [26545] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x6a51bb0)
Dec 13 07:39:08.594 [26545] dbg: plugin:
Mail::SpamAssassin::Plugin::Bayes=HASH(0x5cca570) implements
'learner_is_scan_available', priority 0
Dec 13 07:39:08.595 [26545] dbg: bayes: tie-ing to DB file R/O
/var/spamassasin/bayes_toks
Dec 13 07:39:08.595 [26545] dbg: bayes: tie-ing to DB file R/O
/var/spamassasin/bayes_seen
Dec 13 07:39:08.595 [26545] dbg: bayes: found bayes db version 3
Dec 13 07:39:08.595 [26545] dbg: bayes: DB journal sync: last sync:
1702449197
Dec 13 07:39:08.621 [26545] dbg: bayes: DB journal sync: last sync:
1702449197
Dec 13 07:39:08.621 [26545] dbg: bayes: corpus size: nspam = 130610, nham =
316040
Dec 13 07:39:08.622 [26545] dbg: bayes: tokenized body: 120 tokens
Dec 13 07:39:08.622 [26545] dbg: bayes: tokenized uri: 0 tokens
Dec 13 07:39:08.622 [26545] dbg: bayes: tokenized invisible: 0 tokens
Dec 13 07:39:08.623 [26545] dbg: bayes: tokenized header: 14 tokens
Dec 13 07:39:08.623 [26545] dbg: bayes: score = 0.976034467829266
Dec 13 07:39:08.624 [26545] dbg: bayes: DB expiry: tokens in DB: 136493,
Expiry max size: 15, Oldest atime: 1695915149, Newest atime:
1702447561, Last expire: 1701476495, Current time: 1702449548
Dec 13 07:39:08.624 [26545] dbg: bayes: DB journal sync: last sync:
1702449197
Dec 13 07:39:08.624 [26545] dbg: bayes: untie-ing
Dec 13 07:39:08.624 [26545] dbg: check: tagrun - tag BAYESTCHAMMY is now
ready, value: 0
Dec 13 07:39:08.624 [26545] dbg: check: tagrun - tag BAYESTCSPAMMY is now
ready, value: 2
Dec 13 07:39:08.624 [26545] dbg: check: tagrun - tag BAYESTCLEARNED is now
ready, value: 4
Dec 13 07:39:08.624 [26545] dbg: check: tagrun - tag BAYESTC is now ready,
value: 20
Dec 13 07:39:08.628 [26545] dbg: rules: ran eval rule BAYES_95 ==> got
hit (1)
Dec 13 07:39:08.863 [26545] dbg: check:
tests=BAYES_95,MISSING_DATE,MISSING_HEADERS,NO_RECEIVED,NO_RELAYS,T_SCC_BODY_TEXT_LINE
Dec 13 07:39:08.864 [26545] dbg: timing: total 1004 ms - init: 738 (73.5%),
parse: 0.85 (0.1%), extract_message_metadata: 1.10 (0.1%),
get_uri_detail_list: 3.9 (0.4%), tests_pri_-2000: 4.3 (0.4%), compile_gen:
85 (8.5%), compile_eval: 13 (1.3%), tests_pri_-1000: 3.6 (0.4%),
tests_pri_-950: 2.8 (0.3%), tests_pri_-900: 4.2 (0.4%), tests_

Re: Share bayes database between servers

2023-07-09 Thread Matija Nalis
On Sun, Jul 09, 2023 at 07:06:10PM +0200, Robert Senger wrote:
> I've set up a testing environment that also uses master-master
> replication of the mysql bayes database, with priority in dns set to
> equal for both mx to get incoming mail distributed evenly to both
> systems. So far, this seems to work, but this is a low load
> environment.

It boils down to how much you trust the stability and performance of
MySQL master-master replication, which depends heavily on your
experience and the exact versions used (are we talking about Oracle
MySQL, or the MariaDB or Percona forks? Which versions? What replication
setup? etc.).

I've had problems under high concurrent load (not performance, but
replication setup breaking) in the past, so I prefer to avoid
master-master replication if possible, especially if I anticipate
high concurrent load.

But if you are confident in it, sure, go ahead.

> Any suggestions?

Well, how are you training your Bayes DB? If it is via cron and
manually curated ham/spam corpora (the recommended way), I'd rather
suggest keeping the databases separate and simply running training on
both servers (you can duplicate or share the ham/spam corpora as you wish,
using anything from rsync to SMB/NFS).

If you are using auto-learn (which was not recommended last time I
looked), you'd probably be better off NOT syncing Bayes at all,
IMHO: it is preferable to confine the risk of Bayes poisoning to
one server instead of replicating it. There is also not much benefit,
as auto-learn will quickly learn on each server separately anyway, and
if one set of domains is not getting some type of spam, there is little
point in learning it there.

-- 
Opinions above are GNU-copylefted.


Re: Share bayes database between servers

2023-07-09 Thread Robert Senger
On Sunday, 09.07.2023 at 19:21 +0200, Reindl Harald wrote:
> 
> 
> On 09.07.23 at 19:06, Robert Senger wrote:
> > But bayes data may be updated by either the primary mx or the backup
> > mx, since email may arrive at either server.
> 
> in a smart setup your bayes-database is read-only like here since
> 2014, 
> any autolearning disabled and strictly trained manually by a stored 
> corpus giving you the opportunity to remove and add messages to the 
> training folders and rebuild from scratch
> 
> we share our bayes-db even with a different company since 2014

Well, that's the boring solution... ;) Nevertheless, this is what I
will likely do if I encounter any problems with the mysql master-master
replication as I have it running now.

Robert

-- 
Robert Senger





Share bayes database between servers

2023-07-09 Thread Robert Senger
Hi there,

I am running two mailservers, first one serving two domains, other one
serving one domain.

Both serve as backup mx for each other. Both know about users and
aliases of the other domain(s).

On both systems, spamassassin is configured to read/store userprefs and
bayes data (per user) in a local mysql database.

Both systems reject email if the score exceeds a certain limit. To
avoid backscatter (or the need to accept any spam not rejected by the
backup mx), both servers should do their spam filtering based on
exactly the same information, including bayes data.

Now, the question is, what is the best way to share bayes data between
two (or more) servers?

I already share userprefs by setting up master-master replication
between the two mysql databases on both servers. This is uncritical,
since users (or admins) will update only userprefs for the local
virtual users on each system, which means, backup mx will never touch
primary mx userprefs.

But bayes data may be updated by either the primary mx or the backup
mx, since email may arrive at either server. 

I've set up a testing environment that also uses master-master
replication of the mysql bayes database, with priority in dns set to
equal for both mx to get incoming mail distributed evenly to both
systems. So far, this seems to work, but this is a low load
environment.

Any suggestions?

Regards,

Robert


-- 
Robert Senger





Re: BAYES scores

2023-03-01 Thread Benny Pedersen

joe a wrote on 2023-02-28 17:37:

Curious as to why these scores, apparently "stock" are what they are.
I'd expect BAYES_999 BODY to count more than BAYES_99 BODY.

Noted in a header this morning:

*  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
*  [score: 1.]
*  0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%
*  [score: 1.]

Was this discussed recently?  I added a local score to mollify my
sense of propriety.


What does it solve for you?

Maybe it could be changed so that the scores do not overlap, but what 
should the scores change to?






Re: BAYES scores

2023-02-28 Thread Loren Wilton

From: "Bill Cole" 

It is my understanding that an automated rescoring job was run quite some 
time ago (before I was on the PMC) to generate the Bayes scores, which 
determined that to be the best supplemental score to give to the greater 
certainty.


I was around in those days. My memory isn't the greatest anymore, but what I 
recall was that they did automatic rescoring, and then manually tweaked a 
few of the values, basically to make them look pretty by rounding off long 
fractions. BAYES_999 may have been scored almost completely manually, I 
can't quite recall.


   Loren



Re: BAYES scores

2023-02-28 Thread Benny Pedersen

joe a wrote on 2023-02-28 17:37:

Curious as to why these scores, apparently "stock" are what they are.
I'd expect BAYES_999 BODY to count more than BAYES_99 BODY.

Noted in a header this morning:

*  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
*  [score: 1.]
*  0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%
*  [score: 1.]

Was this discussed recently?  I added a local score to mollify my
sense of propriety.


What does it solve for you?

Maybe it could be changed so that the scores do not overlap, but what 
should the scores change to?


The tags can be split so that the hits do not overlap, but what should 
the scores then change to?








Re: BAYES scores

2023-02-28 Thread Bill Cole

On 2023-02-28 at 13:38:35 UTC-0500 (Tue, 28 Feb 2023 13:38:35 -0500)
joe a 
is rumored to have said:


On 2/28/2023 12:05 PM, Jeff Mincy wrote:

  > From: joe a 
  > Date: Tue, 28 Feb 2023 11:37:34 -0500
  >
  > Curious as to why these scores, apparently "stock" are what they are.
  > I'd expect BAYES_999 BODY to count more than BAYES_99 BODY.
  >
  > Noted in a header this morning:
  >
  > *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
  > *  [score: 1.]
  > *  0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%
  > *  [score: 1.]
  >
  > Was this discussed recently?  I added a local score to mollify my sense
  > of propriety.

Those two rules overlap.   A message with bayes >= 99.9% hits both
rules.   BAYES_99 ends at 1.00 not .999.
-jeff



I get that they overlap.  I guess my thinker gets in a knot wondering 
why there is so little weight given to the more certain determination.


It is my understanding that an automated rescoring job was run quite 
some time ago (before I was on the PMC) to generate the Bayes scores, 
which determined that to be the best supplemental score to give to the 
greater certainty. Bayes rules are not rescored routinely in the daily 
rescoring task because those hits are inherently different at every 
site. If you wish to determine the ideal scores for YOUR mix of ham and 
spam, I believe all the tools for doing so are in the SA code tree, but 
they may not be well-documented.


That's likely to not be a satisfying answer, but as a volunteer project 
we have no funding for Customer Satisfaction, so the bare unsatisfying 
truth will have to do.


In my narrow view, anything that is 99.9% certain is probably worth a 
5 on its own.  Or, at least, should equal 5 when summed with BAYES_99, 
as that is what the default "SPAM flag" is.


Appears more experienced or thoughtful persons think otherwise.


I don't know that I'd go that far. Rescoring is not done based on simple 
clear reason, but on numbers. I'm not sure whether any currently active 
SA developers are able to explain exactly how the rescoring works.


Yes, it did snow heavily overnight.  Yes, I am looking for excuses not 
to visit that issue.


I vehemently recommend reading all of Justin's scripts and documentation 
(I think it's all in the 'build' sub-directory) and figuring out how to 
rescore based on your own mail. That's MUCH less unpleasant than dealing 
with the snow.



--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Re: BAYES scores

2023-02-28 Thread hg user
From my limited experience... I score BAYES_999 at 2.00; that was
suggested to me months ago.

But nowadays I'd be more careful and do some more testing: I'd check which
messages hit only BAYES_99 and which hit BAYES_999. If you are
absolutely certain that BAYES_999 hits are only and definitely spam, go with 2
or more; if you have several false positives, keep the score low.

I learnt the hard way that BAYES depends on the corpus used to grow the
database.

On Tue, Feb 28, 2023 at 7:39 PM joe a  wrote:

> On 2/28/2023 12:05 PM, Jeff Mincy wrote:
> >   > From: joe a 
> >   > Date: Tue, 28 Feb 2023 11:37:34 -0500
> >   >
> >   > Curious as to why these scores, apparently "stock" are what they are.
> >   > I'd expect BAYES_999 BODY to count more than BAYES_99 BODY.
> >   >
> >   > Noted in a header this morning:
> >   >
> >   > *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
> >   > *  [score: 1.]
> >   > *  0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%
> >   > *  [score: 1.]
> >   >
> >   > Was this discussed recently?  I added a local score to mollify my sense
> >   > of propriety.
> >
> > Those two rules overlap.   A message with bayes >= 99.9% hits both
> > rules.   BAYES_99 ends at 1.00 not .999.
> > -jeff
> >
>
> I get that they overlap.  I guess my thinker gets in a knot wondering
> why there is so little weight given to the more certain determination.
>
> In my narrow view, anything that is 99.9% certain is probably worth a 5
> on its own.  Or, at least, should equal 5 when summed with BAYES_99,
> as that is what the default "SPAM flag" is.
>
> Appears more experienced or thoughtful persons think otherwise.
>
> Yes, it did snow heavily overnight.  Yes, I am looking for excuses not
> to visit that issue.
>


Re: BAYES scores

2023-02-28 Thread joe a

On 2/28/2023 12:05 PM, Jeff Mincy wrote:

  > From: joe a 
  > Date: Tue, 28 Feb 2023 11:37:34 -0500
  >
  > Curious as to why these scores, apparently "stock" are what they are.
  > I'd expect BAYES_999 BODY to count more than BAYES_99 BODY.
  >
  > Noted in a header this morning:
  >
  > *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
  > *  [score: 1.]
  > *  0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%
  > *  [score: 1.]
  >
  > Was this discussed recently?  I added a local score to mollify my sense
  > of propriety.

Those two rules overlap.   A message with bayes >= 99.9% hits both
rules.   BAYES_99 ends at 1.00 not .999.
-jeff



I get that they overlap.  I guess my thinker gets in a knot wondering 
why there is so little weight given to the more certain determination.


In my narrow view, anything that is 99.9% certain is probably worth a 5 
on its own.  Or, at least, should equal 5 when summed with BAYES_99, 
as that is what the default "SPAM flag" is.


Appears more experienced or thoughtful persons think otherwise.

Yes, it did snow heavily overnight.  Yes, I am looking for excuses not 
to visit that issue.


Re: BAYES scores

2023-02-28 Thread Jeff Mincy
 > From: joe a 
 > Date: Tue, 28 Feb 2023 11:37:34 -0500
 > 
 > Curious as to why these scores, apparently "stock" are what they are. 
 > I'd expect BAYES_999 BODY to count more than BAYES_99 BODY.
 > 
 > Noted in a header this morning:
 > 
 > *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
 > *      [score: 1.]
 > *  0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%
 > *  [score: 1.]
 > 
 > Was this discussed recently?  I added a local score to mollify my sense 
 > of propriety.

Those two rules overlap.   A message with bayes >= 99.9% hits both
rules.   BAYES_99 ends at 1.00 not .999.
-jeff
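
The overlap described above can be made concrete with a small sketch. It is
illustrative only: the two rule ranges and the 3.5 / 0.2 scores are taken from
the header quoted in this thread, while RULES, bayes_hits and bayes_total are
hypothetical names, and treating both ends of each range as inclusive is an
assumption rather than a statement about how SpamAssassin compares the bounds.

```python
# Illustrative only: ranges and scores are copied from the header quoted in
# this thread; nothing here is read from a real SpamAssassin installation.
RULES = [
    # (rule name, lower bound, upper bound, score) -- bounds assumed inclusive
    ("BAYES_99", 0.99, 1.00, 3.5),
    ("BAYES_999", 0.999, 1.00, 0.2),
]


def bayes_hits(probability):
    """Return the (name, score) pairs whose range contains the probability."""
    return [(name, score) for name, low, high, score in RULES
            if low <= probability <= high]


def bayes_total(probability):
    """Sum the scores of every rule that hits (rounded to hide float noise)."""
    return round(sum(score for _, score in bayes_hits(probability)), 3)


if __name__ == "__main__":
    for p in (0.995, 0.9995, 1.0):
        print(p, bayes_hits(p), bayes_total(p))
```

With these numbers a message at 99.95% collects 3.5 + 0.2 = 3.7 points from
Bayes, so BAYES_999 behaves as a small supplement on top of BAYES_99 rather
than as a replacement for it.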



BAYES scores

2023-02-28 Thread joe a
Curious as to why these scores, apparently "stock" are what they are. 
I'd expect BAYES_999 BODY to count more than BAYES_99 BODY.


Noted in a header this morning:

*  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
*  [score: 1.]
*  0.2 BAYES_999 BODY: Bayes spam probability is 99.9 to 100%
*  [score: 1.]

Was this discussed recently?  I added a local score to mollify my sense 
of propriety.





Re: Strange findings debugging bayes results

2023-02-21 Thread Michael Grant via users
On Mon, Feb 20, 2023 at 01:30:15PM -0800, Loren Wilton wrote:
> This is a home system with only a few users. All users have "Spam" and "Ham"
> folders showing up in their email program of choice, and they just drag
> messages they do or don't like into the appropriate folders. There are "Oldham"
> and "Oldspam" mboxes, and the new spam and ham (respectively) get merged into
> these folders after learning, and removed from the current Spam and Ham
> folders.

I had a similar idea but never implemented it because I felt it was
too difficult for users to deal with.  I was considering 2 folders:
'Spam Training Set' and 'Ham Training Set' which would always
represent the set of messages that Spamassassin was currently trained
with.  If you changed the contents of these mboxes, a cron job would
delete the old bayes tokens and retrain with the current set.

The difference between these folders and the Spam folder (or Junk or
whatever you call it locally) is that messages older than 30 days get
auto-deleted.  After 30 days, those messages would no longer represent
the training set.

Having 2 spam folders is confusing and not easy to manage.

Neither of these 2 extra folders is one where users would look for
messages, so they really do have to copy messages into them, which isn't
just dragging them.  That for me was the main issue I faced.

So I abandoned this line of thinking.

You mentioned harvesting ham and spam from mboxes as in from the inbox
directly.  This got me wondering more about this.

Clearly, messages that the user dragged to Spam but that
SpamAssassin did not mark as spam can be used to train as spam.  Easy.

And use messages that the user left in their mailbox or deleted or
archived as ham.  Could be ok but less sure.

And lastly, messages that were in Spam (since SpamAssassin marked them
as spam) that a user moved out of Spam.  Just look through all their
folders (except Spam) for messages that SpamAssassin marked as spam
and retrain on those as ham.  Again, maybe a bad assumption, could
work though.

I was really just curious to know if other people had workable ideas
how to get bayes trained with the least amount of friction.




Re: Strange findings debugging bayes results

2023-02-20 Thread Loren Wilton

From: "Reindl Harald" 
in other words a system for morons - morons which will drag mails to spam 
instead click on "unsubscribe"


per-user bayes don't work well, never


Well Harald, you are certainly welcome to your opinion. It would be nicer if 
you had kept it to yourself though.
The system works just fine with the userbase it has. It probably wouldn't 
work for AOL or *.online.




Re: Strange findings debugging bayes results

2023-02-20 Thread Loren Wilton
This is a home system with only a few users. All users have "Spam" and "Ham" 
folders showing up in their email program of choice, and they just drag 
messages they do or don't like into the appropriate folders. There are "Oldham" 
and "Oldspam" mboxes, and the new spam and ham (respectively) get merged into 
these folders after learning, and removed from the current Spam and Ham folders.
  - Original Message - 
  From: Michael Grant 
  To: users@spamassassin.apache.org ; Loren Wilton ; hg user 
  Sent: Monday, February 20, 2023 12:47 PM
  Subject: Re: Strange findings debugging bayes results


  On 20 February 2023 12:28:00 CET, Loren Wilton  wrote:
  >
  > A cron job that will harvest Spam and Ham mboxes and feed them to sa-learn
  > once a day, then archive the learned messages. Per-user bayes and learning.
  > Mail is hand-moved into the spam and ham learning folders, and for my
  > personal account, I do this rarely, generally only when a message is
  > mis-categorized. Although messages being mis-categorized as spam is often the
  > result of a lot of quite aggressive local rules I have rather than a Bayes
  > mis-classification.

  When you "harvest" ham from mboxes, what do you consider ham?

  You also, additionally, have a Ham folder for your users then? Interesting.
  Did you manage to train your users to use it easily? Does it grow unbounded or
  are old messages removed from it?  If so, how do you know when they can be
  deleted, as with the Spam folder?

  It's an interesting idea, just wondering about the details.  Getting my users
  to train SpamAssassin has always been impossible for me.

Re: Strange findings debugging bayes results

2023-02-20 Thread Michael Grant via users
On 20 February 2023 12:28:00 CET, Loren Wilton  wrote:
>
> A cron job that will harvest Spam and Ham mboxes and feed them to sa-learn 
> once a day, then archive the learned messages. Per-user bayes and learning. 
> Mail is hand-moved into the spam and ham learning folders, and for my  
> personal account, I do this rarely, generally only when a message is 
> mis-categorized. Although messages being mis-categorized as spam is often the 
> result of a lot of quite aggressive local rules I have rather than a Bayes 
> mis-classification.

When you "harvest" ham from mboxes, what do you consider ham?

You also, additionally, have a Ham folder for your users then? Interesting. Did 
you manage to train your users to use it easily? Does it grow unbounded or are 
old messages removed from it?  If so, how do you know when they can be deleted, 
as with the Spam folder?

It's an interesting idea, just wondering about the details.  Getting my users 
to train SpamAssassin has always been impossible for me.

Re: Strange findings debugging bayes results

2023-02-20 Thread Loren Wilton
> Can you please give me some details on your bayes setup? 
> Headers exclusion, bayes_token_sources, how do you "sa-learn" messages...

Standard options on Bayes. No autolearn. A cron job that will harvest Spam and 
Ham mboxes and feed them to sa-learn once a day, then archive the learned 
messages. Per-user bayes and learning. Mail is hand-moved into the spam and ham 
learning folders, and for my  personal account, I do this rarely, generally 
only when a message is mis-categorized. Although messages being mis-categorized 
as spam is often the result of a lot of quite aggressive local rules I have 
rather than a Bayes mis-classification.
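
A minimal sketch of the daily harvest described above, under stated
assumptions: the helper names (sa_learn_command, archive_mbox,
learn_and_archive) and any paths passed to them are hypothetical, only the
sa-learn options --mbox, --spam and --ham are relied on, and no mbox locking
is attempted, which a real setup would need while a mail client may be writing
to the folder.

```python
import shutil
import subprocess
from pathlib import Path


def sa_learn_command(kind, mbox_path):
    """Build the sa-learn command line for one mbox; kind is 'spam' or 'ham'."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--mbox", f"--{kind}", str(mbox_path)]


def archive_mbox(current, archive):
    """Append the learned mbox to the archive mbox, then empty the current one."""
    current, archive = Path(current), Path(archive)
    with current.open("rb") as src, archive.open("ab") as dst:
        shutil.copyfileobj(src, dst)
    current.write_bytes(b"")


def learn_and_archive(kind, current, archive):
    """Learn one mbox and archive it only if sa-learn exits successfully."""
    subprocess.run(sa_learn_command(kind, current), check=True)
    archive_mbox(current, archive)
```

Run once a day per user, for example as learn_and_archive("spam", "Spam",
"Oldspam") followed by learn_and_archive("ham", "Ham", "Oldham"). Archiving
only after a successful sa-learn run keeps unlearned mail in the training
folder for the next attempt.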


Re: Strange findings debugging bayes results

2023-02-19 Thread hg user
Can you please give me some details on your bayes setup? Headers
exclusion, bayes_token_sources, how do you "sa-learn" messages...

thank you

On Sun, Feb 19, 2023 at 11:53 PM Loren Wilton  wrote:

> > The real question is: has bayes still its use case in 2023 ? Is it still
> > used with important scores or just to flag messages for a review?
>
> It works fine for me here.
>
>


Re: Strange findings debugging bayes results

2023-02-19 Thread Loren Wilton
> The real question is: has bayes still its use case in 2023 ? Is it still used 
> with important scores or just to flag messages for a review?

It works fine for me here.


Re: Strange findings debugging bayes results

2023-02-19 Thread hg user
>
>
> bayes_token_sources none visible uri mimepart
>

I added this line to my config with no changes in the tokens used to sum
the bayes score, headers still used. It may be a command only recognized
during learning but I should check the sources.


> perhaps OP has bayes_token_sources setting that takes only headers
> into the account?
>

No. That mail had very few words in the text, and the bayes
system probably considered them not relevant.

The real question is: has bayes still its use case in 2023 ? Is it still
used with important scores or just to flag messages for a review?


Re: Strange findings debugging bayes results

2023-02-16 Thread Matija Nalis
On Thu, Feb 16, 2023 at 01:02:25PM +0200, Henrik K wrote:
> On Thu, Feb 16, 2023 at 10:18:50AM +0100, hg user wrote:
> > Every score is based on headers, very generic headers. and some
> > related to my setup.
> > 
> > Not a single token from the message body
> 
> The Bayes implementation has been practically unmaintained for a long time,
> so YMMV.
> 
> You can try something like this, most headers are parsed badly and generate
> biasing random garbage (unscientific observation):
> 
> bayes_ignore_header ARC-Authentication-Results
> bayes_ignore_header ARC-Message-Signature

Yeah, bayes on headers (and CSS/HTML stuff) has caused me far
more misclassifications than it has done good, so I eventually gave up on
updating the ever-growing bayes_ignore_header list and disabled bayes on
the headers altogether:

bayes_token_sources none visible uri mimepart

My stance being: if the end user would not be classifying on those sources
(except the Subject header), neither should automatic bayes classification...

Perhaps the OP has a bayes_token_sources setting that takes only headers
into account?

https://man.archlinux.org/man/Mail::SpamAssassin::Conf.3pm.en#bayes_token_sources

-- 
Opinions above are GNU-copylefted.


Re: Strange findings debugging bayes results

2023-02-16 Thread Axb

I've updated 23_bayes_ignore_header.cf
(last update was from 2016 :)

https://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf

Axb

On 2/16/23 14:17, Dave Wreski wrote:
Here's also another 50+ headers we've collected over the years that I 
believe started as a list from AXB 10+ years ago.


https://pastebin.com/raw/f6Fwh8HJ





Re: Strange findings debugging bayes results

2023-02-16 Thread Dave Wreski

Hi,

Here's also another 50+ headers we've collected over the years that I 
believe started as a list from AXB 10+ years ago.


https://pastebin.com/raw/f6Fwh8HJ

dave

On 2/16/23 6:02 AM, Henrik K wrote:

On Thu, Feb 16, 2023 at 10:18:50AM +0100, hg user wrote:

I was investigating a bunch of bitcoin spam: different titles,
different senders (all from gmail), different text, different pdf
attachment.

Unfortunately in those days my bayes db was polluted and they all got
a BAYES_50, 0.8.

I tested the messages now with a recreated bayes db and got some
BAYES_999. So I dug to understand if I already saw the spam...

But the debug result was unpleasant:
dbg: bayes: tokenized header: 92 tokens
dbg: bayes: token 'HX-Received:Jan' => 0.998028449502134
dbg: bayes: token 'HX-Google-DKIM-Signature:20210112' => 0.997244532803181
dbg: bayes: token 'H*r:sk:' => 0.997244532803181
dbg: bayes: token 'H*r:a05' => 0.995425742574258
dbg: bayes: token 'HAuthentication-Results:sk:.' => 0.986543689320388
dbg: bayes: token 'HX-Google-DKIM-Signature:reply-to' => 0.916110175863517
dbg: bayes: token 'H*r:2002' => 0.877842810325844
dbg: bayes: token 'HAuthentication-Results:2048-bit' => 0.858520043212023
dbg: bayes: token 'HAuthentication-Results:pass' => 0.855193895034317
dbg: bayes: score = 0.97915091326


Every score is based on headers, very generic headers. and some
related to my setup.

Not a single token from the message body

The Bayes implementation has been practically unmaintained for a long time,
so YMMV.

You can try something like this, most headers are parsed badly and generate
biasing random garbage (unscientific observation):

bayes_ignore_header ARC-Authentication-Results
bayes_ignore_header ARC-Message-Signature
bayes_ignore_header ARC-Seal
bayes_ignore_header Authentication-Results
bayes_ignore_header Autocrypt
bayes_ignore_header IronPort-SDR
bayes_ignore_header suggested_attachment_session_id
bayes_ignore_header X-Brightmail-Tracker
bayes_ignore_header X-Exchange-Antispam-Report-CFA-Test
bayes_ignore_header X-Forefront-Antispam-Report
bayes_ignore_header X-Forefront-Antispam-Report-Untrusted
bayes_ignore_header X-Gm-Message-State
bayes_ignore_header X-Google-DKIM-Signature
bayes_ignore_header x-microsoft-antispam
bayes_ignore_header X-Microsoft-Antispam-Message-Info
bayes_ignore_header X-Microsoft-Antispam-Message-Info-Original
bayes_ignore_header X-Microsoft-Antispam-Untrusted
bayes_ignore_header X-Microsoft-Exchange-Diagnostics
bayes_ignore_header x-ms-exchange-antispam-messagedata
bayes_ignore_header x-ms-exchange-antispam-messagedata-0
bayes_ignore_header x-ms-exchange-crosstenant-id
bayes_ignore_header x-ms-exchange-crosstenant-network-message-id
bayes_ignore_header x-ms-exchange-crosstenant-rms-persistedconsumerorg
bayes_ignore_header X-MS-Exchange-CrossTenant-userprincipalname
bayes_ignore_header x-ms-exchange-slblob-mailprops
bayes_ignore_header x-ms-office365-filtering-correlation-id
bayes_ignore_header X-MSFBL
bayes_ignore_header X-Provags-ID
bayes_ignore_header X-SG-EID
bayes_ignore_header X-SG-ID
bayes_ignore_header X-UI-Out-Filterresults
bayes_ignore_header X-ClientProxiedBy
bayes_ignore_header X-MS-Exchange-CrossTenant-FromEntityHeader
bayes_ignore_header X-OriginatorOrg
bayes_ignore_header X-MS-Exchange-CrossTenant-OriginalArrivalTime
bayes_ignore_header X-MS-TrafficTypeDiagnostic
bayes_ignore_header X-MS-Exchange-CrossTenant-AuthAs
bayes_ignore_header X-MS-Exchange-Transport-CrossTenantHeadersStamped
bayes_ignore_header X-MS-Exchange-CrossTenant-AuthSource

--


Dave Wreski
President & CEO
Guardian Digital, Inc.
We Make Email Safe
640-800-9446
dwre...@guardiandigital.com
https://guardiandigital.com
103 Godwin Ave, Suite 314, Midland Park, NJ 07432







Re: Strange findings debugging bayes results

2023-02-16 Thread Henrik K
On Thu, Feb 16, 2023 at 10:18:50AM +0100, hg user wrote:
> I was investigating a bunch of bitcoin spam: different titles,
> different senders (all from gmail), different text, different pdf
> attachment.
> 
> Unfortunately in those days my bayes db was polluted and they all got
> a BAYES_50, 0.8.
> 
> I tested the messages now with a recreated bayes db and got some
> BAYES_999. So I dug to understand if I already saw the spam...
> 
> But the debug result was unpleasant:
> dbg: bayes: tokenized header: 92 tokens
> dbg: bayes: token 'HX-Received:Jan' => 0.998028449502134
> dbg: bayes: token 'HX-Google-DKIM-Signature:20210112' => 0.997244532803181
> dbg: bayes: token 'H*r:sk:' => 0.997244532803181
> dbg: bayes: token 'H*r:a05' => 0.995425742574258
> dbg: bayes: token 'HAuthentication-Results:sk:.' => 0.986543689320388
> dbg: bayes: token 'HX-Google-DKIM-Signature:reply-to' => 0.916110175863517
> dbg: bayes: token 'H*r:2002' => 0.877842810325844
> dbg: bayes: token 'HAuthentication-Results:2048-bit' => 0.858520043212023
> dbg: bayes: token 'HAuthentication-Results:pass' => 0.855193895034317
> dbg: bayes: score = 0.97915091326
> 
> 
> Every score is based on headers, very generic headers. and some
> related to my setup.
> 
> Not a single token from the message body

The Bayes implementation has been practically unmaintained for a long time,
so YMMV.

You can try something like this, most headers are parsed badly and generate
biasing random garbage (unscientific observation):

bayes_ignore_header ARC-Authentication-Results
bayes_ignore_header ARC-Message-Signature
bayes_ignore_header ARC-Seal
bayes_ignore_header Authentication-Results
bayes_ignore_header Autocrypt
bayes_ignore_header IronPort-SDR
bayes_ignore_header suggested_attachment_session_id
bayes_ignore_header X-Brightmail-Tracker
bayes_ignore_header X-Exchange-Antispam-Report-CFA-Test
bayes_ignore_header X-Forefront-Antispam-Report
bayes_ignore_header X-Forefront-Antispam-Report-Untrusted
bayes_ignore_header X-Gm-Message-State
bayes_ignore_header X-Google-DKIM-Signature
bayes_ignore_header x-microsoft-antispam
bayes_ignore_header X-Microsoft-Antispam-Message-Info
bayes_ignore_header X-Microsoft-Antispam-Message-Info-Original
bayes_ignore_header X-Microsoft-Antispam-Untrusted
bayes_ignore_header X-Microsoft-Exchange-Diagnostics
bayes_ignore_header x-ms-exchange-antispam-messagedata
bayes_ignore_header x-ms-exchange-antispam-messagedata-0
bayes_ignore_header x-ms-exchange-crosstenant-id
bayes_ignore_header x-ms-exchange-crosstenant-network-message-id
bayes_ignore_header x-ms-exchange-crosstenant-rms-persistedconsumerorg
bayes_ignore_header X-MS-Exchange-CrossTenant-userprincipalname
bayes_ignore_header x-ms-exchange-slblob-mailprops
bayes_ignore_header x-ms-office365-filtering-correlation-id
bayes_ignore_header X-MSFBL
bayes_ignore_header X-Provags-ID
bayes_ignore_header X-SG-EID
bayes_ignore_header X-SG-ID
bayes_ignore_header X-UI-Out-Filterresults
bayes_ignore_header X-ClientProxiedBy
bayes_ignore_header X-MS-Exchange-CrossTenant-FromEntityHeader
bayes_ignore_header X-OriginatorOrg
bayes_ignore_header X-MS-Exchange-CrossTenant-OriginalArrivalTime
bayes_ignore_header X-MS-TrafficTypeDiagnostic
bayes_ignore_header X-MS-Exchange-CrossTenant-AuthAs
bayes_ignore_header X-MS-Exchange-Transport-CrossTenantHeadersStamped
bayes_ignore_header X-MS-Exchange-CrossTenant-AuthSource



Strange findings debugging bayes results

2023-02-16 Thread hg user
I was investigating a bunch of bitcoin spam: different titles,
different senders (all from gmail), different text, different pdf
attachment.

Unfortunately in those days my bayes db was polluted and they all got
a BAYES_50, 0.8.

I tested the messages now with a recreated bayes db and got some
BAYES_999. So I dug to understand if I already saw the spam...

But the debug result was unpleasant:
dbg: bayes: tokenized header: 92 tokens
dbg: bayes: token 'HX-Received:Jan' => 0.998028449502134
dbg: bayes: token 'HX-Google-DKIM-Signature:20210112' => 0.997244532803181
dbg: bayes: token 'H*r:sk:' => 0.997244532803181
dbg: bayes: token 'H*r:a05' => 0.995425742574258
dbg: bayes: token 'HAuthentication-Results:sk:.' => 0.986543689320388
dbg: bayes: token 'HX-Google-DKIM-Signature:reply-to' => 0.916110175863517
dbg: bayes: token 'H*r:2002' => 0.877842810325844
dbg: bayes: token 'HAuthentication-Results:2048-bit' => 0.858520043212023
dbg: bayes: token 'HAuthentication-Results:pass' => 0.855193895034317
dbg: bayes: score = 0.97915091326


Every score is based on headers, very generic headers. and some
related to my setup.

Not a single token from the message body
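
To check a claim like this on a longer debug log, a rough sketch such as the
one below can separate header tokens from the rest. It assumes that each token
and its probability sit on one line, and that a header token is one starting
with "H", a header name (or a short form such as "*r") and a colon, which is
inferred from the sample above rather than taken from the SpamAssassin
sources. TOKEN_LINE, HEADER_TOKEN and split_tokens are hypothetical names.

```python
import re

# Matches lines such as: dbg: bayes: token 'H*r:a05' => 0.995425742574258
TOKEN_LINE = re.compile(r"dbg: bayes: token '(.*)' => ([0-9.]+)")

# Heuristic (assumption): header tokens look like 'H<header-name>:<value>'.
HEADER_TOKEN = re.compile(r"^H[\w*-]+:")


def split_tokens(debug_text):
    """Return (header_tokens, other_tokens) as lists of (token, probability)."""
    header, other = [], []
    for match in TOKEN_LINE.finditer(debug_text):
        token, probability = match.group(1), float(match.group(2))
        if HEADER_TOKEN.match(token):
            header.append((token, probability))
        else:
            other.append((token, probability))
    return header, other


SAMPLE = """\
dbg: bayes: token 'HX-Received:Jan' => 0.998028449502134
dbg: bayes: token 'H*r:a05' => 0.995425742574258
dbg: bayes: token 'HAuthentication-Results:pass' => 0.855193895034317
"""

if __name__ == "__main__":
    header, other = split_tokens(SAMPLE)
    print(len(header), "header tokens,", len(other), "other tokens")
```

If the second list stays empty for a message with real body text, that
supports the observation that only header tokens contributed to the score.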


Re: bayes in sqlite db

2022-08-16 Thread Matt Corallo
Heh, I know this thread is so old it might as well be dead, but this does work. Note that you may 
need to apply the patch from Bug 7932 until the next release.


bayes_store_module Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn DBI:SQLite:/path/to/bayes.sqlite

On 5/26/22 9:25 AM, Michael Grant wrote:

Does anyone have a working example of storing Bayes and user prefs in
SQLite?  I only see mysql and postgres schemas in 
/usr/share/doc/spamassassin/sql/

Michael Grant


bayes in sqlite db

2022-05-26 Thread Michael Grant
Does anyone have a working example of storing Bayes and user prefs in
SQLite?  I only see mysql and postgres schemas in 
/usr/share/doc/spamassassin/sql/

Michael Grant




Re: rules for a sneaky SPEAR-VIRUS spam that gets past bayes

2022-03-03 Thread Loren Wilton
Just off the top of my head:

rawbody   ONEDRIVE_DOWNLOAD   m'https://onedrive\.live\.com/download[?]cid='
score     ONEDRIVE_DOWNLOAD   0.5
describe  ONEDRIVE_DOWNLOAD   Download link to a file on Onedrive

Personally I'd be inclined to put an i on the end of that.

body      FILE_PWD_INFO   /\b(?:Fil lösenord|File password):\s[A-Z]{2}\d{4}\b/
score     FILE_PWD_INFO   3
describe  FILE_PWD_INFO   Email has a password to an archive file

meta      PWD_ONEDRIVE_DLOAD   ONEDRIVE_DOWNLOAD && FILE_PWD_INFO
score     PWD_ONEDRIVE_DLOAD   4
describe  PWD_ONEDRIVE_DLOAD   Email contains download for passworded Onedrive file
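
Before deploying patterns like these, it can help to try them against sample
text. The sketch below is a rough Python approximation, not SpamAssassin
itself: it reuses the two expressions from the rules above, applies the
case-insensitive flag only to the Onedrive pattern (the "i" mentioned above),
and ignores the difference between rawbody and body matching. The function
name pwd_onedrive_dload and the sample URL with cid=EXAMPLE are hypothetical.

```python
import re

# Patterns copied from the rules above; IGNORECASE mirrors the suggested "i".
ONEDRIVE_DOWNLOAD = re.compile(
    r"https://onedrive\.live\.com/download[?]cid=", re.IGNORECASE)
FILE_PWD_INFO = re.compile(
    r"\b(?:Fil lösenord|File password):\s[A-Z]{2}\d{4}\b")


def pwd_onedrive_dload(text):
    """Mimic the meta rule: both patterns must be present in the text."""
    return bool(ONEDRIVE_DOWNLOAD.search(text)) and bool(FILE_PWD_INFO.search(text))


if __name__ == "__main__":
    sample = ("See https://onedrive.live.com/download?cid=EXAMPLE\n"
              "File password: AB1234")
    print(pwd_onedrive_dload(sample))
```

Requiring both patterns, as the meta rule does, keeps an ordinary Onedrive
link or an unrelated password note from triggering the higher score on its
own.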

Loren


rules for a sneaky SPEAR-VIRUS spam that gets past bayes

2022-03-03 Thread Rob McEwen
rules for a sneaky SPEAR-VIRUS spam that gets past bayes because legit 
content from hijacked emails are copied into the spam, making it look 
like a follow-up msg of an existing legit conversation. Catch using 
these rules below. (Perhaps also add more to this to prevent rare FPs? 
But this is a good start!)


FILE SIZE < 50kb

then, on decoded/demime'd msg:

exact match on:
https://onedrive.live.com/download?cid=

Then a hit on THIS RegEx:
\b(Fil lösenord|File password): [A-Z]{2}\d{4}\b

(I'll let someone else jump in here and create and share the actual SA 
implementation of this, if desired - along with any suggested improvements)


-- Rob McEwen, invaluement


Re: Question about user specific bayes

2022-01-19 Thread Benny Pedersen

On 2022-01-18 22:34, Bill Cole wrote:

> Well, maybe? I don't currently have a system using per-user Bayes and
> it's been a bit since I set one up so hopefully someone who has a
> working rig will speak up...

fuglu has per-user bayes by default, and it recently fixed a problem 
where a mixed-case local part let a sender create another bayes user. 
Oops. I had hoped this was solved in spamassassin core, but maybe in 
sa 4.0.0

> Note that SA will try to create an empty DB if none exists.

and if spamd / spamc uses virtual sql users, or has static db files for 
all users with read/write permissions; ideally, if sqlite3 user prefs are 
configured, it could be very simple

> I'm not
> sure that I can think up a circumstance (other than a disappearing
> user) where fallback to global Bayes would happen.

is this even supported?

> SA will not fall
> back to a global Bayes DB just because an otherwise perfectly good
> per-user DB isn't properly seeded.

good


RE: Question about user specific bayes

2022-01-18 Thread Dino Edwards


> Note that SA will try to create an empty DB if none exists. I'm not sure that 
> I can think up a circumstance (other than a disappearing user) where fallback 
> to global Bayes would happen. SA will not fall back to a global Bayes DB 
> just because an otherwise perfectly good per-user DB isn't properly seeded.

It doesn't seem to be creating an empty database at all. Not sure why

> -Original Message-
> From: Bill Cole 
> Sent: Tuesday, January 18, 2022 12:23 PM
> To: users@spamassassin.apache.org
> Subject: Re: Question about user specific bayes
>
> On 2022-01-18 at 11:12:01 UTC-0500 (Tue, 18 Jan 2022 16:12:01 +) 
> Dino Edwards  is rumored to have said:
>
>> Hi,
>>
>> Trying to implement user specific bayes. My current setup is setup as 
>> follows in regards to global bayes. I'm also using amavis:
>>
>> bayes_path /opt/sa-bayes/bayes
>> bayes_file_mode 0777
>
> Don't do that anywhere. It's not safe.
>
>> use_bayes 1
>> use_bayes_rules 1
>> bayes_auto_learn 0
>> bayes_auto_learn_threshold_spam 15
>> bayes_auto_learn_threshold_nonspam -5
> [...]
>>
>> and it did seem to create bayes_toks and bayes_seen files under the 
>> /opt/sa-bayes-users/b...@domain.tld
>> directory as expected.
>
> So, it is working.
>
>> Is this all that's required to get this working?
>
> Yes
>
>> What happens to the global bayes file  in local.cf? Is that no longer 
>> used?
>
> I believe that it would be used if for some reason SA couldn't figure 
> out which user to pick for a scan at runtime. Maybe if spamd was 
> launched as a user that was later deleted?
>
> But generally, working per-user Bayes setup makes the global file 
> pointless and unused.
>
>>
>> How do the following settings from the local.cf figure in the user 
>> specific bayes files?
>>
>> use_bayes 1
>> use_bayes_rules 1
>> bayes_auto_learn 0
>> bayes_auto_learn_threshold_spam 15
>> bayes_auto_learn_threshold_nonspam -5
>
> The local.cf file is loaded before user_prefs, which is the last 
> config file loaded, so anything that can be changed in user_prefs 
> (i.e. all of those, I believe) which is set in user_prefs will 'stick'
>
> Note that in this case you're choosing to disable auto-learn, so the 
> threshold values are never used.
>
>> Do the user specific bayes have the same requirements to train them 
>> with at least 200 messages?
>
> Yes. Each Bayes DB must be seeded before it can be used. You should 
> also plan a way to regularly feed known spam and ham to those 
> databases, since you aren't auto-learning.
>
>> before they start working?
>
> Before SA will determine a Bayes score on incoming messages, yes.
>
>
>
>
> --
> Bill Cole
> b...@scconsult.com or billc...@apache.org (AKA @grumpybozo and many 
> *@billmail.scconsult.com addresses) Not Currently Available For Hire




Re: Question about user specific bayes

2022-01-18 Thread Bill Cole

On 2022-01-18 at 13:40:29 UTC-0500 (Tue, 18 Jan 2022 18:40:29 +)
Dino Edwards 
is rumored to have said:

Hi, thanks for the quick reply. So when amavis calls on SA for an 
incoming message, it will pass the recipient (e-mail address) in the 
%u variable and then SA will take that variable and look in the 
/opt/sa-bayes-users/%u directory for the existence of bayes database 
and if it finds one, it will use it provided it's properly seeded. If 
not, it will fall back to the global bayes. Is that correct?


Well, maybe? I don't currently have a system using per-user Bayes and 
it's been a bit since I set one up so hopefully someone who has a 
working rig will speak up...


Note that SA will try to create an empty DB if none exists. I'm not sure 
that I can think up a circumstance (other than a disappearing user) 
where fallback to global Bayes would happen. SA will not fall back to a 
global Bayes DB just because an otherwise perfectly good per-user DB 
isn't properly seeded.





-Original Message-
From: Bill Cole 
Sent: Tuesday, January 18, 2022 12:23 PM
To: users@spamassassin.apache.org
Subject: Re: Question about user specific bayes

On 2022-01-18 at 11:12:01 UTC-0500 (Tue, 18 Jan 2022 16:12:01 +) 
Dino Edwards  is rumored to have said:



Hi,

Trying to implement user specific bayes. My current setup is setup as
follows in regards to global bayes. I'm also using amavis:

bayes_path /opt/sa-bayes/bayes
bayes_file_mode 0777


Don't do that anywhere. It's not safe.


use_bayes 1
use_bayes_rules 1
bayes_auto_learn 0
bayes_auto_learn_threshold_spam 15
bayes_auto_learn_threshold_nonspam -5

[...]


and it did seem to create bayes_toks and bayes_seen files under the
/opt/sa-bayes-users/b...@domain.tld
directory as expected.


So, it is working.


Is this all that's required to get this working?


Yes


What happens to the global bayes file  in local.cf? Is that no longer
used?


I believe that it would be used if for some reason SA couldn't figure 
out which user to pick for a scan at runtime. Maybe if spamd was 
launched as a user that was later deleted?


But generally, working per-user Bayes setup makes the global file 
pointless and unused.




How do the following settings from the local.cf figure in the user
specific bayes files?

use_bayes 1
use_bayes_rules 1
bayes_auto_learn 0
bayes_auto_learn_threshold_spam 15
bayes_auto_learn_threshold_nonspam -5


The local.cf file is loaded before user_prefs, which is the last 
config file loaded, so anything that can be changed in user_prefs 
(i.e. all of those, I believe) which is set in user_prefs will 'stick'


Note that in this case you're choosing to disable auto-learn, so the 
threshold values are never used.



Do the user specific bayes have the same requirements to train them
with at least 200 messages?


Yes. Each Bayes DB must be seeded before it can be used. You should 
also plan a way to regularly feed known spam and ham to those 
databases, since you aren't auto-learning.



before they start working?


Before SA will determine a Bayes score on incoming messages, yes.




--
Bill Cole
b...@scconsult.com or billc...@apache.org (AKA @grumpybozo and many 
*@billmail.scconsult.com addresses) Not Currently Available For Hire



--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


RE: Question about user specific bayes

2022-01-18 Thread Dino Edwards
Hi, thanks for the quick reply. So when amavis calls on SA for an incoming 
message, it will pass the recipient (e-mail address) in the %u variable and 
then SA will take that variable and look in the /opt/sa-bayes-users/%u 
directory for the existence of bayes database and if it finds one, it will use 
it provided it's properly seeded. If not, it will fall back to the global 
bayes. Is that correct?

Thanks



-Original Message-
From: Bill Cole  
Sent: Tuesday, January 18, 2022 12:23 PM
To: users@spamassassin.apache.org
Subject: Re: Question about user specific bayes

On 2022-01-18 at 11:12:01 UTC-0500 (Tue, 18 Jan 2022 16:12:01 +) Dino 
Edwards  is rumored to have said:

> Hi,
>
> Trying to implement user specific bayes. My current setup is setup as 
> follows in regards to global bayes. I'm also using amavis:
>
> bayes_path /opt/sa-bayes/bayes
> bayes_file_mode 0777

Don't do that anywhere. It's not safe.

> use_bayes 1
> use_bayes_rules 1
> bayes_auto_learn 0
> bayes_auto_learn_threshold_spam 15
> bayes_auto_learn_threshold_nonspam -5
[...]
>
> and it did seem to create bayes_toks and bayes_seen files under the 
> /opt/sa-bayes-users/b...@domain.tld
> directory as expected.

So, it is working.

> Is this all that's required to get this working?

Yes

> What happens to the global bayes file  in local.cf? Is that no longer 
> used?

I believe that it would be used if for some reason SA couldn't figure out which 
user to pick for a scan at runtime. Maybe if spamd was launched as a user that 
was later deleted?

But generally, working per-user Bayes setup makes the global file pointless and 
unused.

>
> How do the following settings from the local.cf figure in the user 
> specific bayes files?
>
> use_bayes 1
> use_bayes_rules 1
> bayes_auto_learn 0
> bayes_auto_learn_threshold_spam 15
> bayes_auto_learn_threshold_nonspam -5

The local.cf file is loaded before user_prefs, which is the last config file 
loaded, so anything that can be changed in user_prefs (i.e. all of those, I 
believe) which is set in user_prefs will 'stick'

Note that in this case you're choosing to disable auto-learn, so the threshold 
values are never used.

> Do the user specific bayes have the same requirements to train them 
> with at least 200 messages?

Yes. Each Bayes DB must be seeded before it can be used. You should also plan a 
way to regularly feed known spam and ham to those databases, since you aren't 
auto-learning.

> before they start working?

Before SA will determine a Bayes score on incoming messages, yes.




--
Bill Cole
b...@scconsult.com or billc...@apache.org (AKA @grumpybozo and many 
*@billmail.scconsult.com addresses) Not Currently Available For Hire


Re: Question about user specific bayes

2022-01-18 Thread Bill Cole

On 2022-01-18 at 11:12:01 UTC-0500 (Tue, 18 Jan 2022 16:12:01 +)
Dino Edwards 
is rumored to have said:


Hi,

Trying to implement user specific bayes. My current setup is setup as 
follows in regards to global bayes. I'm also using amavis:


bayes_path /opt/sa-bayes/bayes
bayes_file_mode 0777


Don't do that anywhere. It's not safe.


use_bayes 1
use_bayes_rules 1
bayes_auto_learn 0
bayes_auto_learn_threshold_spam 15
bayes_auto_learn_threshold_nonspam -5

[...]


and it did seem to create bayes_toks and bayes_seen files under the 
/opt/sa-bayes-users/b...@domain.tld 
directory as expected.


So, it is working.


Is this all that's required to get this working?


Yes

What happens to the global bayes file  in local.cf? Is that no longer 
used?


I believe that it would be used if for some reason SA couldn't figure 
out which user to pick for a scan at runtime. Maybe if spamd was 
launched as a user that was later deleted?


But generally, working per-user Bayes setup makes the global file 
pointless and unused.




How do the following settings from the local.cf figure in the user 
specific bayes files?


use_bayes 1
use_bayes_rules 1
bayes_auto_learn 0
bayes_auto_learn_threshold_spam 15
bayes_auto_learn_threshold_nonspam -5


The local.cf file is loaded before user_prefs, which is the last config 
file loaded, so anything that can be changed in user_prefs (i.e. all of 
those, I believe) which is set in user_prefs will 'stick'


Note that in this case you're choosing to disable auto-learn, so the 
threshold values are never used.
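
The "last file loaded wins" behaviour described above can be sketched in a few lines of Python. This is an illustration only, not SA's parser: the function names and the per-user override are hypothetical, and the sketch does not check whether a given setting is actually permitted in user_prefs.

```python
def parse_settings(text: str) -> dict:
    """Parse 'name value' lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


def effective_settings(local_cf: str, user_prefs: str) -> dict:
    """Apply local.cf first, then user_prefs, so the later file overrides."""
    merged = parse_settings(local_cf)
    merged.update(parse_settings(user_prefs))
    return merged


# Site-wide values quoted in this thread.
LOCAL_CF = """\
use_bayes 1
use_bayes_rules 1
bayes_auto_learn 0
bayes_auto_learn_threshold_spam 15
bayes_auto_learn_threshold_nonspam -5
"""

# Hypothetical per-user override, for illustration only.
USER_PREFS = """\
bayes_auto_learn 1
"""

if __name__ == "__main__":
    for name, value in effective_settings(LOCAL_CF, USER_PREFS).items():
        print(name, value)
```

In this toy example the user's bayes_auto_learn 1 replaces the site-wide 0, while the thresholds from local.cf carry over unchanged.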


Do the user specific bayes have the same requirements to train them 
with at least 200 messages?


Yes. Each Bayes DB must be seeded before it can be used. You should also 
plan a way to regularly feed known spam and ham to those databases, 
since you aren't auto-learning.



before they start working?


Before SA will determine a Bayes score on incoming messages, yes.
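
To make the seeding requirement concrete, here is a minimal Python sketch of the gate. It assumes the "200 messages" figure applies to spam and ham separately, which to my knowledge matches the defaults of the bayes_min_spam_num and bayes_min_ham_num options; the function name and the example counts are hypothetical.

```python
def bayes_is_seeded(nspam: int, nham: int,
                    min_spam: int = 200, min_ham: int = 200) -> bool:
    """Return True once both the spam and the ham counts reach their minimums.

    A database with plenty of ham but too little spam (or the reverse)
    is still treated as unseeded in this sketch.
    """
    return nspam >= min_spam and nham >= min_ham


if __name__ == "__main__":
    # Hypothetical per-user counts, e.g. as reported by a corpus-size debug line.
    print(bayes_is_seeded(nspam=250, nham=900))
    print(bayes_is_seeded(nspam=12, nham=900))
```

With per-user databases this check applies to every user's DB individually, which is why each one needs its own training feed.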




--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Question about user specific bayes

2022-01-18 Thread Dino Edwards
Hi,

Trying to implement user specific bayes. My current setup is setup as follows 
in regards to global bayes. I'm also using amavis:

bayes_path /opt/sa-bayes/bayes
bayes_file_mode 0777
use_bayes 1
use_bayes_rules 1
bayes_auto_learn 0
bayes_auto_learn_threshold_spam 15
bayes_auto_learn_threshold_nonspam -5



According to various things I've read online, I've setup the following in 
/etc/default/spamassassin in an attempt to setup user specific bayes:


OPTIONS="--create-prefs --max-children 5 
--helper-home-dir=/opt/sa-bayes-users/%u -x -u amavis"

I've also created a bunch of subdirectories with usernames under 
/opt/sa-bayes-users. Example:

/opt/sa-bayes-users/b...@domain.tld
/opt/sa-bayes-users/la...@domain.tld

Etc...

I've set the owner of /opt/sa-bayes-users/ to amavis and I've also set the 
permissions to 700.

I've run a test sa-learn as follows where /mnt/data/amavis/clean/n/nTutbwTMVWzK 
is the actual e-mail file I use to train SA:

sa-learn --spam --dbpath /opt/sa-bayes-users/b...@domain.tld 
/mnt/data/amavis/clean/n/nTutbwTMVWzK

and it did seem to create bayes_toks and bayes_seen files under the 
/opt/sa-bayes-users/b...@domain.tld 
directory as expected.
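
For training many per-user databases this way, the sa-learn invocation above can be built programmatically. The following Python sketch only assembles the argument list; it uses the --spam, --ham and --dbpath options of sa-learn, while the function name, the example user and the example message path are hypothetical.

```python
import posixpath


def sa_learn_command(user: str, message_path: str, is_spam: bool,
                     base_dir: str = "/opt/sa-bayes-users") -> list:
    """Build the sa-learn argument list for one user's Bayes directory."""
    # Refuse names that could escape base_dir.
    if not user or "/" in user or user in (".", ".."):
        raise ValueError("unsafe user name: %r" % user)
    flag = "--spam" if is_spam else "--ham"
    dbpath = posixpath.join(base_dir, user)
    return ["sa-learn", flag, "--dbpath", dbpath, message_path]


if __name__ == "__main__":
    # Hypothetical user and message file.
    print(sa_learn_command("bob@example.com", "/tmp/sample.eml", is_spam=True))
```

The resulting list can be handed to subprocess.run(cmd, check=True) by whatever account owns the per-user directories (amavis in the setup described here).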

Is this all that's required to get this working?

What happens to the global bayes file  in local.cf? Is that no longer used?

How do the following settings from the local.cf figure in the user specific 
bayes files?

use_bayes 1
use_bayes_rules 1
bayes_auto_learn 0
bayes_auto_learn_threshold_spam 15
bayes_auto_learn_threshold_nonspam -5


Do the user specific bayes databases have the same requirement to be trained 
with at least 200 messages before they start working?

Thanks in advance




Re: Starting Clean with Bayes

2021-10-23 Thread John Hardin

On Sat, 23 Oct 2021, Benny Pedersen wrote:


On 2021-10-20 16:58, John Hardin wrote:

On Wed, 20 Oct 2021, Axb wrote:


On 10/19/21 8:06 PM, Jerry Malcolm wrote:


Where do I find a starter toks file?


You don't need a "starter" file.


Your Bayes starter is your training corpora, which you should retain
in case you ever need to start over from scratch as you're doing now.


no one asked how to make a backup/restore, with imho would have answered all 
this just like one would just use corpus retraining data


A backup is fine for migration.

A backup of a database that has gone off the rails is useless.

It's fairly well accepted that there's no such thing as a "generic starter 
Bayes database" due to the variability of people's ham.



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Are you a mildly tech-literate politico horrified by the level of
  ignorance demonstrated by lawmakers gearing up to regulate online
  technology they don't even begin to grasp? Cool. Now you have a
  tiny glimpse into a day in the life of a gun owner.   -- Sean Davis
---
 511 days since the first private commercial manned orbital mission (SpaceX)


Re: Starting Clean with Bayes

2021-10-22 Thread Benny Pedersen

On 2021-10-20 16:58, John Hardin wrote:

On Wed, 20 Oct 2021, Axb wrote:


On 10/19/21 8:06 PM, Jerry Malcolm wrote:


Where do I find a starter toks file?


You don't need a "starter" file.


Your Bayes starter is your training corpora, which you should retain
in case you ever need to start over from scratch as you're doing now.


no one asked how to make a backup/restore, with imho would have answered 
all this just like one would just use corpus retraining data


hmm :)

i just wish that its not only bayes that can be backup/restored but also 
TxRep and awl data


this will make it possible to change from postgresql to redis if needed, 
who will use mysql or berkdb ?


Starting Clean With Bayes

2021-10-20 Thread Jerry Malcolm
I am starting over with a clean install of SA on an AWS Linux2 EC2.  I 
am struggling with getting Bayes set up correctly.  I have a very old 
bayes_toks file from a Jam Windows install from about 4 years ago.  I 
created a userId for spamd, and I put the bayes_toks file in 
/home/spamd/bayes.  I set the bayes_path in local.cf to 
/home/spamd/bayes/bayes.  I changed the file owner to spamd:spamd.  I 
get the error message:


 cannot open bayes databases /home/spamd/bayes/bayes_* R/O: tie failed

I tried running the Spamassassin from the command line as sudo, and get 
the same error.  So I don't think it's a permissions issue.


So I moved the file out of the folder and now get:

 no dbs present, cannot tie DB R/O: /home/spamd/bayes/bayes_toks

So in the first case it finds the file but can't open it.  I found some 
posts on forums that suggested there's a possibility the file is so old 
the format is obsolete.  Fine with me.  At this point, I just want to 
start clean.  But I can't find a way to start using bayes from scratch 
with no toks file starting off.  I even did another clean install on a 
separate ec2 to see if SA would create an initial  toks file. But I 
couldn't find one.


My old toks file is probably of marginal value now anyway.  I just need 
to know where to find a brand new toks file to put into my bayes_path 
folder so it can start building up the ham/spam file and start 
contributing to my SA scores.


Where do I find a starter toks file?

Thx



Re: Starting Clean with Bayes

2021-10-20 Thread John Hardin

On Wed, 20 Oct 2021, Axb wrote:


On 10/19/21 8:06 PM, Jerry Malcolm wrote:


Where do I find a starter toks file?


You don't need a "starter" file.


Your Bayes starter is your training corpora, which you should retain in 
case you ever need to start over from scratch as you're doing now.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  At what point then is the approach of danger to be expected?
  I answer, if it ever reach us, it must spring up amongst us.
  It cannot come from abroad. If destruction be our lot, we must
  ourselves be its author and finisher. As a nation of freemen, we
  must live through all time, or die by suicide.   -- Abraham Lincoln
  ...popularly summarized as:
  "America will never be destroyed from the outside. If we falter
  and lose our freedoms, it will be because we destroyed ourselves."
---
 508 days since the first private commercial manned orbital mission (SpaceX)


Re: Starting Clean with Bayes

2021-10-20 Thread Axb

On 10/19/21 8:06 PM, Jerry Malcolm wrote:


Where do I find a starter toks file?


You don't need a "starter" file. As soon as it needs them, SA 
automagically creates the necessary files if it can write into the 
defined path.

Just feed it some spams and hams as per docs and you'll see the files.




Starting Clean with Bayes

2021-10-19 Thread Jerry Malcolm
I am starting over with a clean install of SA on an AWS Linux2 EC2.  I 
am struggling with getting Bayes set up correctly.  I have a very old 
bayes_toks file from a Jam Windows install from about 4 years ago.  I 
created a userId for spamd, and I put the bayes_toks file in 
/home/spamd/bayes.  I set the bayes_path in local.cf to 
/home/spamd/bayes/bayes.  I changed the file owner to spamd:spamd.  I 
get the error message:


 cannot open bayes databases /home/spamd/bayes/bayes_* R/O: tie failed

I tried running the Spamassassin from the command line as sudo, and get 
the same error.  So I don't think it's a permissions issue.


So I moved the file out of the folder and now get:

 no dbs present, cannot tie DB R/O: /home/spamd/bayes/bayes_toks

So in the first case it finds the file but can't open it.  I found some 
posts on forums that suggested there's a possibility the file is so old 
the format is obsolete.  Fine with me.  At this point, I just want to 
start clean.  But I can't find a way to start using bayes from scratch 
with no toks file starting off.  I even did another clean install on a 
separate ec2 to see if SA would create an initial  toks file. But I 
couldn't find one.


My old toks file is probably of marginal value now anyway.  I just need 
to know where to find a brand new toks file to put into my bayes_path 
folder so it can start building up the ham/spam file and start 
contributing to my SA scores.


Where do I find a starter toks file?

Thx



Re: Bayes autolearn: how does it resolve whether rules are body or header related?

2021-05-10 Thread RW
On Mon, 10 May 2021 20:39:31 +0200
Bert Van de Poel wrote:


> Based on what I've read, I agree that this is indeed a bug (or
> actually several). I've filed the following bug reports:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7904 (missing body 
> types, as mentioned by RW)
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7905 (meta
> tflags=net tests are ignored)
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7906 (meta 
> tflags!=net tests are always header tests)
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7907 (better
> support for meta tests in autolearning in general, with 2 possible
> solutions)
> 
> Thank you very much to RW and Matus Uhlar for helping me figure out
> what code to look at and for all three of you to confirm that this is
> clearly a set of bugs.


I don't agree that they are bugs. I think it would be useful to add
missing body types, but I don't think the rest is hugely wrong, and
it's not sensible for anyone to spend a lot of time on it. Particularly
when it is so easy to turn off the 3+3 test selectively with
autolearn_force.

Net meta rules usually contain scored net eval rules so it's sensible
to ignore them. Treating meta rules as header points seems to be erring
on the right side. There's a case for ignoring meta rules altogether.

Autolearning is something that's best avoided if at all possible.
Erring on the side of avoiding mistraining is a good thing.


bayes stopwords.cf missing ifplugin

2021-05-10 Thread Benny Pedersen



ups


Re: Bayes autolearn: how does it resolve whether rules are body or header related?

2021-05-10 Thread Bert Van de Poel

Dear Loren,

Thank you very much for your email. Based on your message I could deduce 
there were earlier messages (which I then read through a web archive). 
For some unexplained reason I never received the previous 3 responses to 
my email. I hope the university network isn't randomly over-filtering 
spam again (we've had those kinds of problems for a while now, it's 
quite a problem, we are much more careful about how we mark spam).


Based on what I've read, I agree that this is indeed a bug (or actually 
several). I've filed the following bug reports:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7904 (missing body 
types, as mentioned by RW)
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7905 (meta tflags=net 
tests are ignored)
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7906 (meta 
tflags!=net tests are always header tests)
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7907 (better support 
for meta tests in autolearning in general, with 2 possible solutions)


Thank you very much to RW and Matus Uhlar for helping me figure out what 
code to look at and for all three of you to confirm that this is clearly 
a set of bugs.


Feel free to file more bugs if you consider there are more based on my 
issue, as well as to give support, write suggestions or submit patches 
on the bugs I have already filed.


Kind regards,
Bert Van de Poel

On 10/05/2021 06:41, Loren Wilton wrote:

so you don't have points from body rules.

your mentioned URI_DEOBFU_INSTR is a meta rule:

meta URI_DEOBFU_INSTR __URI_DEOBFU_INSTR && !__MSGID_OK_HOST

so maybe it's not considered.


They are treated as header, or ignored if marked as net.


I think a bug report should be submitted for this.

Either they should be treated split 50/50 as header and body score, or 
when the metas are built they should have a "body rule" flag, and that 
used to determine where the score goes.


I tried, but for some reason apache decided that I'm evil and blocked 
the submission attempt, so someone else can do it.


   Loren





Re: Bayes autolearn: how does it resolve whether rules are body or header related?

2021-05-09 Thread Loren Wilton

so you don't have points from body rules.

your mentioned URI_DEOBFU_INSTR is a meta rule:

meta URI_DEOBFU_INSTR __URI_DEOBFU_INSTR && !__MSGID_OK_HOST

so maybe it's not considered.


They are treated as header, or ignored if marked as net.


I think a bug report should be submitted for this.

Either they should be treated split 50/50 as header and body score, or when 
the metas are built they should have a "body rule" flag, and that used to 
determine where the score goes.


I tried, but for some reason apache decided that I'm evil and blocked the 
submission attempt, so someone else can do it.


   Loren



Re: Bayes autolearn: how does it resolve whether rules are body or header related?

2021-05-09 Thread RW
On Sun, 9 May 2021 20:03:27 +0200
Matus UHLAR - fantomas wrote:


> so you don't have points from body rules.
> 
> your mentioned URI_DEOBFU_INSTR is a meta rule:
> 
> meta URI_DEOBFU_INSTR __URI_DEOBFU_INSTR && !__MSGID_OK_HOST
> 
> so maybe it's not considered.

They are treated as header, or ignored if marked as net. 


Re: Bayes autolearn: how does it resolve whether rules are body or header related?

2021-05-09 Thread Matus UHLAR - fantomas

On 09.05.21 04:17, Bert Van de Poel wrote:

Dear fellow Spamassassin users,

I recently noticed that quite a lot of spam emails with high scores 
weren't marked for Bayes autolearning. While some senders and 
receivers were a common match, explaining why autolearn was no, there 
was no clear explanation for other cases. I therefore put Spamassassin 
in debug mode to check in more detail, and noticed that fairly often 
autolearn is not used because the minimum score for body tests isn't 
achieved. After looking at some specific cases, it seems however that 
several rules are not considered when calculating the header 
rule score and body rule score for Bayes autolearning. I've always 
presumed these scores are calculated based on whether the underlying 
rule performs a regex on a header or on the body, but now I'm not so 
sure any more. I hope you can help clear up whether this is intended 
behaviour (and what that behaviour is) or whether I should report this 
as a bug.


One example I noticed is URI_DEOBFU_INSTR=3.595. This is if I 
understand it correctly a URI test that's performed on the body. 
Should a test like this be counted towards the body score count? Then 
there's the question of meta rules such as MONEY_NOHTML. If you 
resolve the different meta levels within this rule, it's a combination 
of header and body, however it's only counted towards the header 
score. Finally, it seems as if custom rules I've added within local.cf 
aren't considered. Is that indeed the case (and if so, is that by 
design)? I'm also not completely sure if UNWANTED_BODY_LANGUAGE and 
tests like razor, pyzor and DCC are considered for body scores.


Within the same realm, I'm also wondering whether these expected 
numbers for body and header can be tweaked and if so, how. For example 
the case below isn't autolearned even though it has a huge score and a 
vast amount of tests going off, but seemingly not enough body-related 
scores. Is that really the intended behaviour?


May  8 10:40:32 mail amavis[4076058]: (4076058-16) 
header_edits_for_quar:  -> 
, Yes, score=24.619 tag=- tag2=5 
kill=7.5 tests=[ADVANCE_FEE_3_NEW_MONEY=0.001, 
AXB_XMAILER_MIMEOLE_OL_024C2=0.001, BAYES_50=0.8, BERT_KULSPAM=1, 
FORGED_MUA_OUTLOOK=1.927, FREEMAIL_FORGED_REPLYTO=2.095, 
FREEMAIL_REPLYTO=1, FREEMAIL_REPLYTO_END_DIGIT=0.25, 
FROM_MISSPACED=0.001, FROM_MISSP_EH_MATCH=0.001, 
FROM_MISSP_FREEMAIL=0.001, FROM_MISSP_MSFT=0.001, 
FROM_MISSP_REPLYTO=2.497, FSL_BULK_SIG=0.001, FSL_CTYPE_WIN1251=0.001, 
FSL_NEW_HELO_USER=0.001, KHOP_HELO_FCRDNS=0.398, LOTS_OF_MONEY=0.001, 
MISSING_HEADERS=1.021, MISSING_MID=0.497, MONEY_FREEMAIL_REPTO=1.202, 
MONEY_FROM_MISSP=0.001, MONEY_NOHTML=2.497, NSL_RCVD_HELO_USER=0.001, 
PYZOR_CHECK=1.392, REPLYTO_WITHOUT_TO_CC=1.552, REPTO_419_FRAUD=2.996, 
SPF_HELO_NONE=0.001, TO_NO_BRKTS_FROM_MSSP=1.593, 
TO_NO_BRKTS_MSFT=1.888, XFER_LOTSA_MONEY=0.001] autolearn=no 
autolearn_force=no


Thank you in advance for your help. If you need any more examples or 
would like us to run some tests, then feel free to let me know.


looks like most of those are meta rules:

header FREEMAIL_REPLYTO_END_DIGIT
header MISSING_HEADERS
body BAYES_50
header SPF_HELO_NONE
header FSL_CTYPE_WIN1251
header NSL_RCVD_HELO_USER
header REPTO_419_FRAUD

score FREEMAIL_REPLYTO_END_DIGIT 0.25
score MISSING_HEADERS 0.915 1.207 1.204 1.021
score SPF_HELO_NONE 0.001

so you don't have points from body rules.

your mentioned URI_DEOBFU_INSTR is a meta rule:

meta URI_DEOBFU_INSTR __URI_DEOBFU_INSTR && !__MSGID_OK_HOST

so maybe it's not considered.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux IS user friendly, it's just selective who its friends are...


Re: Bayes autolearn: how does it resolve whether rules are body or header related?

2021-05-09 Thread RW
On Sun, 9 May 2021 04:17:26 +0200
Bert Van de Poel wrote:


> Within the same realm, I'm also wondering whether these expected
> numbers for body and header can be tweaked and if so, how.

You can create a meta-rule for definite spam and set:
 
tflags  autolearn_force

a hit on any rule with this flag set causes the 3+3 check to be
ignored. It does nothing else.
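
The decision described here can be written down as a small Python sketch. It follows only what this thread states: the spam threshold must be reached, and at least 3 header points plus 3 body points are required unless a rule flagged autolearn_force hit. The function and parameter names are hypothetical, and the score SA actually uses for this decision is computed differently from the reported total, so it is simply taken as an input here.

```python
def should_autolearn_spam(learn_score: float,
                          header_points: float,
                          body_points: float,
                          threshold_spam: float,
                          autolearn_force: bool = False,
                          min_points: float = 3.0) -> bool:
    """Decide whether a message would be auto-learned as spam."""
    # The threshold always applies; autolearn_force does not bypass it.
    if learn_score < threshold_spam:
        return False
    # A hit on an autolearn_force rule skips only the 3+3 check.
    if autolearn_force:
        return True
    return header_points >= min_points and body_points >= min_points


if __name__ == "__main__":
    # Hypothetical split: a high score that is almost entirely header points.
    print(should_autolearn_spam(24.6, 22.0, 0.8, threshold_spam=15))
    print(should_autolearn_spam(24.6, 22.0, 0.8, threshold_spam=15,
                                autolearn_force=True))
```

The first call models a message with a high score that is not learned because too few points come from body rules; the second shows the same message once a forcing rule has hit.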



One thing that does look wrong is that maybe_body_only() looks
for:

(($type == $TYPE_BODY_TESTS) || ($type == $TYPE_BODY_EVALS)
|| ($type == $TYPE_URI_TESTS) || ($type == $TYPE_URI_EVALS))

so it's missing any rawbody and full rules. 


Specifically Pyzor, Razor2 and DCC are full eval rules.
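
The gap can be illustrated with a toy Python model of the type check. The string labels below are hypothetical stand-ins for the $TYPE_* constants quoted above, and the "missing" set is an assumption based on the remark that rawbody and full rules are not covered; this is not SA's code.

```python
# Rule types the quoted condition accepts as body-related.
BODY_TYPES_CHECKED = {"body", "body_eval", "uri", "uri_eval"}

# Rule types reported in this thread as not covered (assumed labels).
BODY_TYPES_MISSING = {"rawbody", "rawbody_eval", "full", "full_eval"}


def counts_as_body(rule_type: str, include_missing: bool = False) -> bool:
    """Return True if a rule of this type would be counted as a body rule."""
    if rule_type in BODY_TYPES_CHECKED:
        return True
    return include_missing and rule_type in BODY_TYPES_MISSING


if __name__ == "__main__":
    # Types taken from this thread: BAYES_50 is a body rule and
    # PYZOR_CHECK is a full eval rule.
    for name, rule_type in [("BAYES_50", "body"), ("PYZOR_CHECK", "full_eval")]:
        print(name, counts_as_body(rule_type),
              counts_as_body(rule_type, include_missing=True))
```

Under the current check a PYZOR_CHECK hit contributes nothing to the body side; with the extended set it would.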



Bayes autolearn: how does it resolve whether rules are body or header related?

2021-05-08 Thread Bert Van de Poel

Dear fellow Spamassassin users,

I recently noticed that quite a lot of spam emails with high scores 
weren't marked for Bayes autolearning. While some senders and receivers 
were a common match, explaining why autolearn was no, there was no 
clear explanation for other cases. I therefore put Spamassassin in debug 
mode to check in more detail, and noticed that fairly often autolearn is 
not used because the minimum score for body tests isn't achieved. After 
looking at some specific cases, it seems however that several rules are 
not considered when calculating the header rule score and body 
rule score for Bayes autolearning. I've always presumed these scores are 
calculated based on whether the underlying rule performs a regex on a 
header or on the body, but now I'm not so sure any more. I hope you can 
help clear up whether this is intended behaviour (and what that 
behaviour is) or whether I should report this as a bug.


One example I noticed is URI_DEOBFU_INSTR=3.595. This is, if I understand 
it correctly, a URI test that's performed on the body. Should a test like 
this be counted towards the body score count? Then there's the question 
of meta rules such as MONEY_NOHTML. If you resolve the different meta 
levels within this rule, it's a combination of header and body, however 
it's only counted towards the header score. Finally, it seems as if 
custom rules I've added within local.cf aren't considered. Is that 
indeed the case (and if so, is that by design)? I'm also not completely 
sure if UNWANTED_BODY_LANGUAGE and tests like razor, pyzor and DCC are 
considered for body scores.


Within the same realm, I'm also wondering whether these expected numbers 
for body and header can be tweaked and if so, how. For example the case 
below isn't autolearned even though it has a huge score and a vast 
amount of tests going off, but seemingly not enough body-related scores. 
Is that really the intended behaviour?


May  8 10:40:32 mail amavis[4076058]: (4076058-16) 
header_edits_for_quar:  -> 
, Yes, score=24.619 tag=- tag2=5 kill=7.5 
tests=[ADVANCE_FEE_3_NEW_MONEY=0.001, 
AXB_XMAILER_MIMEOLE_OL_024C2=0.001, BAYES_50=0.8, BERT_KULSPAM=1, 
FORGED_MUA_OUTLOOK=1.927, FREEMAIL_FORGED_REPLYTO=2.095, 
FREEMAIL_REPLYTO=1, FREEMAIL_REPLYTO_END_DIGIT=0.25, 
FROM_MISSPACED=0.001, FROM_MISSP_EH_MATCH=0.001, 
FROM_MISSP_FREEMAIL=0.001, FROM_MISSP_MSFT=0.001, 
FROM_MISSP_REPLYTO=2.497, FSL_BULK_SIG=0.001, FSL_CTYPE_WIN1251=0.001, 
FSL_NEW_HELO_USER=0.001, KHOP_HELO_FCRDNS=0.398, LOTS_OF_MONEY=0.001, 
MISSING_HEADERS=1.021, MISSING_MID=0.497, MONEY_FREEMAIL_REPTO=1.202, 
MONEY_FROM_MISSP=0.001, MONEY_NOHTML=2.497, NSL_RCVD_HELO_USER=0.001, 
PYZOR_CHECK=1.392, REPLYTO_WITHOUT_TO_CC=1.552, REPTO_419_FRAUD=2.996, 
SPF_HELO_NONE=0.001, TO_NO_BRKTS_FROM_MSSP=1.593, 
TO_NO_BRKTS_MSFT=1.888, XFER_LOTSA_MONEY=0.001] autolearn=no 
autolearn_force=no


Thank you in advance for your help. If you need any more examples or 
would like us to run some tests, then feel free to let me know.


Kind regards,
Bert Van de Poel
ULYSSIS



Re: SA's bayes with the Redis backend?

2021-02-11 Thread Dean Carpenter
 

On 2021-02-11 12:58 pm, Alex wrote: 

> Hi,
> 
>> I've had good luck with using mariadb and galera to share the spamassassin 
>> database across systems. I run a small 3-node setup for email, 2x servers 
>> running dovecot replicating to each other, and a 3rd galera quorum server. 
>> Mariadb is master-master across all 3 nodes, so changes on any one are 
>> replicated to all the others via vpn. Works well, and for the amount of data 
>> in the spamassassin database, it replicates very quickly.
> 
> This sounds very interesting to me. Can you share more details about
> your configuration? I haven't worked with galera before, but have some
> experience with mariadb - it's currently set up as a single master
> with the actual mail relays being set up as slaves. I'd imagine the
> first thing is to convert them all to masters...
> 
> Any help would be greatly appreciated.
> Thanks,
> Alex

Sure. I have a (fairly complex) ansible playbook that sets up the whole
3-node cluster, but here are the relevant details. 

This is the galera portion of _/etc/mysql/my.cnf_ 

> # 
> # * Galera-related settings 
> # 
> [galera] 
> bind-address = 0.0.0.0 
> binlog_format = row 
> default_storage_engine = InnoDB 
> innodb_autoinc_lock_mode = 2 
> innodb_flush_log_at_trx_commit = 2 
> wsrep_cluster_address = gcomm://master.vpn,dove1.vpn,dove2.vpn 
> wsrep_node_name = master.vpn 
> 
> # Need to specify vpn address here, not public address
> wsrep_node_address = 192.168.100.50 
> 
> wsrep_cluster_name = my_cluster 
> wsrep_on = 1 
> wsrep_provider = /usr/lib/galera/libgalera_smm.so 
> wsrep_sst_auth = "root:my_sekrit_password"

And this is the ansible role mariadb_restart/tasks/main.yml. This gets
called whenever another mariadb-affecting task sets the DO_RESTART
variable. This will cleanly restart the whole mariadb galera cluster. 

> - become: yes
>   block:
>
>     - name: Check status of mysqld
>       command: systemctl status mysql
>       ignore_errors: yes
>       changed_when: false
>       register: mysql_status
>
>     - name: Gracefully stop mysql on all nodes to start up cluster
>       service:
>         name: "mysql"
>         state: "stopped"
>       register: mysql_stopped
>       when: mysql_status is succeeded
>
>     - name: Force kill mysqld if stuck in starting state
>       command: pkill -9 mysqld
>       ignore_errors: yes
>       changed_when: false
>       when: mysql_stopped is failed
>
>     - name: Clear RAM caches to free up space
>       command: sysctl -w vm.drop_caches=3
>       when: ansible_virtualization_type != "openvz"
>       changed_when: false
>
>     - name: Check if grastate.dat file exists for bootstrapping node0
>       stat:
>         path: /var/lib/mysql/grastate.dat
>       register: grastate_exists
>       when: inventory_hostname == play_hosts[0]
>
>     - name: Force node0 to be a new bootstrap node
>       lineinfile:
>         dest: /var/lib/mysql/grastate.dat
>         regexp: 'safe_to_bootstrap: 0'
>         line: 'safe_to_bootstrap: 1'
>       when:
>         - inventory_hostname == play_hosts[0]
>         - grastate_exists.stat.exists
>
>     - name: bootstrap a new cluster with galera_new_cluster
>       shell: /usr/bin/galera_new_cluster
>       when: inventory_hostname == play_hosts[0]
>
>     - name: add slave nodes to the cluster
>       service:
>         name: "mysql"
>         state: "started"
>       when: inventory_hostname != play_hosts[0]
>
>     - name: Stop mysql on node0
>       service:
>         name: "mysql"
>         state: "stopped"
>       when: inventory_hostname == play_hosts[0]
>
>     - name: re-add master node to the cluster
>       service:
>         name: "mysql"
>         state: "started"
>       when: inventory_hostname == play_hosts[0]
>
>   #
>   # block
>   when: do_restart | bool

That's the gist. It can also be run via a small script that runs
ansible-playbook like this 

> ansible-playbook -K -i hosts mariadb.yml --extra-vars "{do_restart: true}"

In the list of hosts, node0 is the master or quorum node. The other two
are the dovecot replication nodes (dovecot, exim4, roundcube, etc) 

I have a bash alias to check on cluster status ... 

> alias cluster="mysql -B -s -N -e \"show status like '%wsrep_cluster%';\""
> 
> dc@master:~$ cluster
> wsrep_cluster_weight 3
> wsrep_cluster_capabilities
> wsrep_cluster_conf_id 10
> wsrep_cluster_size 3
> wsrep_cluster_state_uuid 480b440d-6643-11eb-94bc-5b47cf0676a8
> wsrep_cluster_status Primary
 

Re: SA's bayes with the Redis backend?

2021-02-11 Thread Antony Stone
On Thursday 11 February 2021 at 17:21:41, deano-spamassas...@areyes.com wrote:

> Is there an easy/efficient way of converting an existing mariadb bayes
> database to redis?
> 
> Perhaps "sa-learn --backup", set up redis, then restore?

https://www.mail-archive.com/users@spamassassin.apache.org/msg107512.html 
answers this for you, I think :)


Antony.

-- 
There are two possible outcomes:

 If the result confirms the hypothesis, then you've made a measurement.
 If the result is contrary to the hypothesis, then you've made a discovery.

 - Enrico Fermi

   Please reply to the list;
 please *don't* CC me.


Re: SA's bayes with the Redis backend?

2021-02-11 Thread deano-spamassassin
 

On 2021-02-11 9:54 am, Alex wrote: 

> Hi,
> 
>>> There is no real question, but what I would like to find out is (and to
>>> ask), does it scale and are there any pitfalls?
>>> Naturally, we would look at doing HA, but I am asking for any
>>> comment, any tip, any opinion on using redis for bayes.
>> 
>> Been using it from day one (I'm partly to blame we have this) and it
>> scales VERY well. Bayes processing bottleneck has become a thing of the
>> past.
>> Pitfalls? None so far.
>> 
>> I wouldn't go back anymore.
>> Obviously, it's global only, no per user.
> 
> Is there an easy/efficient way of converting an existing mariadb bayes
> database to redis?
> 
> Perhaps "sa-learn --backup", set up redis, then restore?
> 
> I know I've been less than successful in the past when migrating from
> one version of mariadb to another, so just wondering how successful
> this approach would be.
> 
> The problem I'm having with bayes in mariadb is being able to use a
> central database server for the database, while reading and updating
> it from remote systems. Will redis solve this problem?
> 
> # sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0      11083          0  non-token data: nspam
> 0.000          0      48363          0  non-token data: nham
> 0.000          0    3709015          0  non-token data: ntokens
> 0.000          0 1372117134          0  non-token data: oldest atime
> 0.000          0 1613055126          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0 1606461007          0  non-token data: last expiry atime
> 0.000          0          0          0  non-token data: last expire atime delta
> 0.000          0          0          0  non-token data: last expire reduction count

I've had good luck with using mariadb and galera to share the
spamassassin database across systems. I run a small 3-node setup for
email, 2x servers running dovecot replicating to each other, and a 3rd
galera quorum server. Mariadb is master-master across all 3 nodes, so
changes on any one are replicated to all the others via vpn. 

Works well, and for the amount of data in the spamassassin database, it
replicates very quickly. 

Re: SA's bayes with the Redis backend?

2021-02-11 Thread Alex
Hi,

> > There is no real question, but what I would like to find out is (and to
> > ask), does it scale and are there any pitfalls?
> > Naturally, we would look at doing HA, but I am asking for any
> > comment, any tip, any opinion on using redis for bayes.
>
> Been using it from day one (I'm partly to blame we have this) and it
> scales VERY well. Bayes processing bottleneck has become a thing of the
> past.
> Pitfalls? None so far.
>
> I wouldn't go back anymore.
> Obviously, it's global only, no per user.

Is there an easy/efficient way of converting an existing mariadb bayes
database to redis?

Perhaps "sa-learn --backup", set up redis, then restore?

I know I've been less than successful in the past when migrating from
one version of mariadb to another, so just wondering how successful
this approach would be.

The problem I'm having with bayes in mariadb is being able to use a
central database server for the database, while reading and updating
it from remote systems. Will redis solve this problem?

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      11083          0  non-token data: nspam
0.000          0      48363          0  non-token data: nham
0.000          0    3709015          0  non-token data: ntokens
0.000          0 1372117134          0  non-token data: oldest atime
0.000          0 1613055126          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1606461007          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
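
The atime values in that dump are Unix timestamps, which are easier to judge as dates. The following is a rough Python sketch for reading such output. parse_dump_magic() is a hypothetical helper, not part of SpamAssassin, and it assumes each line has four numeric columns followed by a "non-token data:" label, as in the dump above.

```python
from datetime import datetime, timezone

PREFIX = "non-token data: "


def parse_dump_magic(text):
    """Map each 'non-token data' label to the integer in the third column."""
    values = {}
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[4].startswith(PREFIX):
            values[parts[4][len(PREFIX):].strip()] = int(parts[2])
    return values


# Three lines copied from the dump in this message.
sample = """\
0.000  0  3709015  0  non-token data: ntokens
0.000  0  1372117134  0  non-token data: oldest atime
0.000  0  1613055126  0  non-token data: newest atime
"""

magic = parse_dump_magic(sample)
for key in ("oldest atime", "newest atime"):
    stamp = datetime.fromtimestamp(magic[key], tz=timezone.utc)
    print(key, stamp.date().isoformat())
```

For this dump the oldest token access falls in June 2013 and the newest on the day of the post, so the 3.7 million tokens span several years of mail.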


Re: SA's bayes with the Redis backend?

2021-02-10 Thread Axb

Hi Brent,

On 2/10/21 12:21 PM, Brent Clark wrote:
> Good day Guys
>
> I just want to check with the community, is there anybody using SA's 
> bayes with the Redis backend?
>
> I work at a largish ISP, so we're talking lots of mail.
>
> There is no real question, but what I would like to find out is (and to 
> ask), does it scale and are there any pitfalls?
> Naturally, we would look at doing HA, but I am asking for any 
> comment, any tip, any opinion on using redis for bayes.

Been using it from day one (I'm partly to blame we have this) and it 
scales VERY well. Bayes processing bottleneck has become a thing of the 
past.

Pitfalls? None so far.

I wouldn't go back anymore.
Obviously, it's global only, no per user.

Axb



SA's bayes with the Redis backend?

2021-02-10 Thread Brent Clark

Good day Guys

I just want to check with the community, is there anybody using SA's bayes with 
the Redis backend?

I work at a largish ISP, so we're talking lots of mail.

There is no real question, but what I would like to find out is (and to ask), 
does it scale and are there any pitfalls?
Naturally, we would look at doing HA, but I am asking for any comment, any 
tip, any opinion on using redis for bayes.

Thanks in advance.

Regards
Brent




Re: Bayes conversion: SQL--> Redis?

2021-02-04 Thread Kevin A. McGrail



On 2/4/2021 5:32 AM, Giovanni Bechis wrote:
> On 2/4/21 10:47 AM, Dan Mahoney (Gushi) wrote:
>> Hey there all,
>>
>> In looking at my sql server, it looks like the on-disk size of my MySQL DB's is 
>> like 9G (because of InnoDB, it's hard to glean just from the filesystem what 
>> tables are which).
>>
>> Anyway, I'd like to move over to a global redis system, but I don't see an easy 
>> way to convert from bayes SQL to redis bayes.
>>
>> Is this somewhere and I can't find it?
>
> "sa-learn --backup" with old config and "sa-learn --restore" with new one 
> should do what you need.
>
>   Giovanni

Hi Gushi, I also like to use innodb-file-per-table = 1 so I don't have 
one centralized innodb file.




Re: Bayes conversion: SQL--> Redis?

2021-02-04 Thread Giovanni Bechis
On 2/4/21 10:47 AM, Dan Mahoney (Gushi) wrote:
> Hey there all,
> 
> In looking at my sql server, it looks like the on-disk size of my MySQL DB's 
> is like 9G (because of InnoDB, it's hard to glean just from the filesystem 
> what tables are which).
> 
> Anyway, I'd like to move over to a global redis system, but I don't see an 
> easy way to convert from bayes SQL to redis bayes.
> 
> Is this somewhere and I can't find it?
> 
"sa-learn --backup" with old config and "sa-learn --restore" with new one 
should do what you need.

 Giovanni
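
The two steps could be scripted roughly as follows. This is a sketch under assumptions, not a tested migration tool: dump_bayes() and restore_bayes() are hypothetical wrappers, the backup path is arbitrary, and switching the Bayes storage settings from SQL to Redis between the two calls is left to the administrator. The only sa-learn options used are --backup, which writes the database to standard output, and --restore, which reads it from a file.

```python
import subprocess


def dump_bayes(backup_path, run=subprocess.run):
    """Step 1: run while the OLD (SQL) bayes settings are still active."""
    with open(backup_path, "w") as out:
        # sa-learn --backup writes the whole database to stdout.
        run(["sa-learn", "--backup"], stdout=out, check=True)


def restore_bayes(backup_path, run=subprocess.run):
    """Step 2: run after the config has been switched to the NEW (Redis) settings."""
    run(["sa-learn", "--restore", backup_path], check=True)


if __name__ == "__main__":
    dump_bayes("/tmp/bayes.backup")
    input("Switch the bayes store settings to Redis, then press Enter... ")
    restore_bayes("/tmp/bayes.backup")
```

Running "sa-learn --dump magic" before and after is a simple way to compare the nspam, nham and ntokens counts.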



Bayes conversion: SQL--> Redis?

2021-02-04 Thread Dan Mahoney (Gushi)

Hey there all,

In looking at my sql server, it looks like the on-disk size of my MySQL 
DB's is like 9G (because of InnoDB, it's hard to glean just from the 
filesystem what tables are which).


Anyway, I'd like to move over to a global redis system, but I don't see an 
easy way to convert from bayes SQL to redis bayes.


Is this somewhere and I can't find it?

-Dan

--

Dan Mahoney
Techie,  Sysadmin,  WebGeek
Gushi on efnet/undernet IRC
FB:  fb.com/DanielMahoneyIV
LI:   linkedin.com/in/gushi
Site:  http://www.gushi.org
---



Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-22 Thread Matus UHLAR - fantomas

On 21.01.21 13:41, Emanuel Gonzalez wrote:
>Anyway, the error still occurs even with low configuration values.
>
>Jan 21 10:39:43 eternia6 spamd[28053]: bayes: cannot open bayes databases /var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
>Jan 21 10:39:43 eternia6 spamd[28299]: bayes: cannot open bayes databases /var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
>Jan 21 10:39:43 eternia6 spamd[28273]: bayes: cannot open bayes databases /var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
>
>Does anyone know a way to fix it?

I have mentioned that before, citing from the message you quoted:

>>If you process too much mail, you could store bayes database in SQL or
>>redis. However, first lower amount of processes.



--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Support bacteria - they're the only culture some people have.


Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-21 Thread RW
On Thu, 21 Jan 2021 14:08:59 +0100
Matus UHLAR - fantomas wrote:

 
> journalling may help a bit, but it makes no sense to parse more mail
> within one CPU at the same time.

That's true provided that everything remains completely CPU limited.

The problem is that if you run any network tests and something becomes
slow or unreliable, child processes can spend most of their time
blocked. If you have multiple processes per core, the throughput can be
more reliable.

I'd start with 5 processes per core and see how it goes. 


> >model name  : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz  
> 
> 4 cores, 8 threads. provided you only have one CPU.
> 
> I'd set max-children to 4 and not set min-children,min-spare and
> max-spare at all.

If you do that you implicitly set them to 2,1 and 2 respectively.

If you want a fixed number you can set the min and max values equal.
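
As a small illustration of that last point, the Python sketch below builds the two options for a fixed pool size. spamd_child_options() is a hypothetical helper, and the per-core multiplier is only a starting guess taken from the numbers discussed in this thread (2 per CPU on the cautious side, 5 per core when network tests dominate).

```python
def spamd_child_options(cores, per_core=2):
    """Return spamd options that pin the number of children to cores * per_core."""
    children = max(1, cores * per_core)
    # Equal min and max values give a fixed number of children.
    return f"--min-children={children} --max-children={children}"


print(spamd_child_options(4))              # 4 cores, 2 children per core
print(spamd_child_options(4, per_core=5))  # 4 cores, 5 children per core
```

Either result is far below the --max-children=80 from the original SPAMDOPTIONS line.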


RE: Error "cannot open bayes databases" lock failed: File exists

2021-01-21 Thread Emanuel Gonzalez
Anyway, the error still occurs even with low configuration values.

Jan 21 10:39:43 eternia6 spamd[28053]: bayes: cannot open bayes databases /var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Jan 21 10:39:43 eternia6 spamd[28299]: bayes: cannot open bayes databases /var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Jan 21 10:39:43 eternia6 spamd[28273]: bayes: cannot open bayes databases /var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists

Does anyone know a way to fix it?

Regards Emanuel.



RE: Error "cannot open bayes databases" lock failed: File exists

2021-01-21 Thread Emanuel Gonzalez
I'm testing right now. I have lowered the parameters, but in the logs I see an 
error or warning:

prefork: adjust: 3 idle children more than 2 maximum idle children. Decreasing 
spamd children: 28057 killed.

Can that message cause slow analysis of emails?

In my infrastructure I have about 10 physical servers running SpamAssassin; 
requests are balanced between them using the keepalived service.

Regards, Emanuel.





Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-21 Thread Matus UHLAR - fantomas

On 20.01.21 18:31, Emanuel Gonzalez wrote:
>Can the problem be caused by the number of processes?

By the number of concurrent processes trying to write to the bayes DB at the
same time.

Journalling may help a bit, but it makes no sense to parse more mail within
one CPU at the same time.

># Server CPU
>
>cpu family  : 6
>model   : 60
>model name  : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

4 cores, 8 threads, provided you only have one CPU.

I'd set max-children to 4 and not set min-children, min-spare and max-spare
at all.

... on some systems I disable HT CPUs by taking them offline in /etc/sysfs.conf:

devices/system/cpu/cpu4/online = 0
devices/system/cpu/cpu5/online = 0
devices/system/cpu/cpu6/online = 0
devices/system/cpu/cpu7/online = 0

I think since spectre/meltdown it's a good idea, and some systems reported
high dummy CPU usage when those were enabled.

># SpamAssassin
>
>SPAMDOPTIONS="-u spamd --min-children=30 --max-children=80 --min-spare=25 --max-spare=80 --timeout-child=60 --max-conn-per-child=150"



ohh!  too many processes.  I don't recommend more spamd processes than e.g.
2x number of CPUs. maybe even less.
It does not make sense to run too many processes in parallel.

If you process too much mail, you could store bayes database in SQL or
redis. However, first lower amount of processes.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
10 GOTO 10 : REM (C) Bill Gates 1998, All Rights Reserved!


RE: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread Emanuel Gonzalez
Can the problem be caused by the number of processes?

# Server CPU

cpu family  : 6
model   : 60
model name  : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

# SpamAssassin

SPAMDOPTIONS="-u spamd --min-children=30 --max-children=80 --min-spare=25 --max-spare=80 --timeout-child=60 --max-conn-per-child=150"

What change do I need to apply?

Regards, Emanuel.



Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread Matus UHLAR - fantomas

On 20.01.21 14:50, Emanuel Gonzalez wrote:
>Hello Matus, thanks for your reply.
>
># ls -la /var/spamassassin/bayesdb/bayes
>
>ls: cannot access /var/spamassassin/bayesdb/bayes: No such file or directory
>
>I see an error about a nonexistent file.

Sorry, that was supposed to be:

ls -la /var/spamassassin/bayesdb/

so we can see hidden files too.

/var/spamassassin/bayesdb/bayes* does NOT show hidden files.

...however you showed us many lock files, which should explain.
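
The directory listing posted earlier in this thread was taken on 20 January and still contained bayes.lock.* files dated 2, 5, 15 and 18 January. A read-only Python sketch like the one below can list such files with their age. list_bayes_locks() is a hypothetical helper that only assumes the lock files start with "bayes.lock", as they do in that listing; it deletes nothing and does not decide whether a lock is safe to remove.

```python
import os
import time


def list_bayes_locks(directory, now=None):
    """Return (file name, age in days) for each bayes.lock* file, oldest first."""
    now = time.time() if now is None else now
    locks = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.startswith("bayes.lock"):
            age_days = (now - entry.stat().st_mtime) / 86400
            locks.append((entry.name, round(age_days, 1)))
    return sorted(locks, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Path taken from the thread; adjust it to your own bayes_path.
    for name, age in list_bayes_locks("/var/spamassassin/bayesdb"):
        print(f"{age:6.1f} days  {name}")
```

Lock files that are days old while spamd is busy suggest that the children that created them are long gone.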



># lsof /var/spamassassin/bayesdb/bayes_journal /var/spamassassin/bayesdb/bayes_seen /var/spamassassin/bayesdb/bayes_toks
>
>COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
>spamd   25467 spamd   12r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
>spamd   25470 spamd   15r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
>spamd   25491 spamd   36r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
>spamd   25494 spamd   39r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
>spamd   25502 spamd   47r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
>[...]

ohh!  too many processes.  I don't recommend more spamd processes than e.g. 
2x number of CPUs. maybe even less.

It does not make sense to run too many processes in parallel.

If you process too much mail, you could store bayes database in SQL or
redis. However, first lower amount of processes.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759


Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread RW
On Wed, 20 Jan 2021 14:50:53 +
Emanuel Gonzalez wrote:


> # lsof /var/spamassassin/bayesdb/bayes_journal /var/spamassassin/bayesdb/bayes_seen /var/spamassassin/bayesdb/bayes_toks
> 
> COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
> spamd   25467 spamd   12r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
> spamd   25467 spamd   13r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
> spamd   25470 spamd   15r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
> spamd   25470 spamd   16r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
...
> spamd   29921 spamd  192r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
> spamd   29921 spamd  193r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen

Do you actually need so many child processes? You have 40 in Bayes
alone and in a previous post you had "--round-robin" with
"--max-children=180", i.e. a fixed number of 180 in total. 




RE: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread Emanuel Gonzalez
Hello,

-rw------- 1 spamd spamd     224 Jan 20 13:45 bayes.lock
-rw------- 1 spamd spamd      84 Jan  2 01:31 bayes.lock.eternia6.dattaweb.com.11016
-rw------- 1 spamd spamd     224 Jan  2 01:31 bayes.lock.eternia6.dattaweb.com.11251
-rw------- 1 spamd spamd      84 Jan  2 01:31 bayes.lock.eternia6.dattaweb.com.14855
-rw------- 1 spamd spamd     224 Jan  2 01:31 bayes.lock.eternia6.dattaweb.com.16779
-rw------- 1 spamd spamd     224 Jan  5 01:37 bayes.lock.eternia6.dattaweb.com.25210
-rw------- 1 spamd spamd     168 Jan 20 11:29 bayes.lock.eternia6.dattaweb.com.25620
-rw------- 1 spamd spamd      28 Jan  5 01:37 bayes.lock.eternia6.dattaweb.com.25694
-rw------- 1 spamd spamd      28 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.29848
-rw------- 1 spamd spamd     112 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.29852
-rw------- 1 spamd spamd      28 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.29868
-rw------- 1 spamd spamd     224 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.29873
-rw------- 1 spamd spamd      54 Jan 15 17:47 bayes.lock.eternia6.dattaweb.com.3018
-rw------- 1 spamd spamd     252 Jan 19 11:22 bayes.lock.eternia6.dattaweb.com.30473
-rw------- 1 spamd spamd     252 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31005
-rw------- 1 spamd spamd     252 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31007
-rw------- 1 spamd spamd     224 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31009
-rw------- 1 spamd spamd     112 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31092
-rw------- 1 spamd spamd     112 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31095
-rw------- 1 spamd spamd     196 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31101
-rw------- 1 spamd spamd     196 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31149
-rw------- 1 spamd spamd     112 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31160
-rw------- 1 spamd spamd     252 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31274
-rw------- 1 spamd spamd     140 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31687
-rw------- 1 spamd spamd     168 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31733
-rw------- 1 spamd spamd      56 Jan 20 13:54 bayes.lock.eternia6.dattaweb.com.31836
-rw------- 1 spamd spamd     270 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5412
-rw------- 1 spamd spamd      54 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5429
-rw------- 1 spamd spamd     216 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5436
-rw------- 1 spamd spamd     108 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5443
-rw------- 1 spamd spamd     270 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5455
-rw------- 1 spamd spamd     243 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5493
-rw------- 1 spamd spamd     135 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5496
-rw------- 1 spamd spamd     270 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5524
-rw------- 1 spamd spamd     189 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5527
-rw------- 1 spamd spamd     108 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5529
-rw------- 1 spamd spamd      81 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5540
-rw------- 1 spamd spamd     243 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5549
-rw------- 1 spamd spamd     270 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5557
-rw------- 1 spamd spamd     162 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5574
-rw------- 1 spamd spamd      81 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5579
-rw------- 1 spamd spamd     108 Jan 18 10:11 bayes.lock.eternia6.dattaweb.com.5582
-rw------- 1 spamd spamd     216 Jan  2 01:31 bayes.lock.eternia6.dattaweb.com.9227
-rw------- 1 spamd spamd  720192 Jan 20 13:54 bayes_journal
-rwxr-xr-x 1 spamd spamd  172032 Dec 18 10:52 bayes_seen
-rwxr-xr-x 1 spamd spamd 5132288 Jan 20 13:45 bayes_toks



From: Dave Funk 
Sent: Wednesday, 20 January 2021 13:39
To: users@spamassassin.apache.org 
Subject: Re: Error "cannot open bayes databases" lock failed: File exists

On Wed, 20 Jan 2021, Matus UHLAR - fantomas wrote:

> On 20.01.21 11:07, Emanuel Gonzalez wrote:
>> Date: Wed, 20 Jan 2021 11:07:59 +
>> From: Emanuel Gonzalez 
>> To: SA Mailing list 
>> Subject: Re: Error "cannot open bayes databases" lock failed: File exists
>>
>> Hello everyone, I'm back from my vacation. I tried to solve this problem
>> but could not.
>>
>> I still see the error mentioned earlier in the SpamAssassin error logs:
>>
>> bayes_learn_to_journal 1
>> use_bayes yes
>> bayes_path /var/spamassassin/bayesdb/bayes
>> bayes_auto_learn 0
>> bayes_auto_expire 0
>>
>
> try:
>
> ls -la /var/spamassassin/bayesdb/bayes
> lsof /var/spamassassin/bayesdb/bayes_journal
> /var/spamassassin/bayesdb/bayes_seen /var/spamassassin/bayesdb/bayes_toks

Umm, the command:
   ls -la /var/spamassassin/bayesdb/bay

Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread Dave Funk

On Wed, 20 Jan 2021, Matus UHLAR - fantomas wrote:


On 20.01.21 11:07, Emanuel Gonzalez wrote:

Date: Wed, 20 Jan 2021 11:07:59 +
From: Emanuel Gonzalez 
To: SA Mailing list 
Subject: Re: Error "cannot open bayes databases" lock failed: File exists

Hello everyone, I'm back from my vacation. I tried to solve this problem but 
could not.


I still see the error mentioned earlier in the SpamAssassin error logs:

bayes_learn_to_journal 1
use_bayes yes
bayes_path /var/spamassassin/bayesdb/bayes
bayes_auto_learn 0
bayes_auto_expire 0



try:

ls -la /var/spamassassin/bayesdb/bayes
lsof /var/spamassassin/bayesdb/bayes_journal 
/var/spamassassin/bayesdb/bayes_seen /var/spamassassin/bayesdb/bayes_toks


Umm, the command:
  ls -la /var/spamassassin/bayesdb/bayes

should get you the error:

ls: cannot access /var/spamassassin/bayesdb/bayes : No such file or directory

On the other hand:

 ls -la /var/spamassassin/bayesdb/bayes*
(taken from the bayes_path parameter) should get you what you want.

even better:

 ls -la /var/spamassassin/bayesdb/
(to see if there are any leftover lock files in that directory)
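
A small sketch of that directory check, for anyone following along. It lists any bayes.lock.* files and reports whether the number at the end of each name belongs to a running process. It assumes the numeric suffix is the PID of the process that created the file, which nothing in this thread confirms, and check_bayes_locks is a made-up helper name. The script only reports; it does not delete anything.

```shell
#!/bin/sh
# Report bayes.lock.* files and whether their numeric suffix is a live PID.
# ASSUMPTION: the trailing number in each file name is a process ID.
check_bayes_locks() {
    dir=$1
    for f in "$dir"/bayes.lock.*; do
        [ -e "$f" ] || continue              # glob matched nothing
        pid=${f##*.}                         # text after the last dot
        case $pid in
            ''|*[!0-9]*) echo "skip  $f"; continue ;;   # suffix is not a number
        esac
        # kill -0 sends no signal; it only tests whether we could signal the PID.
        # Run as root or as the spamd user, otherwise a live process owned by
        # someone else is reported as stale.
        if kill -0 "$pid" 2>/dev/null; then
            echo "live  $pid $f"
        else
            echo "stale $pid $f"
        fi
    done
}

# Usage (path taken from the bayes_path setting in this thread):
#   check_bayes_locks /var/spamassassin/bayesdb
```

A "stale" line only means no such process is running right now; whether the file can be removed safely is a separate question.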


--
Dave Funk   University of Iowa
 College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center, 103 S Capitol St.
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


RE: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread Emanuel Gonzalez
Hello Matus, thanks for your reply.

# ls -la /var/spamassassin/bayesdb/bayes

ls: cannot access /var/spamassassin/bayesdb/bayes: No such file or directory

I get a "no such file" error.

# lsof /var/spamassassin/bayesdb/bayes_journal  
/var/spamassassin/bayesdb/bayes_seen /var/spamassassin/bayesdb/bayes_toks

COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
spamd   25467 spamd  12r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25467 spamd  13r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25470 spamd  15r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25470 spamd  16r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25491 spamd  36r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25491 spamd  37r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25494 spamd  39r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25494 spamd  40r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25502 spamd  47r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25502 spamd  48r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25503 spamd  48r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25503 spamd  49r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25504 spamd  51r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25504 spamd  52r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25506 spamd  51r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25506 spamd  52r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25514 spamd  59r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25514 spamd  60r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25515 spamd  60r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25515 spamd  70r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25520 spamd  68r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25520 spamd  69r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25536 spamd  81r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25536 spamd  82r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25537 spamd  84r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25537 spamd  85r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25542 spamd  87r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25542 spamd  88r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25544 spamd  90r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25544 spamd  91r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25546 spamd  91r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25546 spamd  92r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25552 spamd  97r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25552 spamd  98r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25561 spamd 106r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25561 spamd 107r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25568 spamd 113r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25568 spamd 114r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25573 spamd 118r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25573 spamd 119r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25574 spamd 119r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25574 spamd 120r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25586 spamd 131r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25586 spamd 132r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25588 spamd 133r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25588 spamd 134r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb/bayes_seen
spamd   25592 spamd 137r   REG    8,1  5132288 402667308 /var/spamassassin/bayesdb/bayes_toks
spamd   25592 spamd 138r   REG    8,1   172032 402828743 /var/spamassassin/bayesdb

Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread Matus UHLAR - fantomas

On 20.01.21 11:07, Emanuel Gonzalez wrote:

Date: Wed, 20 Jan 2021 11:07:59 +
From: Emanuel Gonzalez 
To: SA Mailing list 
Subject: Re: Error "cannot open bayes databases" lock failed: File exists

Hello everyone, I'm back from my vacation. I tried to solve this problem but 
could not.

I still see the error mentioned earlier in the SpamAssassin error logs:

bayes_learn_to_journal 1
use_bayes yes
bayes_path /var/spamassassin/bayesdb/bayes
bayes_auto_learn 0
bayes_auto_expire 0



try:

ls -la /var/spamassassin/bayesdb/bayes
lsof /var/spamassassin/bayesdb/bayes_journal  
/var/spamassassin/bayesdb/bayes_seen /var/spamassassin/bayesdb/bayes_toks


-rw--- 1 spamd spamd   48984 ene 20 08:06 /var/spamassassin/bayesdb/bayes_journal
-rwxr-xr-x 1 spamd spamd  172032 dic 18 10:52 /var/spamassassin/bayesdb/bayes_seen
-rwxr-xr-x 1 spamd spamd 5132288 ene 20 08:05 /var/spamassassin/bayesdb/bayes_toks

Jan 20 07:25:27 eternia6 spamd[22817]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Jan 20 07:25:27 eternia6 spamd[22916]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Jan 20 07:25:27 eternia6 spamd[22843]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists


Any ideas? I don't know how to resolve this error.



--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Atheism is a non-prophet organization.


Re: Error "cannot open bayes databases" lock failed: File exists

2021-01-20 Thread Emanuel Gonzalez
Hello everyone, I'm back from my vacation. I tried to solve this problem but 
could not.

I still see the error mentioned earlier in the SpamAssassin error logs:

bayes_learn_to_journal 1
use_bayes yes
bayes_path /var/spamassassin/bayesdb/bayes
bayes_auto_learn 0
bayes_auto_expire 0

#

-rw--- 1 spamd spamd   48984 ene 20 08:06 /var/spamassassin/bayesdb/bayes_journal
-rwxr-xr-x 1 spamd spamd  172032 dic 18 10:52 /var/spamassassin/bayesdb/bayes_seen
-rwxr-xr-x 1 spamd spamd 5132288 ene 20 08:05 /var/spamassassin/bayesdb/bayes_toks

Jan 20 07:25:27 eternia6 spamd[22817]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Jan 20 07:25:27 eternia6 spamd[22916]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Jan 20 07:25:27 eternia6 spamd[22843]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists


Any ideas? I don't know how to resolve this error.

Regards, Emanuel.


Re: Error "cannot open bayes databases" lock failed: File exists

2020-12-30 Thread Kris Deugau

Emanuel Gonzalez wrote:


# SpamAssassin Daemon config

SPAMDOPTIONS="-u spamd --round-robin --min-children=30 
--max-children=180 --min-spare=25 --max-spare=80 --timeout-child=60 
--max-conn-per-child=150
-i -A 
172.17.0.0/16,10.0.0.0/8,200.58.96.0/19,179.43.112.0/20,168.197.48.0/22,168.181.184.0/22,138.219.40.0/22,138.36.236.0/22,66.97.32.0/20"


Putting aside your Bayes error (which I'm pretty sure Matus answered), 
this seems like an awful lot of individual systems allowed to connect to 
a single spamd instance - it's not generally an end-user-accessible 
service.  Do you really need to access this spamd instance from ~20,000 
public IPs?
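
The ~20,000 figure can be checked from the prefix lengths in the -A list quoted above: a /N IPv4 network covers 2^(32-N) addresses. The sketch below is only that arithmetic. The split into "private" and "public" is hard-coded for the two RFC 1918 networks that appear in this particular list (10.0.0.0/8 and 172.17.0.0/16) and is not a general classifier.

```shell
#!/bin/sh
# Count the addresses covered by each network in the spamd -A list from the thread.
allowed="172.17.0.0/16,10.0.0.0/8,200.58.96.0/19,179.43.112.0/20,168.197.48.0/22,168.181.184.0/22,138.219.40.0/22,138.36.236.0/22,66.97.32.0/20"

public_total=0
private_total=0
old_ifs=$IFS
IFS=,
for net in $allowed; do
    prefix=${net#*/}                       # part after the slash
    size=$(( 1 << (32 - prefix) ))         # 2^(32 - prefix) addresses
    case $net in
        # ASSUMPTION: only these two entries of this list are private space.
        10.*|172.17.*) private_total=$(( private_total + size )) ;;
        *)             public_total=$(( public_total + size )) ;;
    esac
    echo "$net covers $size addresses"
done
IFS=$old_ifs

echo "private total: $private_total"
echo "public total:  $public_total"        # prints 20480 for the list above
```

The seven public networks add up to 8192 + 3 x 4096 + 4 x 1024 = 20480 addresses, which matches the estimate.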


-kgd


Re: Error "cannot open bayes databases" lock failed: File exists

2020-12-30 Thread Matus UHLAR - fantomas

On 30.12.20 13:53, Emanuel Gonzalez wrote:

Dec 30 09:56:57 eternia6 spamd[15993]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:56:57 eternia6 spamd[15915]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:56:58 eternia6 spamd[16002]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:56:59 eternia6 spamd[15960]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:57:00 eternia6 spamd[15847]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:57:01 eternia6 spamd[15909]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists

Could this be a permissions error?


Apparently not. It looks like one process has the database locked while
another process tries to write to it.
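
For readers unfamiliar with lock files, here is a generic sketch of how that kind of contention produces a "File exists" failure. It is not SpamAssassin's locking code. Each would-be holder writes a private temp file and tries to hard-link it to a shared lock name, and the link fails if the name is already taken. The directory, the host name "examplehost" and the holder numbers are all made up for the demonstration.

```shell
#!/bin/sh
# Generic lock-file contention demo; NOT SpamAssassin's locking code.
# Everything lives in a throwaway temp directory.
dir=$(mktemp -d)
lock="$dir/bayes.lock"

try_lock() {
    # $1 is a hypothetical holder ID standing in for a process ID.
    tmp="$lock.examplehost.$1"
    echo "$1" > "$tmp"
    # ln refuses to overwrite an existing name (EEXIST, "File exists"),
    # so only the first caller gets the lock.
    if ln "$tmp" "$lock" 2>/dev/null; then
        echo "holder $1: lock acquired"
        return 0
    fi
    rm -f "$tmp"
    echo "holder $1: lock failed: File exists"   # message written by this demo
    return 1
}

try_lock 1001 && first=0 || first=1     # succeeds: nobody holds the lock yet
try_lock 1002 && second=0 || second=1   # fails: holder 1001 still has it

rm -rf "$dir"
```

Until the first holder removes the lock name, every later attempt fails the same way, which is the general shape of the situation described above.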


drwsr-sr-x 3 spamd spamd 20 dic 18 10:26 /var/spamassassin
drwxr-xr-x 2 spamd spamd 60 dic 30 10:03 /var/spamassassin/bayesdb/

-rw--- 1 spamd spamd   66960 dic 30 10:03 bayes_journal
-rwxr-xr-x 1 spamd spamd  172032 dic 18 10:52 bayes_seen
-rwxr-xr-x 1 spamd spamd 5132288 dic 30 10:03 bayes_toks

# Bayes config

use_bayes yes
bayes_path /var/spamassassin/bayesdb/bayes
bayes_auto_learn 0
bayes_auto_expire 0

# SpamAssassin Daemon config

SPAMDOPTIONS="-u spamd --round-robin --min-children=30 --max-children=180 
--min-spare=25 --max-spare=80 --timeout-child=60 --max-conn-per-child=150
-i -A 
172.17.0.0/16,10.0.0.0/8,200.58.96.0/19,179.43.112.0/20,168.197.48.0/22,168.181.184.0/22,138.219.40.0/22,138.36.236.0/22,66.97.32.0/20"

I have read various posts about this error, but I don't know how to resolve it.

Any ideas, recommendations?


bayes_learn_to_journal 1


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
On the other hand, you have different fingers.


Error "cannot open bayes databases" lock failed: File exists

2020-12-30 Thread Emanuel Gonzalez
Good morning everyone,

In the SpamAssassin logs I see this error:

Dec 30 09:56:57 eternia6 spamd[15993]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:56:57 eternia6 spamd[15915]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:56:58 eternia6 spamd[16002]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:56:59 eternia6 spamd[15960]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:57:00 eternia6 spamd[15847]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists
Dec 30 09:57:01 eternia6 spamd[15909]: bayes: cannot open bayes databases 
/var/spamassassin/bayesdb/bayes_* R/W: lock failed: File exists

Could this be a permissions error?

drwsr-sr-x 3 spamd spamd 20 dic 18 10:26 /var/spamassassin
drwxr-xr-x 2 spamd spamd 60 dic 30 10:03 /var/spamassassin/bayesdb/

-rw--- 1 spamd spamd   66960 dic 30 10:03 bayes_journal
-rwxr-xr-x 1 spamd spamd  172032 dic 18 10:52 bayes_seen
-rwxr-xr-x 1 spamd spamd 5132288 dic 30 10:03 bayes_toks

# Bayes config

use_bayes yes
bayes_path /var/spamassassin/bayesdb/bayes
bayes_auto_learn 0
bayes_auto_expire 0

# SpamAssassin Daemon config

SPAMDOPTIONS="-u spamd --round-robin --min-children=30 --max-children=180 
--min-spare=25 --max-spare=80 --timeout-child=60 --max-conn-per-child=150
-i -A 
172.17.0.0/16,10.0.0.0/8,200.58.96.0/19,179.43.112.0/20,168.197.48.0/22,168.181.184.0/22,138.219.40.0/22,138.36.236.0/22,66.97.32.0/20"

I have read various posts about this error, but I don't know how to resolve it.

Any ideas, recommendations?

Regards, Emanuel.

