https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8271
Bug ID: 8271
Summary: bayes_stopword_ru in 60_bayes_stopwords.cf misses some
very common Russian stopwords
Product: Spamassassin
Version: 4.0.1
Hardware: PC
OS: FreeBSD
Status: NEW
Severity: enhancement
Priority: P2
Component: Rules
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: Undefined
The list of Russian stopwords compiled into bayes_stopword_ru is very long,
but it is missing some very common Russian stopwords.
How to reproduce:
cat <<EOT >> stopwords.mbox
From test Mon Jul 22 20:42:19 2024
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

=D0=B5=D1=81=D0=BB=D0=B8 =D0=B8=D0=BB=D0=B8 =D0=BD=D0=B5
EOT
spamassassin -L -D bayes --mbox stopwords.mbox
Only the first stopword, "если" ("if" in English), is found:
dbg: bayes: skipped token '\x{D0}\x{B5}\x{D1}\x{81}\x{D0}\x{BB}\x{D0}\x{B8}'
because it's in stopword list for language 'ru'
The stopwords \x{D0}\x{B8}\x{D0}\x{BB}\x{D0}\x{B8} ("или", "or" in English)
and \x{D0}\x{BD}\x{D0}\x{B5} ("не", "not" in English) are not recognized.
The full list of missing Russian stopwords is "на|по|не|от|для|или|за|из|что".
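To make the byte escapes above easier to read, here is a minimal Python sketch that decodes the quoted-printable body of the test message and the \x{..} token from the debug output back into text. The helper name decode_debug_token is hypothetical and not part of SpamAssassin; it assumes the token is UTF-8 bytes printed one escape per byte, as in the log line above.

```python
import quopri
import re


def decode_debug_token(token: str) -> str:
    """Convert '\\x{D0}\\x{B5}...' escapes from the debug log back to text.

    Hypothetical helper: assumes each escape is one byte of a UTF-8 string.
    """
    # Map each \x{HH} escape to the character with that code point, then
    # reinterpret those code points as raw bytes and decode them as UTF-8.
    byte_chars = re.sub(
        r"\\x\{([0-9A-Fa-f]{2})\}",
        lambda m: chr(int(m.group(1), 16)),
        token,
    )
    return byte_chars.encode("latin-1").decode("utf-8")


# Body line of the test message, exactly as given in the report.
body = quopri.decodestring(
    b"=D0=B5=D1=81=D0=BB=D0=B8 =D0=B8=D0=BB=D0=B8 =D0=BD=D0=B5"
).decode("utf-8")
words = body.split()

# The only token reported as skipped in the debug output.
skipped = decode_debug_token(r"\x{D0}\x{B5}\x{D1}\x{81}\x{D0}\x{BB}\x{D0}\x{B8}")

# Words present in the message that were not reported as skipped.
not_skipped = [w for w in words if w != skipped]
print(words, skipped, not_skipped)
```

The remaining words correspond to the two stopwords described above as missed.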
It would be great to update 60_bayes_stopwords.cf by adding these stopwords to
the bayes_stopword_ru rule.
P.S. The current version of bayes_stopword_ru in 60_bayes_stopwords.cf also
contains some low-frequency Russian words that cannot be considered
stopwords. For example, the word "иногда" ("sometimes" in English) was found in
only 7 of 8000 spam messages and 50 of 40000 ham messages in my own
(hand-selected) corpus.
Personally, I use the following custom list of stopwords in my SpamAssassin
installation:
bayes_stopword_ru (?^:(на|по|не|от|для|или|за|Вас|из|что|если|будет|Вам|Если|мы|Здравствуйте|есть|это|можно|только|вас|нужно|без|его))
Each of these words had a Bayes score between 0.4 and 0.6 (with a Bayes
database trained on my corpus with no stopwords configured) and was found in at
least 10% of the messages in my corpus.
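The selection rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names TokenCounts, doc_frequency, spam_probability and is_stopword_candidate are hypothetical, and the score is a plain ratio of per-class message rates, which is a simplification and not necessarily how SpamAssassin computes its Bayes token probability. The only data used is the "иногда" count quoted earlier in this report.

```python
from dataclasses import dataclass


@dataclass
class TokenCounts:
    spam_msgs: int  # number of spam messages containing the token
    ham_msgs: int   # number of ham messages containing the token


def doc_frequency(c: TokenCounts, n_spam: int, n_ham: int) -> float:
    """Fraction of all messages in the corpus that contain the token."""
    return (c.spam_msgs + c.ham_msgs) / (n_spam + n_ham)


def spam_probability(c: TokenCounts, n_spam: int, n_ham: int) -> float:
    """Simplified score (assumption): spam rate / (spam rate + ham rate)."""
    spam_rate = c.spam_msgs / n_spam
    ham_rate = c.ham_msgs / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


def is_stopword_candidate(
    c: TokenCounts,
    n_spam: int,
    n_ham: int,
    min_doc_freq: float = 0.10,  # "at least 10% of messages"
    low: float = 0.4,            # lower bound of the neutral score band
    high: float = 0.6,           # upper bound of the neutral score band
) -> bool:
    """True if the token is both frequent and close to neutral."""
    frequent = doc_frequency(c, n_spam, n_ham) >= min_doc_freq
    neutral = low <= spam_probability(c, n_spam, n_ham) <= high
    return frequent and neutral


# "иногда": 7 of 8000 spam messages, 50 of 40000 ham messages (from the report).
inogda = TokenCounts(spam_msgs=7, ham_msgs=50)
print(doc_frequency(inogda, 8000, 40000))
print(spam_probability(inogda, 8000, 40000))
print(is_stopword_candidate(inogda, 8000, 40000))
```

Under this simplified score, "иногда" falls inside the neutral band but appears in far fewer than 10% of messages, so it would not qualify as a stopword by the rule above.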