This is what I believe: the words first need to be trimmed or separated into individual tokens, and the language has to be identified carefully so that the cut-offs are made in the right places.
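
To illustrate why segmentation matters, here is a minimal sketch. It assumes that the stopword pattern is tested against each whitespace-delimited token as a whole, which is how Giovanni describes the behaviour in the quoted reply below; it is not the actual Bayes.pm code. The helper name `classify` and the three-word list are hypothetical and only there for the example.

```perl
use strict;
use warnings;
use utf8;                                  # the Thai literals below are in the source
binmode(STDOUT, ':encoding(UTF-8)');

# Three stopwords mentioned in the thread; a real list is much longer.
my @stopwords   = ('ถูก', 'เลย', 'ทุก');
my $alternation = join '|', map { quotemeta } @stopwords;

# Assumption: a stopword only counts when it is the entire token.
my $whole_token = qr/\A(?:$alternation)\z/;
# For comparison: an unanchored pattern that also hits inside longer runs.
my $anywhere    = qr/(?:$alternation)/;

# Hypothetical helper: describe how a token relates to the stopword list.
sub classify {
    my ($token) = @_;
    return 'whole'     if $token =~ $whole_token;
    return 'substring' if $token =~ $anywhere;
    return 'none';
}

# The Thai run from the sample mail contains no spaces, so it is one token.
for my $token ('ทุก', 'ทุกวันพุธเล่นชนะรับเพิ่ม', 'theme') {
    printf "%s => %s\n", $token, classify($token);
}
```

Under that assumption, "ทุก" on its own is recognised as a stopword, while the unspaced run "ทุกวันพุธเล่นชนะรับเพิ่ม" only contains it as a substring, so the text would have to be split into words before the stopword list can apply.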

Jimmy

On Fri, Dec 29, 2023 at 5:16 PM <giova...@paclan.it> wrote:

> "ทุก" is not considered a word because it's part of the token
> "ทุกวันพุธเล่นชนะรับเพิ่ม".
> Words must be separated by spaces, otherwise we should skip the word
> "theme" just because "the" is in english stopword list.
> No idea if this makes sense for asian languages.
>
> Giovanni
>
> On 12/29/23 11:04, Jimmy wrote:
> >
> > The sample email and word list should contain at least these words.
> >
> > ถูก
> > เลย
> > ทุก
> >
> > Jimmy
> >
> > On Fri, Dec 29, 2023 at 4:47 PM <giova...@paclan.it> wrote:
> >
> > I do not speak Thai but I cannot see any word in the sample email that should match that list.
> > Which word do you think should match the regexp ?
> > Giovanni
> >
> > On 12/29/23 10:08, Jimmy wrote:
> > > You can use this word list
> > >
> > > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> > >
> > > Jimmy
> > >
> > > On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:
> > >
> > > To create the stopwords regexp I used the script I shared in a previous email and a list of words one per line.
> > > Could you share the list you are using ?
> > >
> > > Giovanni
> > >
> > > On 12/29/23 09:22, Jimmy wrote:
> > > > I use SpamAssassin 4.0.0 (2022-12-14)
> > > >
> > > > $ spamassassin -D --lint 2>&1 | grep bayes:
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> > > > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> > > > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> > > > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> > > > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> > > > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> > > > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> > > >
> > > > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> > > > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
> > > >
> > > > You can use "บาท" that was listed in regexp pattern but somehow I don't know why it not show skipped token in bayes.
> > > >
> > > > Jimmy
> > > >
> > > > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
> > > >
> > > > Config line produces a syntax error for me:
> > > > config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
> > > >
> > > > Could you share the word list in utf8 ?
> > > > I tried adding "บาท" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
> > > > Bayes stopwords languages must also be enabled using "bayes_stopword_languages" config keyword, by default only english is enabled.
> > > > Giovanni
> > > >
> > > > On 12/28/23 17:06, Jimmy wrote:
> > > > > bayes_stopword_th https://pastebin.pl/view/0838138d
> > > > > Sample mail https://pastebin.pl/view/e5a2c5b8
> > > > >
> > > > > Jimmy
> > > > >
> > > > > On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:
> > > > >
> > > > > Could you share a config line and a sample you are using ?
> > > > > Giovanni
> > > > >
> > > > > On 12/28/23 16:26, Jimmy wrote:
> > > > > > Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
> > > > > >
> > > > > > Jimmy
> > > > > >
> > > > > > On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:
> > > > > >
> > > > > > "spamassassin -D bayes" will tell you, you should see a line like:
> > > > > > bayes: skipped token 'from' because it's in stopword list for language 'en'
> > > > > >
> > > > > > Giovanni
> > > > > >
> > > > > > On 12/28/23 15:45, Jimmy wrote:
> > > > > > > The pattern has successfully passed the test script, but it needs to check whether Bayes learning will identify and possibly exclude the word from matching this pattern.
> > > > > > >
> > > > > > > Thank you.
> > > > > > >
> > > > > > > On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:
> > > > > > >
> > > > > > > On 12/28/23 12:59, Jimmy wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm seeking assistance in incorporating a stopword for Asian languages in Unicode. Although I possess comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful; the pattern fails to match or skips tokens in the newly added stopword list.
> > > > > > > >
> > > > > > > > I created the regex pattern using the following code:
> > > > > > > >
> > > > > > > > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> > > > > > > >
> > > > > > > > Afterward, I converted it to UTF-8 hex.
> > > > > > > >
> > > > > > > > I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
> > > > > > > >
> > > > > > > I have used Regexp::Trie to create Bayes stopwords in the past, code is similar to:
> > > > > > >
> > > > > > > ------------------------------------------------------------
> > > > > > > use strict;
> > > > > > > use warnings;
> > > > > > >
> > > > > > > use Encode;
> > > > > > > use Regexp::Trie;
> > > > > > >
> > > > > > > # Read one stopword per line from STDIN and build a trie-based regexp.
> > > > > > > my @input = <STDIN>;
> > > > > > > my $rt = Regexp::Trie->new;
> > > > > > > for my $w ( @input ) {
> > > > > > >   chomp($w);
> > > > > > >   $rt->add($w);
> > > > > > > }
> > > > > > > my $regexp = $rt->regexp;
> > > > > > >
> > > > > > > # Print ASCII characters as-is and every non-ASCII byte as a \xNN escape.
> > > > > > > my @reg = split //, $regexp;
> > > > > > > for my $c ( @reg ) {
> > > > > > >   my $char = $c;    # decode() may modify its argument, so keep a copy
> > > > > > >   eval { decode( 'utf8', $c, Encode::FB_CROAK ); 1 };
> > > > > > >   if( $@ ) {
> > > > > > >     print '\x' . sprintf("%x", ord($char));
> > > > > > >   } else {
> > > > > > >     print $char;
> > > > > > >   }
> > > > > > > }
> > > > > > > ------------------------------------------------------------
> > > > > > >
> > > > > > > Giovanni