My view is this: the text needs to be trimmed or segmented into words
first, and the language has to be determined carefully so that the word
boundaries are cut in the right places.
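
To illustrate, here is a minimal sketch. It assumes a token is skipped
only when the whole token equals a stopword; I have not confirmed that
this is exactly what Plugin/Bayes.pm does. The helper names is_stopword
and contains_stopword are made up for the example and are not part of
SpamAssassin.

```perl
use strict;
use warnings;
use utf8;                               # this source contains Thai literals
binmode(STDOUT, ':encoding(UTF-8)');

# Three words from the Thai stopword list discussed in this thread.
my @stopwords   = ('ถูก', 'เลย', 'ทุก');
my $alternation = join '|', map { quotemeta } @stopwords;

# Assumption: the stopword pattern must match the token from start to end.
my $whole_token = qr/\A(?:$alternation)\z/;
# For comparison: the same words matched anywhere inside a token.
my $anywhere    = qr/(?:$alternation)/;

# Hypothetical helper: true when the entire token is a stopword.
sub is_stopword {
    my ($token) = @_;
    return $token =~ $whole_token ? 1 : 0;
}

# Hypothetical helper: true when a stopword occurs somewhere in the token.
sub contains_stopword {
    my ($token) = @_;
    return $token =~ $anywhere ? 1 : 0;
}

for my $token ('ทุก', 'ทุกวันพุธเล่นชนะรับเพิ่ม') {
    printf "%s: whole-token=%d anywhere=%d\n",
        $token, is_stopword($token), contains_stopword($token);
}
```

The unsegmented token only matches in the "anywhere" form, which is the
same form that would wrongly skip "theme" because of "the". That is why
I think the text has to be split into words before the whole-token check.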

Jimmy

On Fri, Dec 29, 2023 at 5:16 PM <giova...@paclan.it> wrote:

> "ทุก" is not considered a word because it's part of the token
> "ทุกวันพุธเล่นชนะรับเพิ่ม".
> Words must be separated by spaces; otherwise we would skip the word
> "theme" just because "the" is in the English stopword list.
> I have no idea whether this makes sense for Asian languages.
>
>   Giovanni
>
> On 12/29/23 11:04, Jimmy wrote:
> >
> > The sample email and word list should contain at least these words.
> >
> > ถูก
> > เลย
> > ทุก
> >
> > Jimmy
> >
> > On Fri, Dec 29, 2023 at 4:47 PM <giova...@paclan.it> wrote:
> >
> >     I do not speak Thai but I cannot see any word in the sample email that should match that list.
> >     Which word do you think should match the regexp?
> >        Giovanni
> >
> >     On 12/29/23 10:08, Jimmy wrote:
> >      > You can use this word list
> >      >
> >      >
> >      > https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
> >      >
> >      > Jimmy
> >      >
> >      > On Fri, Dec 29, 2023 at 3:59 PM <giova...@paclan.it> wrote:
> >      >
> >      >     To create the stopwords regexp I used the script I shared in a previous email and a list of words, one per line.
> >      >     Could you share the list you are using?
> >      >
> >      >         Giovanni
> >      >
> >      >     On 12/29/23 09:22, Jimmy wrote:
> >      >      > I use SpamAssassin 4.0.0 (2022-12-14)
> >      >      >
> >      >      > $ spamassassin -D --lint 2>&1 | grep bayes:
> >      >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
> >      >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
> >      >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
> >      >      > Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
> >      >      > Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
> >      >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
> >      >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
> >      >      > Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
> >      >      > Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi
> >      >      >
> >      >      >
> >      >      > $ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
> >      >      > Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'
> >      >      >
> >      >      > You can test with "บาท", which is listed in the regexp pattern, but I don't know why Bayes does not report it as a skipped token.
> >      >      >
> >      >      > Jimmy
> >      >      >
> >      >      >
> >      >      > On Fri, Dec 29, 2023 at 2:59 PM <giova...@paclan.it> wrote:
> >      >      >
> >      >      >     Config line produces a syntax error for me:
> >      >      >     config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th
> >      >      >
> >      >      >     Could you share the word list in UTF-8?
> >      >      >     I tried adding "บาท" to https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt and it produces a working regexp.
> >      >      >     Bayes stopword languages must also be enabled using the "bayes_stopword_languages" config keyword; by default only English is enabled.
> >      >      >        Giovanni
> >      >      >
> >      >      >     On 12/28/23 17:06, Jimmy wrote:
> >      >      >      > bayes_stopword_th https://pastebin.pl/view/0838138d
> >      >      >      > Sample mail https://pastebin.pl/view/e5a2c5b8
> >      >      >      >
> >      >      >      > Jimmy
> >      >      >      >
> >      >      >      >
> >      >      >      > On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:
> >      >      >      >
> >      >      >      >     Could you share a config line and a sample you are using?
> >      >      >      >        Giovanni
> >      >      >      >
> >      >      >      >     On 12/28/23 16:26, Jimmy wrote:
> >      >      >      >      > Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why it is not being skipped. I suspect that if words are not separated by spaces, longer words may not match those patterns.
> >      >      >      >      >
> >      >      >      >      > Jimmy
> >      >      >      >      >
> >      >      >      >      > On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:
> >      >      >      >      >
> >      >      >      >      >     "spamassassin -D bayes" will tell you; you should see a line like:
> >      >      >      >      >     bayes: skipped token 'from' because it's in stopword list for language 'en'
> >      >      >      >      >
> >      >      >      >      >        Giovanni
> >      >      >      >      >
> >      >      >      >      >     On 12/28/23 15:45, Jimmy wrote:
> >      >      >      >      >      > The pattern has passed the test script, but I still need to check whether Bayes learning will identify and exclude words that match this pattern.
> >      >      >      >      >      >
> >      >      >      >      >      > Thank you.
> >      >      >      >      >      >
> >      >      >      >      >      >
> >      >      >      >      >      > On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:
> >      >      >      >      >      >
> >      >      >      >      >      >     On 12/28/23 12:59, Jimmy wrote:
> >      >      >      >      >      >      > Hi,
> >      >      >      >      >      >      >
> >      >      >      >      >      >      > I'm seeking assistance in incorporating stopwords for Asian languages in Unicode. Although I have comprehensive word lists, my attempts to generate a regex pattern and test it have been unsuccessful: the pattern fails to match or skip tokens in the newly added stopword list.
> >      >      >      >      >      >      >
> >      >      >      >      >      >      > I created the regex pattern using the following code:
> >      >      >      >      >      >      >
> >      >      >      >      >      >      > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >      >      >      >      >      >      >
> >      >      >      >      >      >      > Afterward, I converted it to UTF-8 hex.
> >      >      >      >      >      >      >
> >      >      >      >      >      >      > I'm wondering if there are any tools available to facilitate the creation of these regex patterns.
> >      >      >      >      >      >      >
> >      >      >      >      >      >     I have used Regexp::Trie to create Bayes stopwords in the past, code is similar to:
> >      >      >      >      >      >
> >      >      >      >      >      >     ------------------------------------------------------------
> >      >      >      >      >      >     use strict;
> >      >      >      >      >      >     use warnings;
> >      >      >      >      >      >
> >      >      >      >      >      >     use Encode;
> >      >      >      >      >      >     use Regexp::Trie;
> >      >      >      >      >      >
> >      >      >      >      >      >     # Read the stopwords from STDIN, one per line, as raw bytes.
> >      >      >      >      >      >     my @input = <STDIN>;
> >      >      >      >      >      >     my $rt = Regexp::Trie->new;
> >      >      >      >      >      >     for my $w ( @input ) {
> >      >      >      >      >      >         chomp($w);
> >      >      >      >      >      >         $rt->add($w);
> >      >      >      >      >      >     }
> >      >      >      >      >      >     my $regexp = $rt->regexp;
> >      >      >      >      >      >     # Walk the generated pattern one byte at a time.
> >      >      >      >      >      >     my @reg = split //, $regexp;
> >      >      >      >      >      >     for my $c ( @reg ) {
> >      >      >      >      >      >         my $char = $c;
> >      >      >      >      >      >         my $test;
> >      >      >      >      >      >         # A byte that is not valid UTF-8 on its own makes decode() die.
> >      >      >      >      >      >         eval { $test = decode( 'utf8', $c, Encode::FB_CROAK ); };
> >      >      >      >      >      >         if( $@ ) {
> >      >      >      >      >      >           # Print such bytes as \xNN escapes (single quotes keep the backslash).
> >      >      >      >      >      >           print '\x' . sprintf("%x", ord($c));
> >      >      >      >      >      >         } else {
> >      >      >      >      >      >           print $char;
> >      >      >      >      >      >         }
> >      >      >      >      >      >     }
> >      >      >      >      >      >
> >      >      >      >      >      >     ------------------------------------------------------------
> >      >      >      >      >      >
> >      >      >      >      >      >        Giovanni
> >      >      >      >      >      >
> >      >      >      >      >
> >      >      >      >
> >      >      >
> >      >
> >
>
>
