Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate
why the tokens are not being skipped. I suspect that when words are not
separated by spaces, the resulting longer tokens may not match those patterns.
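
To illustrate the suspicion, here is a minimal sketch, not the actual Bayes.pm code. It assumes that tokens are produced by splitting on whitespace and that the stopword pattern is anchored to the whole token; both are assumptions to verify against the plugin source. The stopword and sample text are hypothetical.

```perl
use strict;
use warnings;
use utf8;                      # the literals below are written in UTF-8
use Encode qw(encode);

# Hypothetical stopword and sample text, encoded to raw UTF-8 bytes.
my $stopword = encode('UTF-8', '的');
my $text     = encode('UTF-8', '我的书 is here');

# Assumption: the stopword pattern must match the entire token.
my $anchored = qr/^(?:\Q$stopword\E)$/;

# Assumption: tokens come from splitting the text on whitespace.
my @tokens = split ' ', $text;

# Tokens the anchored pattern would skip, versus tokens that merely
# contain the stopword somewhere inside a longer run of characters.
my @skipped  = grep { $_ =~ $anchored } @tokens;
my @contains = grep { index($_, $stopword) >= 0 } @tokens;

printf "tokens: %d, skipped: %d, contain stopword: %d\n",
    scalar @tokens, scalar @skipped, scalar @contains;
```

Under these assumptions the stopword sits inside a longer token and is never skipped, even though the pattern itself is correct.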

Jimmy

On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:

> "spamassassin -D bayes" will tell you; you should see a line like:
> bayes: skipped token 'from' because it's in stopword list for language 'en'
>
>   Giovanni
>
> On 12/28/23 15:45, Jimmy wrote:
> > The pattern passes the test script, but I still need to check whether
> > Bayes learning will recognize the words matching this pattern and
> > exclude them.
> >
> > Thank you.
> >
> >
> > On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:
> >
> >     On 12/28/23 12:59, Jimmy wrote:
> >      > Hi,
> >      >
> >      > I'm seeking assistance in adding stopwords for Asian languages
> >      > in Unicode. Although I have comprehensive word lists, my attempts
> >      > to generate a regex pattern and test it have been unsuccessful:
> >      > the pattern fails to match, and so fails to skip, tokens from the
> >      > newly added stopword list.
> >      >
> >      > I created the regex pattern using the following code:
> >      >
> >      > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >      >
> >      > Afterward, I converted it to UTF-8 hex.
> >      >
> >      > I'm wondering if there are any tools available to facilitate the
> >      > creation of these regex patterns.
> >      >
> >     I have used Regexp::Trie to create Bayes stopwords in the past; the
> >     code is similar to:
> >
> >     ------------------------------------------------------------
> >     use strict;
> >     use warnings;
> >
> >     use Encode;
> >     use Regexp::Trie;
> >
> >     my @input = <STDIN>;
> >     my $rt = Regexp::Trie->new;
> >     for my $w ( @input ) {
> >         chomp($w);
> >         $rt->add($w);
> >     }
> >     my $regexp = $rt->regexp;
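> >     # Walk the regexp byte by byte: bytes that are valid UTF-8 on their
> >     # own (ASCII) are printed as-is, all others as \xNN escapes.
> >     my @reg = split //, $regexp;
> >     for my $c ( @reg ) {
> >         my $char = $c;    # keep a copy; decode() may modify its argument
> >         my $test = eval { decode( 'utf8', $c, Encode::FB_CROAK ) };
> >         if( !defined $test ) {
> >           print '\x' . sprintf("%02x", ord($char));
> >         } else {
> >           print $char;
> >         }
> >     }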
> >
> >     ------------------------------------------------------------
> >
> >        Giovanni
> >
>
>
