The pattern now passes the test script, but I still need to check whether
Bayes learning will identify and exclude the words that match it.
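
For reference, a standalone check of this kind could look like the sketch
below. It is only an illustration under some assumptions: the helper names
escaped_pattern and matches_token are made up for this example, the three
sample words are placeholders, and it does not involve SpamAssassin or Bayes
at all. It only confirms that a byte-escaped pattern matches the UTF-8 bytes
of each word it was built from.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a byte-level alternation from a word list,
# writing every non-ASCII byte as a \xNN escape.
sub escaped_pattern {
    my @alts;
    for my $word (@_) {
        my $bytes = encode('UTF-8', $word);
        push @alts, join '', map {
            ord($_) > 0x7f ? sprintf('\x%02x', ord $_) : quotemeta($_)
        } split //, $bytes;
    }
    return '(?:' . join('|', @alts) . ')';
}

# Hypothetical helper: true if the UTF-8 bytes of $token match $pattern
# from start to end.
sub matches_token {
    my ($pattern, $token) = @_;
    my $bytes = encode('UTF-8', $token);
    return $bytes =~ /\A$pattern\z/ ? 1 : 0;
}

# Placeholder words (U+7684, U+306E and plain ASCII), not a real stopword list.
my @words   = ("\x{7684}", "\x{306e}", "the");
my $pattern = escaped_pattern(@words);
print "$pattern\n";

for my $word (@words) {
    my $status = matches_token($pattern, $word) ? 'ok' : 'MISSED';
    print "$status: ", encode('UTF-8', $word), "\n";
}
```

Any word reported as MISSED would point to a problem in the escaping step
rather than in Bayes itself.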

Thank you.


On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:

> On 12/28/23 12:59, Jimmy wrote:
> > Hi,
> >
> > I'm seeking assistance in incorporating a stopword for Asian languages
> > in Unicode. Although I possess comprehensive word lists, my attempts to
> > generate a regex pattern and test it have been unsuccessful; the pattern
> > fails to match or skips tokens in the newly added stopword list.
> >
> > I created the regex pattern using the following code:
> >
> > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >
> > Afterward, I converted it to UTF-8 hex.
> >
> > I'm wondering if there are any tools available to facilitate the
> > creation of these regex patterns.
> >
> I have used Regexp::Trie to create Bayes stopwords in the past, code is
> similar to:
>
> -----------------------------------------------------------------------------------------------------------
> use strict;
> use warnings;
>
> use Encode;
> use Regexp::Trie;
>
> my @input = <STDIN>;
> my $rt = Regexp::Trie->new;
> for my $w ( @input ) {
>    chomp($w);
>    $rt->add($w);
> }
> my $regexp = $rt->regexp;
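> # STDIN was read as raw bytes, so this splits the pattern into single bytes.
> my @reg = split //, $regexp;
> for my $c ( @reg ) {
>    my $char = $c;   # keep a copy: decode() with a CHECK value consumes $c
>    my $test;
>    # A lone byte from a multi-byte UTF-8 sequence fails to decode.
>    eval { $test = decode( 'utf8', $c, Encode::FB_CROAK ) };
>    if( $@ ) {
>      # Emit non-ASCII bytes as \xNN escapes.
>      print '\x' . sprintf("%02x", ord($c));
>    } else {
>      print $char;
>    }
> }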
>
> -----------------------------------------------------------------------------------------------------------
>
>   Giovanni
>
