Re: Bayes Stopword

2023-12-29 Thread Jimmy
I think the text first needs to be segmented into separate words, and the
language has to be identified carefully so that the cuts fall on the correct
word boundaries.

Jimmy
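
A minimal sketch of the segmentation idea mentioned above, assuming a greedy longest-match against a small dictionary. The function name `segment` and the three-word dictionary are hypothetical and only for illustration. This is not how SpamAssassin tokenizes mail, and real Thai segmentation needs a far larger dictionary or a dedicated library.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation.

    Characters that start no dictionary word become one-character tokens.
    """
    longest = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini dictionary: the first three words of the token quoted in the thread.
words = ["ทุก", "วัน", "พุธ"]
text = "".join(words)          # no spaces between the words, as in the sample mail
tokens = segment(text, set(words))
stopwords = {"ทุก"}
print([t for t in tokens if t in stopwords])
```

Once the text is cut into words, a stopword such as "ทุก" becomes a token of its own and can be compared against the list.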


Re: Bayes Stopword

2023-12-29 Thread giovanni

"ทุก" is not considered a word because it is part of the token
"ทุกวันพุธเล่นชนะรับเพิ่ม".
Words must be separated by spaces; otherwise we would have to skip the word "theme"
just because "the" is in the English stopword list.
I have no idea whether this makes sense for Asian languages.

 Giovanni
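
The behaviour described above can be sketched as follows. This is an illustration only, assuming that tokens are split on whitespace and that a stopword must match a whole token. The pattern and helper names are hypothetical and are not taken from SpamAssassin's code.

```python
import re

# Hypothetical stopword pattern, for illustration only.
STOPWORDS = re.compile(r"(?:the|ทุก)")


def split_tokens(text):
    """Split on whitespace only."""
    return text.split()


def skipped(tokens):
    """Return the tokens that match the stopword pattern as a whole."""
    return [t for t in tokens if STOPWORDS.fullmatch(t)]


english = split_tokens("the theme")
thai = split_tokens("ทุกวันพุธเล่นชนะรับเพิ่ม")

print(skipped(english))  # "the" is skipped, "theme" is kept
print(len(thai))         # the Thai text has no spaces, so it stays one token
print(skipped(thai))     # nothing is skipped
```

With whole-token matching, "theme" survives even though it starts with "the", and for the same reason the unsegmented Thai token never matches "ทุก".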


Re: Bayes Stopword

2023-12-29 Thread Jimmy
The sample email and word list should contain at least these words.

ถูก
เลย
ทุก

Jimmy


Re: Bayes Stopword

2023-12-29 Thread giovanni

I do not speak Thai, but I cannot see any word in the sample email that should
match that list.
Which word do you think should match the regexp?
 Giovanni


Re: Bayes Stopword

2023-12-29 Thread Jimmy
You can use this word list

https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt

Jimmy


Re: Bayes Stopword

2023-12-29 Thread giovanni

To create the stopwords regexp I used the script I shared in a previous email
and a list of words, one per line.
Could you share the list you are using?

  Giovanni


Re: Bayes Stopword

2023-12-29 Thread Jimmy
I use SpamAssassin 4.0.0 (2022-12-14)

$ spamassassin -D --lint 2>&1 | grep bayes:
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=en
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ja
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=dk
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=nl
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=de
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=es
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fi
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=fr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=it
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=no
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=ru
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=se
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=tr
Dec 29 15:17:56.920 [17420] dbg: bayes: stopword found lang=vi
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=ko
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=zh
Dec 29 15:17:56.921 [17420] dbg: bayes: stopword found lang=hi
Dec 29 15:17:58.019 [17420] dbg: bayes: stopwords for languages enabled: en th ru fr ja zh dk nl de es fi fr it no ru se tr vi ko zh hi


$ spamassassin -D bayes,learn < test.msg 2>&1 | grep "skipped token"
Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' because it's in stopword list for language 'en'

You can test with "บาท", which is listed in the regexp pattern, but I don't
know why it does not show up as a skipped token in the Bayes debug output.

Jimmy
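
To see at a glance which languages actually produce skipped tokens, the debug output can be grouped per language. The sketch below assumes the "skipped token" line format shown above. The helper name `skipped_by_language` and the sample lines are hypothetical.

```python
import re

# Matches the debug line format quoted in this thread.
SKIPPED = re.compile(
    r"bayes: skipped token '(.+)' because it's in stopword list "
    r"for language '(\w+)'"
)


def skipped_by_language(lines):
    """Group skipped tokens by the language of the stopword list that caught them."""
    found = {}
    for line in lines:
        match = SKIPPED.search(line)
        if match:
            found.setdefault(match.group(2), []).append(match.group(1))
    return found


sample = [
    "Dec 29 15:16:57.585 [17347] dbg: bayes: skipped token 'Email' "
    "because it's in stopword list for language 'en'",
    "Dec 29 15:17:56.919 [17420] dbg: bayes: stopword found lang=th",
]
result = skipped_by_language(sample)
print(result)
print("th" in result)  # False here: no Thai token was skipped
```

The script could be fed the output of the `spamassassin -D bayes,learn` command above; a language that never appears in the result has not matched any token.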



Re: Bayes Stopword

2023-12-28 Thread giovanni

The config line produces a syntax error for me:
config: failed to parse line in /etc/mail/spamassassin/local.cf (line 1): bayes_stopword_th

Could you share the word list in UTF-8?
I tried adding "บาท" to
https://raw.githubusercontent.com/stopwords-iso/stopwords-th/master/stopwords-th.txt
and it produces a working regexp.
Bayes stopword languages must also be enabled using the "bayes_stopword_languages"
config keyword; by default only English is enabled.
 Giovanni
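
For illustration, a local.cf fragment that enables Thai next to English might look like the sketch below. It assumes the syntax described in the SpamAssassin 4.0 configuration documentation (a space-separated language list, plus one `bayes_stopword_<lang>` line per custom language). The one-word pattern is a hypothetical placeholder that matches only "บาท" written as UTF-8 hex escapes; it is not the full list discussed in this thread.

```
# Enable stopword processing for English and Thai (only "en" is on by default).
bayes_stopword_languages en th

# Hypothetical one-word pattern: "บาท" as UTF-8 bytes in hex escapes.
bayes_stopword_th (?:\xe0\xb8\x9a\xe0\xb8\xb2\xe0\xb8\x97)
```

Running `spamassassin -D --lint` afterwards should list "th" among the enabled stopword languages, as in the debug output shown elsewhere in this thread.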

On 12/28/23 17:06, Jimmy wrote:

bayes_stopword_th https://pastebin.pl/view/0838138d 

Sample mail https://pastebin.pl/view/e5a2c5b8 


Jimmy


On Thu, Dec 28, 2023 at 10:59 PM <giova...@paclan.it> wrote:

Could you share a config line and a sample you are using ?
   Giovanni

On 12/28/23 16:26, Jimmy wrote:
 > Yes, I have done that, and I am also editing Plugin/Bayes.pm to
 > investigate why it is not being skipped. I suspect that if words are not
 > separated by spaces, longer words may not match those patterns.
 >
 > Jimmy
 >
 > On Thu, Dec 28, 2023 at 10:13 PM <giova...@paclan.it> wrote:
 >
 >     "spamassassin -D bayes" will tell you; you should see a line like:
 >     bayes: skipped token 'from' because it's in stopword list for language 'en'
 >
 >        Giovanni
 >
 >     On 12/28/23 15:45, Jimmy wrote:
 >      > The pattern has passed the test script, but I still need to check
 >      > whether Bayes learning will actually identify and skip the words
 >      > that match this pattern.
 >      >
 >      > Thank you.
 >      >
 >      >
 >      > On Thu, Dec 28, 2023 at 9:22 PM <giova...@paclan.it> wrote:
 >      >
 >      >     On 12/28/23 12:59, Jimmy wrote:
 >      >      > Hi,
 >      >      >
 >      >      > I'm seeking assistance in adding stopword lists for Asian
 >      >      > languages in Unicode. Although I have comprehensive word lists,
 >      >      > my attempts to generate a regex pattern and test it have been
 >      >      > unsuccessful; the pattern fails to match or skip the tokens in
 >      >      > the newly added stopword list.
 >      >      >
 >      >      > I created the regex pattern using the following code:
 >      >      >
 >      >      > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
 >      >      >
 >      >      > Afterward, I converted it to UTF-8 hex.
 >      >      >
 >      >      > I'm wondering if there are any tools available to
 >      >      > facilitate the creation of these regex patterns.
 >      >      >
 >      >     I have used Regexp::Trie to create Bayes stopwords in the
 >      >     past; the code is similar to:
 >      >     ---
 >      >     use strict;
 >      >     use warnings;
 >      >
 >      >     use Encode;
 >      >     use Regexp::Trie;
 >      >
 >      >     # Read the word list from standard input, one word per line.
 >      >     my @input = <STDIN>;
 >      >     my $rt = Regexp::Trie->new;
 >      >     for my $w ( @input ) {
 >      >         chomp($w);
 >      >         $rt->add($w);
 >      >     }
 >      >     my $regexp = $rt->regexp;
 >      >     # Bytes that decode as UTF-8 on their own are printed unchanged;
 >      >     # any other byte is printed as 'x' followed by its hex value.
 >      >     my @reg = split //, $regexp;
 >      >     for my $c ( @reg ) {
 >      >         my $char = $c;
 >      >         my $test;
 >      >         eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
 >      >         if( $@ ) {
 >      >           print 'x' . sprintf("%x", ord($c));
 >      >         } else {
 >      >           print $char;
 >      >         }
 >      >     }
 >      >     ---
 >      >
 >      >        Giovanni
 >      >
 >







Re: Bayes Stopword

2023-12-28 Thread Jimmy
bayes_stopword_th https://pastebin.pl/view/0838138d
Sample mail https://pastebin.pl/view/e5a2c5b8

Jimmy


On Thu, Dec 28, 2023 at 10:59 PM giova...@paclan.it wrote:

> Could you share a config line and a sample you are using ?
>   Giovanni
>
> On 12/28/23 16:26, Jimmy wrote:
> > Yes, I have done that, and I am also editing Plugin/Bayes.pm to
> investigate why it is not being skipped. I suspect that if words are not
> separated by spaces, longer words may not match those patterns.
> >
> > Jimmy
> >
> > On Thu, Dec 28, 2023 at 10:13 PM giova...@paclan.it wrote:
> >
> > "spamassassin -D bayes" will tell you, you should see a line like:
> > bayes: skipped token 'from' because it's in stopword list for
> language 'en'
> >
> >Giovanni
> >
> > On 12/28/23 15:45, Jimmy wrote:
> >  > The pattern has successfully passed the test script, but it needs
> to check whether Bayes learning will identify and possibly exclude the word
> from matching this pattern.
> >  >
> >  > Thank you.
> >  >
> >  >
> >  > On Thu, Dec 28, 2023 at 9:22 PM giova...@paclan.it wrote:
> >  >
> >  > On 12/28/23 12:59, Jimmy wrote:
> >  >  > Hi,
> >  >  >
> >  >  > I'm seeking assistance in incorporating a stopword for
> Asian languages in Unicode. Although I possess comprehensive word lists, my
> attempts to generate a regex pattern and test it have been unsuccessful;
> the pattern fails to match or skips tokens in the newly added stopword list.
> >  >  >
> >  >  > I created the regex pattern using the following code:
> >  >  >
> >  >  > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >  >  >
> >  >  > Afterward, I converted it to UTF-8 hex.
> >  >  >
> >  >  > I'm wondering if there are any tools available to
> facilitate the creation of these regex patterns.
> >  >  >
> >  > I have used Regexp::Trie to create Bayes stopwords in the
> past, code is similar to:
> >  >
>  
> ---
> >  > use strict;
> >  > use warnings;
> >  >
> >  > use Encode;
> >  > use Regexp::Trie;
> >  >
> >  > my @input = <STDIN>;
> >  > my $rt = Regexp::Trie->new;
> >  > for my $w ( @input ) {
> >  > chomp($w);
> >  > $rt->add($w);
> >  > }
> >  > my $regexp = $rt->regexp;
> >  > my @reg = split //, $regexp;
> >  > for my $c ( @reg ) {
> >  > my $char = $c;
> >  > my $test;
> >  > eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
> >  > if( $@ ) {
> >  >   print 'x' . sprintf("%x", ord($c));
> >  > } else {
> >  >   print $char;
> >  > }
> >  > }
> >  >
>  
> ---
> >  >
> >  >Giovanni
> >  >
> >
>
>


Re: Bayes Stopword

2023-12-28 Thread giovanni

Could you share the config line and a sample message you are using?
 Giovanni

On 12/28/23 16:26, Jimmy wrote:

Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate why 
it is not being skipped. I suspect that if words are not separated by spaces, 
longer words may not match those patterns.

Jimmy

On Thu, Dec 28, 2023 at 10:13 PM giova...@paclan.it wrote:

"spamassassin -D bayes" will tell you, you should see a line like:
bayes: skipped token 'from' because it's in stopword list for language 'en'

   Giovanni

On 12/28/23 15:45, Jimmy wrote:
 > The pattern has successfully passed the test script, but it needs to 
check whether Bayes learning will identify and possibly exclude the word from 
matching this pattern.
 >
 > Thank you.
 >
 >
 > On Thu, Dec 28, 2023 at 9:22 PM giova...@paclan.it wrote:
 >
 >     On 12/28/23 12:59, Jimmy wrote:
 >      > Hi,
 >      >
 >      > I'm seeking assistance in incorporating a stopword for Asian 
languages in Unicode. Although I possess comprehensive word lists, my attempts to 
generate a regex pattern and test it have been unsuccessful; the pattern fails to 
match or skips tokens in the newly added stopword list.
 >      >
 >      > I created the regex pattern using the following code:
 >      >
 >      > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
 >      >
 >      > Afterward, I converted it to UTF-8 hex.
 >      >
 >      > I'm wondering if there are any tools available to facilitate the 
creation of these regex patterns.
 >      >
 >     I have used Regexp::Trie to create Bayes stopwords in the past, code 
is similar to:
 >     
---
 >     use strict;
 >     use warnings;
 >
 >     use Encode;
 >     use Regexp::Trie;
 >
 >     my @input = <STDIN>;
 >     my $rt = Regexp::Trie->new;
 >     for my $w ( @input ) {
 >         chomp($w);
 >         $rt->add($w);
 >     }
 >     my $regexp = $rt->regexp;
 >     my @reg = split //, $regexp;
 >     for my $c ( @reg ) {
 >         my $char = $c;
 >         my $test;
 >         eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
 >         if( $@ ) {
 >           print 'x' . sprintf("%x", ord($c));
 >         } else {
 >           print $char;
 >         }
 >     }
 >     
---
 >
 >        Giovanni
 >







Re: Bayes Stopword

2023-12-28 Thread Jimmy
Yes, I have done that, and I am also editing Plugin/Bayes.pm to investigate
why the words are not being skipped. I suspect that when words are not
separated by spaces, the resulting longer tokens do not match those patterns.

Jimmy
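
To illustrate that suspicion, here is a minimal sketch. It assumes that tokens
are produced by splitting on whitespace and that a stopword pattern has to match
a whole token; both are assumptions made for illustration, not a description of
what Plugin/Bayes.pm actually does. The pattern and the skipped_tokens helper
are hypothetical.

```perl
use strict;
use warnings;
use utf8;                       # the Thai literals below are UTF-8 source text
use Encode qw(encode);

# Hypothetical stopword pattern: "the" plus the UTF-8 bytes of the Thai word
# "ทุก", anchored so that it must match a whole token (an assumption).
my $stopword_re = qr/^(?:the|\xe0\xb8\x97\xe0\xb8\xb8\xe0\xb8\x81)$/;

# Split the text on whitespace, encode each token as UTF-8 bytes, and return
# the tokens that the pattern would skip.
sub skipped_tokens {
    my ($text) = @_;
    my @tokens = map { encode('UTF-8', $_) } split ' ', $text;
    return grep { $_ =~ $stopword_re } @tokens;
}

for my $text ("the theme", "ทุก วันพุธ", "ทุกวันพุธ") {
    my $count = () = skipped_tokens($text);
    print "$count token(s) skipped\n";
}
```

In this sketch "the" is skipped but "theme" is not, and the Thai word is only
skipped when a space separates it from the following text.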

On Thu, Dec 28, 2023 at 10:13 PM giova...@paclan.it wrote:

> "spamassassin -D bayes" will tell you, you should see a line like:
> bayes: skipped token 'from' because it's in stopword list for language 'en'
>
>   Giovanni
>
> On 12/28/23 15:45, Jimmy wrote:
> > The pattern has successfully passed the test script, but it needs to
> check whether Bayes learning will identify and possibly exclude the word
> from matching this pattern.
> >
> > Thank you.
> >
> >
> > On Thu, Dec 28, 2023 at 9:22 PM giova...@paclan.it wrote:
> >
> > On 12/28/23 12:59, Jimmy wrote:
> >  > Hi,
> >  >
> >  > I'm seeking assistance in incorporating a stopword for Asian
> languages in Unicode. Although I possess comprehensive word lists, my
> attempts to generate a regex pattern and test it have been unsuccessful;
> the pattern fails to match or skips tokens in the newly added stopword list.
> >  >
> >  > I created the regex pattern using the following code:
> >  >
> >  > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >  >
> >  > Afterward, I converted it to UTF-8 hex.
> >  >
> >  > I'm wondering if there are any tools available to facilitate the
> creation of these regex patterns.
> >  >
> > I have used Regexp::Trie to create Bayes stopwords in the past, code
> is similar to:
> >
>  
> ---
> > use strict;
> > use warnings;
> >
> > use Encode;
> > use Regexp::Trie;
> >
> > my @input = <STDIN>;
> > my $rt = Regexp::Trie->new;
> > for my $w ( @input ) {
> > chomp($w);
> > $rt->add($w);
> > }
> > my $regexp = $rt->regexp;
> > my @reg = split //, $regexp;
> > for my $c ( @reg ) {
> > my $char = $c;
> > my $test;
> > eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
> > if( $@ ) {
> >   print 'x' . sprintf("%x", ord($c));
> > } else {
> >   print $char;
> > }
> > }
> >
>  
> ---
> >
> >Giovanni
> >
>
>


Re: Bayes Stopword

2023-12-28 Thread giovanni

"spamassassin -D bayes" will tell you; you should see a line like:
bayes: skipped token 'from' because it's in stopword list for language 'en'

 Giovanni
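
A minimal sketch of picking such lines out of the debug output, assuming each
line has exactly the format quoted above. The parse_skipped helper and the
sample lines are hypothetical and not part of SpamAssassin.

```perl
use strict;
use warnings;

# Hypothetical helper: extract (token, language) from a debug line of the
# form quoted above; returns an empty list for any other line.
sub parse_skipped {
    my ($line) = @_;
    return $line =~ /bayes: skipped token '(.*)' because it's in stopword list for language '([^']+)'/
        ? ($1, $2)
        : ();
}

# Sample input: one line in the quoted format and one line that does not match.
my @lines = (
    "bayes: skipped token 'from' because it's in stopword list for language 'en'",
    "a line that does not match",
);

for my $line (@lines) {
    my ($token, $lang) = parse_skipped($line) or next;
    print "skipped '$token' ($lang)\n";
}
```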

On 12/28/23 15:45, Jimmy wrote:

The pattern has successfully passed the test script, but it needs to check 
whether Bayes learning will identify and possibly exclude the word from 
matching this pattern.

Thank you.


On Thu, Dec 28, 2023 at 9:22 PM giova...@paclan.it wrote:

On 12/28/23 12:59, Jimmy wrote:
 > Hi,
 >
 > I'm seeking assistance in incorporating a stopword for Asian languages 
in Unicode. Although I possess comprehensive word lists, my attempts to generate a 
regex pattern and test it have been unsuccessful; the pattern fails to match or 
skips tokens in the newly added stopword list.
 >
 > I created the regex pattern using the following code:
 >
 > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
 >
 > Afterward, I converted it to UTF-8 hex.
 >
 > I'm wondering if there are any tools available to facilitate the 
creation of these regex patterns.
 >
I have used Regexp::Trie to create Bayes stopwords in the past, code is 
similar to:

---
use strict;
use warnings;

use Encode;
use Regexp::Trie;

my @input = <STDIN>;
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
    chomp($w);
    $rt->add($w);
}
my $regexp = $rt->regexp;
my @reg = split //, $regexp;
for my $c ( @reg ) {
    my $char = $c;
    my $test;
    eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
    if( $@ ) {
      print 'x' . sprintf("%x", ord($c));
    } else {
      print $char;
    }
}

---

   Giovanni







Re: Bayes Stopword

2023-12-28 Thread Jimmy
The pattern passes the test script, but I still need to check whether Bayes
learning will recognize words that match this pattern and exclude them.

Thank you.


On Thu, Dec 28, 2023 at 9:22 PM giova...@paclan.it wrote:

> On 12/28/23 12:59, Jimmy wrote:
> > Hi,
> >
> > I'm seeking assistance in incorporating a stopword for Asian languages
> in Unicode. Although I possess comprehensive word lists, my attempts to
> generate a regex pattern and test it have been unsuccessful; the pattern
> fails to match or skips tokens in the newly added stopword list.
> >
> > I created the regex pattern using the following code:
> >
> > Regexp::Assemble->new->add(@words)->reduce(0)->as_string
> >
> > Afterward, I converted it to UTF-8 hex.
> >
> > I'm wondering if there are any tools available to facilitate the
> creation of these regex patterns.
> >
> I have used Regexp::Trie to create Bayes stopwords in the past, code is
> similar to:
>
> ---
> use strict;
> use warnings;
>
> use Encode;
> use Regexp::Trie;
>
> my @input = <STDIN>;
> my $rt = Regexp::Trie->new;
> for my $w ( @input ) {
>    chomp($w);
>    $rt->add($w);
> }
> my $regexp = $rt->regexp;
> my @reg = split //, $regexp;
> for my $c ( @reg ) {
>    my $char = $c;
>    my $test;
>    eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
>    if( $@ ) {
>      print 'x' . sprintf("%x", ord($c));
>    } else {
>      print $char;
>    }
> }
>
> ---
>
>   Giovanni
>


Re: Bayes Stopword

2023-12-28 Thread giovanni

On 12/28/23 12:59, Jimmy wrote:

> Hi,
>
> I'm seeking assistance in incorporating a stopword for Asian languages in
> Unicode. Although I possess comprehensive word lists, my attempts to generate a
> regex pattern and test it have been unsuccessful; the pattern fails to match or
> skips tokens in the newly added stopword list.
>
> I created the regex pattern using the following code:
>
> Regexp::Assemble->new->add(@words)->reduce(0)->as_string
>
> Afterward, I converted it to UTF-8 hex.
>
> I'm wondering if there are any tools available to facilitate the creation of
> these regex patterns.


I have used Regexp::Trie to create Bayes stopwords in the past; the code is
similar to:
---
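use strict;
use warnings;

use Encode;
use Regexp::Trie;

# Read the word list from standard input, one word per line.
my @input = <STDIN>;
my $rt = Regexp::Trie->new;
for my $w ( @input ) {
  chomp($w);
  $rt->add($w);
}
my $regexp = $rt->regexp;
# Walk the generated pattern one byte at a time.
my @reg = split //, $regexp;
for my $c ( @reg ) {
  my $char = $c;   # keep a copy: decode() may modify $c in place
  my $test;
  eval "\$test = decode( 'utf8', \$c, Encode::FB_CROAK )";
  if( $@ ) {
    # A byte that is not valid UTF-8 on its own: print it as a \xNN escape.
    print '\x' . sprintf("%x", ord($char));
  } else {
    print $char;
  }
}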
---

 Giovanni
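
For a single word, the conversion to UTF-8 hex mentioned above can be sketched
as follows. This is only an illustration using the core Encode module; the
utf8_hex_escape name is hypothetical, and the sample word is one of the Thai
stopwords discussed in this thread.

```perl
use strict;
use warnings;
use utf8;                       # the sample word below is UTF-8 source text
use Encode qw(encode);

# Hypothetical helper: encode a word as UTF-8 and replace every non-ASCII
# byte with a \xNN escape, leaving ASCII bytes unchanged.
sub utf8_hex_escape {
    my ($word) = @_;
    my $bytes = encode('UTF-8', $word);
    $bytes =~ s/([^\x00-\x7f])/sprintf('\\x%02x', ord($1))/ge;
    return $bytes;
}

my $escaped = utf8_hex_escape("ทุก");
print "$escaped\n";

# Used as a pattern, the escaped text matches the raw UTF-8 bytes of the word.
my $raw = encode('UTF-8', "ทุก");
print "pattern matches raw bytes\n" if $raw =~ /^(?:$escaped)$/;
```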

