Bayes and hyphens

Amir Caspi Fri, 30 Mar 2018 12:09:35 -0700

Hi all,

Does Bayes tokenize on word boundaries, and hence ignore hyphens, or does it
keep them inside tokens?  I've seen a lot of spam lately that inserts random
hyphens between key spammy words (like "economic-crisis"), presumably in an
attempt to bypass word filters and/or Bayes.  So would word1-word2 get
tokenized as a single item or as two words?


If hyphens are currently included, then perhaps Bayes should be updated to 
ignore hyphens and/or tokenize at word boundaries?
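
To make the question concrete, here is a minimal Python sketch of the two
behaviours I have in mind.  It is only an illustration, not SpamAssassin's
actual tokenizer (I don't know how that one behaves, hence the question), and
both function names are made up for this example.

```python
import re

# Hypothetical helpers for illustration only; neither is SpamAssassin code.

PUNCT = ".,;:!?\"'()"

def tokenize_keep_hyphens(text):
    """Split on whitespace and trim surrounding punctuation.

    Hyphens inside a word are kept, so "economic-crisis" stays one token.
    """
    tokens = (raw.strip(PUNCT) for raw in text.lower().split())
    return [tok for tok in tokens if tok]

def tokenize_word_boundaries(text):
    """Keep only runs of letters and digits.

    A hyphen acts as a separator, so "economic-crisis" becomes two tokens.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "Survive the economic-crisis today!"
print(tokenize_keep_hyphens(sample))
print(tokenize_word_boundaries(sample))
```

With the first approach the hyphenated pair is a token Bayes has probably
never seen before; with the second, the two spammy words are each matched
against what has already been learned.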

Cheers.

--- Amir
