Hi all,

Does Bayes tokenize on word boundaries, and therefore ignore hyphens, or does it keep them as part of the token? I've seen a lot of spam lately that inserts random hyphens between key spammy words (like "economic-crisis"), presumably in an attempt to bypass word filters and/or Bayes. So would word1-word2 be tokenized as a single token or as two separate words?
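
To make the question concrete, here is a minimal Python sketch of the two behaviours I mean. It is only an illustration and not the actual Bayes tokenizer, whose behaviour is exactly what I'm asking about; the function names are made up for this example.

```python
import re

def tokenize_whitespace(text):
    # Hypothetical behaviour A: split only on whitespace,
    # so "economic-crisis" survives as one token.
    return text.lower().split()

def tokenize_word_boundaries(text):
    # Hypothetical behaviour B: keep only runs of letters and digits,
    # so a hyphen acts as a separator between two tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "Beat the economic-crisis today"
print(tokenize_whitespace(sample))
print(tokenize_word_boundaries(sample))
```

With behaviour A the hyphenated pair would be learned as a new, separate token; with behaviour B it would map onto the existing "economic" and "crisis" tokens.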

If hyphens are currently kept as part of tokens, then perhaps Bayes should be updated to ignore them and/or tokenize at word boundaries?

Cheers,
Amir