On Sat, 28 May 2016 15:37:21 -0400
Bill Cole wrote:

> More importantly (IMHO) they aren't designed to collide with existing 
> common tokens and be added back into messages that may already contain 
> those tokens, in order to influence Bayesian classification.
> 
> There is sound statistical theory consistent with empirical evidence 
> underpinning the Bayes classifier implementation in SA. While there
> can be legitimate critiques of the SA implementation specifically
> and, more generally, of how well email word frequency fits Bayes'
> Theorem, injecting a pile of new derivative meta-tokens based on
> preconceived notions of "concepts" into the Bayesian analysis
> invalidates the
> assumption of what the input for Naive Bayes analysis is:
> *independent* features. The "concepts" approach adds words that are
> *dependent* on the presence of other words in the document and to
> make it worse, those dependent words may already exist in some
> pristine messages. It unmoors the SA Bayes implementation from any
> theoretical grounding, converting its complex math from statistical
> analysis into arbitrary numerology.
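
(For reference, the independence assumption at issue is that token
probabilities factorize given the class, i.e. roughly

  P(w_1, ..., w_n | spam) = P(w_1 | spam) * ... * P(w_n | spam)

so correlated meta-tokens do violate it; the question is whether that
matters in practice.)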

Statistical filters rest on some statistical theory combined with
pragmatic kludges and assumptions. Practical filters have been
developed based on what's been found to work, not on what's more
statistically correct.

Bayes already creates multiple, mutually dependent tokens from the
same information, most notably the original-case and lowercased
variants of words in the body.

I don't see a huge difference between

  "Bill Cole" tokenizing as {Bill, bill, Cole, cole}

and

  "v1agra, ciali5"  tokenizing as {v1agra, ciali5, eddrug}

The only way to find out whether it works is to try it.
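
As a rough sketch, such a "concepts" pass might look something like
this (hypothetical Perl, not anything in SA; the rules and the
"eddrug" name are only illustrations):

  # Each rule maps spelling variants of one concept onto a shared
  # meta-token, which is appended alongside the normal tokens rather
  # than replacing them.
  my @concepts = (
      [ qr/\bv[i1]agra\b/i, 'eddrug' ],
      [ qr/\bciali[s5]\b/i, 'eddrug' ],
  );

  sub tokenize_with_concepts {
      my ($body) = @_;
      my @tokens = split /\s+/, $body;  # stand-in for the real tokenizer
      for my $rule (@concepts) {
          my ($re, $meta) = @$rule;
          push @tokens, $meta if $body =~ $re;
      }
      return @tokens;
  }

  # tokenize_with_concepts("v1agra, ciali5") returns the ordinary word
  # tokens plus one "eddrug" meta-token per rule that fired.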

I think the OP is probably underselling it, in that it could be used to
extract information that normal tokenization can't get, for example:

/%.off/i

/Symbol:/i, /Date:/i, /Price:/i ...

/^Barrister/i
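
Those are exactly the cross-token patterns a per-word tokenizer loses:
"50% off" splits into fragments that individually say little, the
Symbol:/Date:/Price: template is characteristic of pump-and-dump stock
spam, and a line starting with "Barrister" is the classic advance-fee
greeting. Extending the sketch above (meta-token names invented here):

  push @concepts,
      [ qr/%.off/i,       'pct_off'   ],  # discount wording, "50% off"
      [ qr/\bSymbol:/i,   'stock_hdr' ],  # stock-tip template fields
      [ qr/\bPrice:/i,    'stock_hdr' ],
      [ qr/^Barrister/im, 'barrister' ];  # 419 greeting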



The main problem is that you'd need a lot of rules to make a substantial
difference.
