On 30 May 2016, at 18:25, Dianne Skoll wrote:

On Mon, 30 May 2016 17:45:52 -0400
"Bill Cole" <sausers-20150...@billmail.scconsult.com> wrote:

So you could have 'sex' and 'meds' and 'watches' tallied up into
frequency counts that sum up natural (word) and synthetic (concept)
occurrences, not just as incompatible types of input feature but as
a conflation of incompatible features.

That is easy to patch by giving "concepts" a separate namespace.  You
could do that by picking a character that can't be in a normal token and
using something like:  concept*meds, concept*sex, etc. as tokens.

Yes, but I'd still be reluctant to have that namespace blended directly into 1-word Bayes, because those "concepts" are qualitatively different: inherently much more complex to measure than words. Robotic semantic analysis hasn't reached the point where an unremarkable machine can decide whether a message is porn or a discussion of current political issues, and I would not hazard a guess as to which of those concepts is more likely to indicate spam or ham in email these days. Any old mail server can of course tell whether the word 'Carolina' is present in a message, and that word probably distributes quite disproportionately towards ham.
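
For what it's worth, the prefixing itself is trivial; a rough, untested Python sketch (the function and variable names are just for illustration):

    # Rough sketch: prefix synthetic "concept" hits so they can never
    # collide with natural word tokens; '*' is assumed never to occur
    # inside a normal word token.
    def tokenize(body_words, concept_hits):
        tokens = list(body_words)                         # natural word tokens
        tokens += ["concept*" + c for c in concept_hits]  # synthetic tokens
        return tokens

    # tokenize(["cheap", "meds", "online"], ["meds", "sex"])
    # -> ['cheap', 'meds', 'online', 'concept*meds', 'concept*sex']

My objection isn't to the mechanics but to letting two very different kinds of evidence feed one frequency table.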

FWIW, I have roughly no free time for anything between work and
family demands, but if I did, I would most like to build a blind
fixed-length tokenization Bayes classifier: just slice up a message
into all of its n-byte sequences (so that a message of byte length x
would yield x-(n-1) tokens) and use those as inputs instead of
words.

I think that could be very effective with (as you said) plenty of
training.  I think there *may* be slight justification for
canonicalizing text parts into utf-8 first; while you are losing
information, it's hard to see how ζ‰‹ζœΊθ‰²ζƒ… should be treated
differently depending on the character encoding.

Well, I've not thought it through deeply, but one way to evade the charset issue might be to just decode any Base64 or QP transfer encoding (which can be path-dependent rather than a function of the sender or content) to get 8-bit bytes, and then use 6-byte tokens as if it were all 1-byte chars. UCS-4 messages would be a wreck, but a pair of non-ASCII chars in UTF-8 would show up cleanly in exactly one token, surrounded by an aura of 10 semi-junk tokens that partially overlap it, in a manner that might effectively wash itself out. Or go to 12-byte tokens and get the same effect with UCS-4. Or 3-byte tokens: screw 32-bit charsets, screw the encoding semantics of UTF-8, just have 16.8 million possible 24-bit tokens and see how they distribute.

It seems to me that this is almost the ultimate test for Naive Bayes text analysis: break away from the idea that the input features have any innate meaning at all, and let them be pure proxies for whatever complex larger patterns give rise to them.
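
To make the slicing concrete, a rough, untested Python sketch of what I have in mind (function names are just for illustration; body is the raw MIME part payload as bytes):

    import base64, quopri

    def decode_transfer_encoding(body, cte):
        # Undo only the path-dependent transfer encoding; leave the
        # character encoding alone and treat the result as raw bytes.
        if cte == "base64":
            return base64.b64decode(body)
        if cte == "quoted-printable":
            return quopri.decodestring(body)
        return body

    def byte_ngrams(raw, n=3):
        # Slide an n-byte window over the decoded body: a body of
        # byte length x yields x-(n-1) tokens, drawn from at most
        # 2**(8*n) possible values (16.8 million when n=3).
        return [raw[i:i+n] for i in range(len(raw) - n + 1)]

    # byte_ngrams(b"viagra") -> [b'via', b'iag', b'agr', b'gra']

Those byte strings would then feed the Bayes token database exactly the way words do now; the classifier neither knows nor cares that they carry no meaning on their own.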

Oh, and did I mention that the probability underlying Bayes' Theorem has competing "interpretations" in much the same way that Heisenberg's Uncertainty Principle and quantum superposition do? 24-bit tokens could settle the dispute...
