On 30 May 2016, at 18:25, Dianne Skoll wrote:

On Mon, 30 May 2016 17:45:52 -0400
"Bill Cole" <sausers-20150...@billmail.scconsult.com> wrote:

So you could have 'sex' and 'meds' and 'watches' tallied up into
frequency counts that sum up natural (word) and synthetic (concept)
occurrences, not just as incompatible types of input feature but as
a conflation of incompatible features.

That is easy to patch by giving "concepts" a separate namespace.  You
could do that by picking a character that can't be in a normal token and
using something like:  concept*meds, concept*sex, etc. as tokens.

Yes, but I'd still be reluctant to have that namespace blended directly into 1-word Bayes, because those "concepts" are qualitatively different: inherently much more complex to measure than words. Robotic semantic analysis hasn't reached the point where an unremarkable machine can decide whether a message is porn or a discussion of current political issues, and I would not hazard a guess as to which of those concepts is more likely to indicate spam or ham in email these days. Any old mail server can of course tell whether the word 'Carolina' is present in a message, and that word probably distributes quite disproportionately towards ham.
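
For what it's worth, the prefixing itself is trivial; a rough, untested Python sketch (the function and variable names are just for illustration):

    # Rough sketch: prefix synthetic "concept" hits so they can never
    # collide with natural word tokens; '*' is assumed never to occur
    # inside a normal word token.
    def tokenize(body_words, concept_hits):
        tokens = list(body_words)                         # natural word tokens
        tokens += ["concept*" + c for c in concept_hits]  # synthetic tokens
        return tokens

    # tokenize(["cheap", "meds", "online"], ["meds", "sex"])
    # -> ['cheap', 'meds', 'online', 'concept*meds', 'concept*sex']

My objection isn't to the mechanics but to letting two very different kinds of evidence feed one frequency table.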

FWIW, I have roughly no free time for anything between work and
family demands, but if I did, I would most like to build a blind
fixed-length tokenization Bayes classifier: just slice up a message
into all of its n-byte sequences (so that a message of byte length x
would yield x-(n-1) tokens) and use those as inputs instead of
words.

I think that could be very effective with (as you said) plenty of
training.  I think there *may* be slight justification for
canonicalizing text parts into utf-8 first; while you are losing
information, it's hard to see how ζ‰‹ζœΊθ‰²ζƒ… should be treated
differently depending on the character encoding.

Well, I've not thought it through deeply, but one way to evade the charset issue might be to just decode any Base64 or QP transfer encoding (which can be path-dependent rather than a function of the sender or content) to get 8-bit bytes, and then use 6-byte tokens as if it were all 1-byte chars. UCS-4 messages would be a wreck, but a pair of non-ASCII chars in UTF-8 would show up cleanly in exactly one token, surrounded by an aura of 10 semi-junk tokens that partially overlap it, in a manner that might effectively wash itself out. Or go to 12-byte tokens and get the same effect with UCS-4. Or 3-byte tokens: screw 32-bit charsets, screw the encoding semantics of UTF-8, just have 16.8 million possible 24-bit tokens and see how they distribute.

It seems to me that this is almost the ultimate test for Naive Bayes text analysis: break away from the idea that the input features have any innate meaning at all, and let them be pure proxies for whatever complex larger patterns give rise to them.
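
To make the slicing concrete, a rough, untested Python sketch of what I have in mind (function names are just for illustration; body is the raw MIME part payload as bytes):

    import base64, quopri

    def decode_transfer_encoding(body, cte):
        # Undo only the path-dependent transfer encoding; leave the
        # character encoding alone and treat the result as raw bytes.
        if cte == "base64":
            return base64.b64decode(body)
        if cte == "quoted-printable":
            return quopri.decodestring(body)
        return body

    def byte_ngrams(raw, n=3):
        # Slide an n-byte window over the decoded body: a body of
        # byte length x yields x-(n-1) tokens, drawn from at most
        # 2**(8*n) possible values (16.8 million when n=3).
        return [raw[i:i+n] for i in range(len(raw) - n + 1)]

    # byte_ngrams(b"viagra") -> [b'via', b'iag', b'agr', b'gra']

Those byte strings would then feed the Bayes token database exactly the way words do now; the classifier neither knows nor cares that they carry no meaning on their own.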

Oh, and did I mention that the probability underlying Bayes' Theorem has competing "interpretations" in much the same way that Heisenberg's Uncertainty Principle and quantum superposition do? 24-bit tokens could settle the dispute...
