>From: RW <rwmailli...@googlemail.com>
>Sent: Tuesday, May 31, 2016 5:20 PM
>To: users@spamassassin.apache.org
>Subject: Re: SA Concepts - plugin for email semantics
>On Tue, 31 May 2016 15:20:56 -0400
>Bill Cole wrote:
>> On 29 May 2016, at 11:07, RW wrote:
>>
>> > Statistical filters are based on some statistical theory combined
>> > with pragmatic kludges and assumptions. Practical filters have been
>> > developed based on what's been found to work, not on what's more
>> > statistically correct.
>>
>> I'm not aware of any hard evidence that the SA Bayes pragmatic
>> kludges and assumptions perform better or worse than an
>> implementation that used fewer or different ones.

>It's not specific to SA, for example there's no sound basis for
>assigning token probability to tokens that have zero ham or spam
>counts, many classifications turn on completely made-up probabilities.
>There's also no way of assigning meaningful probabilities to tokens
>that enter or re-enter the database while it's mature without making
>an assumption about the current spam/ham training ratio.

>The assumption that tokens are independent was never reasonable in the
>first place, there's plenty of natural duplication e.g. ip address and
>RDNS, and strong correlations between important tokens. There's also a
>lot of inadvertent duplication for example from metadata headers that
>are not primarily intended for Bayes.

>I don't think concepts is a particularly good idea, but I don't like to
>see someone's work dismissed on such paper-thin theoretical grounds.

>> > I think the OP is probably underselling it, in that it could be used to
>> > extract information that normal tokenization can't get, for example:
>> > ...
>> > The main problem is that you'd need a lot of rules to make a
>> > substantial difference.
>>
>> So: re-invent SpamAssassin v1 but without rule scores, using Bayes to
>> do half-assed dynamic score adjustment per site with rules that will
>> either evolve constantly or grow stale?
>I was thinking that it would be an alternative to local custom rules
>- particularly for spams that leave Bayes with little to work with and
> where individual body rules aren't worth much of a score.

I think it could be valuable in custom meta rules. That's how I would like to try it out anyway for a while with minuscule scores.

Dave
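
For anyone following the thread, RW's point about zero-count tokens and the training ratio can be illustrated with a rough sketch. It uses Robinson-style smoothing of the kind commonly found in Bayesian spam filters; it is not taken from the SA source. The function name token_probability and the values s=1.0 and x=0.5 are assumptions chosen purely for illustration.

```python
def token_probability(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed spam probability for one token (illustrative sketch).

    spam_hits / ham_hits: messages of each class containing the token.
    n_spam / n_ham: total trained messages of each class.
    s: weight given to the prior (assumed value, not SA's).
    x: prior probability for a token with no data (assumed value, not SA's).
    """
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    n = spam_hits + ham_hits
    if spam_rate + ham_rate == 0:
        # Token never seen: the result is entirely the chosen prior.
        return x
    # Raw estimate, normalised by how much of each class was trained.
    p = spam_rate / (spam_rate + ham_rate)
    # Pull the raw estimate towards the prior; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


# Never seen at all: pure prior.
unseen = token_probability(0, 0, 1000, 1000)
# Seen 3 times in spam, never in ham: raw estimate is 1.0, smoothed is lower.
spam_only = token_probability(3, 0, 1000, 1000)
# Same counts, different training ratios: the answer changes.
balanced = token_probability(1, 1, 1000, 1000)
skewed = token_probability(1, 1, 1000, 100)
print(unseen, spam_only, balanced, skewed)
```

With these assumed constants, a token seen only three times in spam scores 0.875 rather than 1.0, and that figure depends entirely on the choice of s and x. The last two calls show the same hit counts giving different probabilities once the spam/ham training ratio shifts.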