On 21.01.2016 at 17:53, John Hardin wrote:
> On Thu, 21 Jan 2016, RW wrote:
>> On Thu, 21 Jan 2016 14:31:09 +0100 Christian Laußat wrote:
>>> On 21.01.2016 14:17, RW wrote:
>>>> The FNs dropped from 287 to 69, which I'd call a four-fold
>>>> improvement. The FPs rose from 0 to 1, but that mail was ham
>>>> quoting a full spam, so arguably it just did a better job in
>>>> detecting the embedded spam.
>>>
>>> Yes, but is it really worth the resources? I mean, the database got
>>> 13 times larger for 3-word tokens, and with more words per token it
>>> will grow exponentially.
>>
>> But if you are training on error it only grows by a factor of 3.1
>> (13*69/287). You also have to consider what happens if you simply
>> reduce the retention time by a factor of 3.1 - that corpus had 4
>> years retention so it's unlikely that maintaining a constant size
>> database would have made much difference in this case. When you
>> train from corpus the database size is dominated by ephemeral tokens
>> which makes the situation look worse than it is.
>>
>> It depends what you want. I don't care about an extra 100 MB of disk
>> space and a few milliseconds if it gives any measurable improvement.
>>
>> Personally I wouldn't like to see Bayes go multi-word because it
>> would likely end up as a poor compromise. Two-word tokenization is
>> the default on DSPAM, but I've not seen anyone advocate using it. I
>> think it's better to score in an external filter that runs in
>> addition to Bayes.
>
> There was an improvement in FP and FN from two tokens. The marginal
> improvement from three doesn't seem worth it. I'd like to see a SA
> Bayes config option to select between one-word and two-word tokens

not only you!

just as "bayes_token_sources all" was introduced, a "bayes_multiword_tokens <integer>" option would be perfect, disabled by default, so one could easily verify the differences against an existing corpus and see what gives the best result
like the mime-tokens, these should be additional to the 1-word tokens, which are generated in any case
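
to make the idea concrete, here is a rough sketch in Python of what "additional" multi-word tokens could look like. it is only an illustration: the function name `tokenize`, the parameter `max_words` (standing in for the proposed, not yet existing "bayes_multiword_tokens" value) and the simple word-splitting regex are all assumptions, and this is not how the SA Bayes tokenizer actually works

```python
import re


def tokenize(text, max_words=1):
    """Return 1-word tokens plus additional n-word tokens up to max_words.

    Hypothetical sketch: max_words mimics the proposed
    "bayes_multiword_tokens <integer>" value; 1 means single words only.
    """
    # Naive word split (an assumption, real tokenizers are more involved).
    words = re.findall(r"[a-z0-9']+", text.lower())

    # The 1-word tokens are generated in any case.
    tokens = list(words)

    # Multi-word tokens are added on top, never replacing the single words.
    for n in range(2, max_words + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


if __name__ == "__main__":
    print(tokenize("Buy cheap pills now", max_words=2))
```

with max_words=1 the output is the same as plain single-word tokenization, so comparing results on an existing corpus would only require changing that one value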

regarding "Two-word tokenization is the default on DSPAM, but I've not seen anyone advocate using it": that is just because it is a dead project. looking only at the bayes implementation, i have read more than once that it's better than SA's, and the reason not to consider it was the fact that it's dead and full of unfixed bugs
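
for anyone who wants to re-check the numbers quoted from RW's test above, the arithmetic can be reproduced in a few lines of Python. the variable names are made up for this sketch; the values (287 and 69 false negatives, a 13 times larger database when training from corpus) are taken from the quoted mails

```python
# Figures quoted in the thread (variable names are illustrative only).
fn_before = 287        # false negatives with 1-word tokens
fn_after = 69          # false negatives with 3-word tokens
db_growth_corpus = 13  # database growth factor when training from corpus

# Improvement in false negatives.
fn_improvement = fn_before / fn_after

# When training on error, fewer mails need training, so the growth
# factor is scaled down by the same ratio: 13 * 69 / 287.
growth_train_on_error = db_growth_corpus * fn_after / fn_before

print(round(fn_improvement, 2))         # 4.16
print(round(growth_train_on_error, 2))  # 3.13
```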