On Thu, 21 Jan 2016 14:31:09 +0100 Christian Laußat wrote:

> On 21.01.2016 14:17, RW wrote:
> > The FNs dropped from 287 to 69, which I'd call a four-fold
> > improvement.
> >
> > The FPs rose from 0 to 1, but that mail was ham quoting a full
> > spam, so arguably it just did a better job of detecting the
> > embedded spam.
>
> Yes, but is it really worth the resources? I mean, the database got
> 13 times larger for 3-word tokens, and with more words per token it
> will grow exponentially.
But if you are training on error it only grows by a factor of about
3.1 (13 * 69/287), because the amount of mail you train scales with
the error count, and the FNs dropped from 287 to 69. You also have to
consider what happens if you simply reduce the retention time by that
same factor of 3.1: that corpus had 4 years of retention, so it's
unlikely that maintaining a constant-size database would have made
much difference in this case. When you train from corpus the database
size is dominated by ephemeral tokens, which makes the situation look
worse than it is.

It depends on what you want. I don't care about an extra 100 MB of
disk space and a few milliseconds if it gives any measurable
improvement.

Personally I wouldn't like to see Bayes go multi-word, because it
would likely end up as a poor compromise. Two-word tokenization is
the default in DSPAM, but I've not seen anyone advocate using it. I
think it's better to score in an external filter that runs in
addition to Bayes.
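For anyone who hasn't looked at what multi-word tokenization actually
does, here's a minimal sketch in Python. It is illustrative only, not
SpamAssassin's or DSPAM's real tokenizer; the function name and the
plain whitespace splitting are my own simplifications. It shows why
the distinct-token population, and hence the database, balloons as
the window widens:

from typing import List

def multiword_tokens(text: str, n: int = 2) -> List[str]:
    """Generate overlapping n-word tokens from a message body.

    n = 1 reproduces classic single-word Bayes tokens; larger n
    captures phrases like "free offer" at the cost of a much larger
    vocabulary of distinct tokens (the database growth discussed
    above).
    """
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Example: 2-word tokens over a short phrase
print(multiword_tokens("claim your free offer now", 2))
# ['claim your', 'your free', 'free offer', 'offer now']

The number of tokens per message stays roughly the same, but almost
every n-word combination is rarer than its component words, so far
more of the stored tokens are ephemeral ones seen only once or twice.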