On Thu, 21 Jan 2016 14:31:09 +0100
Christian Laußat wrote:

> On 21.01.2016 14:17, RW wrote:
> > The FNs dropped from 287 to 69, which I'd call a four-fold
> > improvement.
> > 
> > The FPs rose from 0 to 1, but that mail was ham quoting a full
> > spam, so arguably it just did a better job in detecting the
> > embedded spam.  
> 
> Yes, but is it really worth the resources? I mean, the database got
> 13 times larger for 3-word tokens, and with more words per token it
> will grow exponentially.

But if you are training on error, it only grows by a factor of 3.1
(13*69/287), because only the misclassified mail gets trained. You
also have to consider what happens if you simply reduce the retention
time by a factor of 3.1 - that corpus had 4 years of retention, so
it's unlikely that maintaining a constant-size database would have
made much difference in this case. When you train from a corpus the
database size is dominated by ephemeral tokens, which makes the
situation look worse than it is.
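
Roughly, as a back-of-envelope sketch (Python, just restating the
numbers above, with train-on-error taken to mean only misclassified
mail gets trained):

    # 3-word tokens mean ~13x as many tokens per trained mail, but
    # train-on-error trains far fewer mails (69 FNs instead of 287).
    token_ratio = 13               # per-mail growth for 3-word tokens
    fn_before, fn_after = 287, 69
    print(round(token_ratio * fn_after / fn_before, 1))   # -> 3.1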

It depends what you want. I don't care about an extra 100 MB
of disk space and a few milliseconds if it gives any measurable
improvement. 

Personally I wouldn't like to see Bayes go multi-word because it would
likely end up as a poor compromise. Two-word tokenization is the
default in DSPAM, but I've not seen anyone advocate using it. I think
it's better to score in an external filter that runs in addition to
Bayes.
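
To make concrete what I mean by multi-word tokenization, here's a
rough sketch (plain Python, not DSPAM's or SpamAssassin's actual
tokenizer):

    # Emit every run of 1..n adjacent words as a token.  Tokens per
    # mail grow roughly n-fold, and the space of distinct tokens
    # grows much faster, which is where the database bloat comes in.
    def tokens(text, n=2):
        words = text.lower().split()
        out = []
        for size in range(1, n + 1):
            for i in range(len(words) - size + 1):
                out.append(" ".join(words[i:i + size]))
        return out

    print(tokens("claim your free prize now", n=2))

With n=2 you get DSPAM-style word pairs; the snippet is only there to
show where the extra tokens come from.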
