On 21.01.2016 at 17:53, John Hardin wrote:
On Thu, 21 Jan 2016, RW wrote:

On Thu, 21 Jan 2016 14:31:09 +0100
Christian Laußat wrote:

On 21.01.2016 14:17, RW wrote:
The FNs dropped from 287 to 69, which I'd call a four-fold
improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full
spam, so arguably it just did a better job in detecting the
embedded spam.

Yes, but is it really worth the resources? I mean, the database got
13 times larger for 3-word tokens, and with more words per token it
will grow exponentially.
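
Just to illustrate the size concern with a toy example in Python (this is
not SA's tokenizer, and the real numbers depend entirely on the corpus):

# count distinct tokens for 1-word vs up-to-3-word tokenization
def ngrams(words, n):
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

corpus = [
    "buy cheap replica watches now".split(),
    "cheap watches buy now online".split(),
]
one_word = set().union(*(ngrams(m, 1) for m in corpus))
up_to_three = set().union(*(ngrams(m, n) for m in corpus for n in (1, 2, 3)))
print(len(one_word), len(up_to_three))   # 6 vs 20 distinct tokens to store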

But if you are training on error it only grows by a factor of 3.1
(13*69/287). You also have to consider what happens if you simply
reduce the retention time by a factor of 3.1: that corpus had 4 years of
retention, so it's unlikely that maintaining a constant-size database
would have made much difference in this case. When you train from
corpus, the database size is dominated by ephemeral tokens, which makes
the situation look worse than it is.
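
Side note, if I read the parenthetical right: the 13x growth was measured
by training the whole corpus, while train-on-error only feeds in the mail
that is still misclassified, so roughly:

# full-corpus growth scaled by the share of messages still trained on error
print(13 * 69 / 287)   # ~3.1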

It depends what you want. I don't care about an extra 100 MB
of disk space and a few milliseconds if it gives any measurable
improvement.

Personally I wouldn't like to see Bayes go multi-word because it would
likely end up as a poor compromise. Two-word tokenization is the
default on DSPAM, but I've not seen anyone advocate using it. I think
it's better to score in an external filter that runs in addition to
Bayes.

There was an improvement in FP and FN from two tokens. The marginal
improvement from three doesn't seem worth it.

I'd like to see an SA Bayes config option to select between one-word and
two-word tokens.


Not only you!

like "bayes_token_sources all" was introduced a "bayes_multiword_tokens <integer>" would be perfect dsiabled by default, so one could easily verify the differences with a existing corpus and what's the best result

Like the MIME tokens, these should be additional to the 1-word tokens, which would be generated in any case.
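
Purely as a sketch of the intended behaviour, in Python rather than the
actual SA code (the option name and the function are made up):

# bayes_multiword_tokens N: 0 = off (current behaviour), N >= 2 adds
# 2..N-word tokens on top of the 1-word tokens that are always generated
def tokenize(words, bayes_multiword_tokens=0):
    tokens = list(words)                        # 1-word tokens, always
    for n in range(2, bayes_multiword_tokens + 1):
        tokens += [" ".join(words[i:i + n])     # additional n-word tokens
                   for i in range(len(words) - n + 1)]
    return tokens

# bayes_multiword_tokens 0 -> 1-word tokens only
# bayes_multiword_tokens 2 -> adds word pairs
# bayes_multiword_tokens 3 -> adds word pairs and triples
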
_________________________

for "Two-word tokenization is the default on DSPAM, but I've not seen anyone advocate using it" - just because it is a dead project, looking only at the bayes-implementation i have read more than once it's better then SA and the reason to not consider it was the fact it's dead and full of unfixed bugs
