On Mon, 27 Jun 2005, Theo Van Dinter spake:
> On Mon, Jun 27, 2005 at 03:54:35PM +0100, Nix wrote:
>> > run with --learn=N -- we're going to want to figure out N
>> > small # for large # of messages, large # for small # of messages?
>>
>> That sounds like an optimization problem to me (find that percentage
>> which yields the greatest accuracy when tested against an entirely
>> unrelated corpus).
>
> Well, it's more about finding an N that simulates real-world behavior. We
> don't want to find the N that gives the best results unless the same N is what
> the average user does.
True. What we really want is to learn not a random subset of messages,
but a random subset of the highly-scored ones. That's not the same as
auto-learning in the presence of net tests, but it's closer than picking
the messages at random. (If that's what's actually implemented, forgive
me: I haven't checked.)
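For concreteness, here's a minimal sketch of what I mean, assuming we
already have per-message scores from an earlier pass. The function name,
the (path, score) input, and the 5.0 threshold are all illustrative, not
what mass-check actually does:

    import random

    def select_training_subset(scored, n, threshold=5.0):
        """Pick n messages at random from those scoring above
        threshold, approximating what auto-learn would select.
        `scored` is an assumed list of (message_path, score) pairs
        from a prior scoring pass; `n` is the --learn=N count."""
        candidates = [path for path, score in scored if score >= threshold]
        if len(candidates) <= n:
            return candidates
        return random.sample(candidates, n)

    # e.g. to_learn = select_training_subset(scored, n=200)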
>> ... ah, I see, and this gives you Bayes-plus-net results, from which you
>> can determine the other results by just filtering certain rules out of
>> the mass-check results. Neat.
>
> Yeah, previous mass-check runs required 3 because we let auto-learn do its
> thing and that required scores to be set, and bayes depended on net rules,
> etc, etc.
Slight theoretical reduction in accuracy; huge reduction in time
spent. Probably a good trade-off, since nobody uses *exactly* the
environment we're training against anyway. (It might actually help
reduce our overfitting problems a bit ;) )
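On the filtering point, I'd guess something like this is all it takes to
derive the no-net results from the full run, though I'm guessing at the
log format here; the line layout and the NET_RULES set below are
illustrative stand-ins, and rescoring from the surviving hit lists would
still be a separate step:

    # Illustrative set of net-test rule names, not the actual list.
    NET_RULES = {"RCVD_IN_XBL", "RCVD_IN_SBL"}

    def strip_net_hits(line):
        """Assume a simplified 'verdict score path tests=R1,R2,...'
        line, with the rule list as the final field; drop net rules
        so one run yields both with-net and no-net hit lists."""
        head, sep, tests = line.partition("tests=")
        if not sep:
            return line  # no rule list on this line; pass through
        kept = [r for r in tests.split(",") if r not in NET_RULES]
        return head + sep + ",".join(kept)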
--
`I lost interest in "blade servers" when I found they didn't throw knives
at people who weren't supposed to be in your machine room.'
--- Anthony de Boer