On Mon, 27 Jun 2005, Theo Van Dinter spake:
> On Mon, Jun 27, 2005 at 03:54:35PM +0100, Nix wrote:
>> > run with --learn=N -- we're going to want to figure out N
>> > small # for large # of messages, large # for small # of messages?
>>
>> That sounds like an optimization problem to me (find that percentage
>> which yields the greatest accuracy when tested against an entirely
>> unrelated corpus).
>
> Well, it's more about finding an N that simulates real-world behavior. We
> don't want to find the N that gives the best results unless the same N is what
> the average user does.
True. What we really want is to learn not a random subset of messages,
but a random subset of the highly-scored ones. That's not the same as
auto-learning in the presence of net tests, but it's closer than picking
the messages at random. (If that's what's actually implemented, forgive
me: I haven't checked.)
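For concreteness, here's a minimal sketch of what I mean, assuming we
already have per-message scores from an earlier pass. The function name,
the (path, score) input, and the 5.0 threshold are all illustrative, not
what mass-check actually does:

    import random

    def select_training_subset(scored, n, threshold=5.0):
        """Pick n messages at random from those scoring above
        threshold, approximating what auto-learn would select.
        `scored` is an assumed list of (message_path, score) pairs
        from a prior scoring pass; `n` is the --learn=N count."""
        candidates = [path for path, score in scored if score >= threshold]
        if len(candidates) <= n:
            return candidates
        return random.sample(candidates, n)

    # e.g. to_learn = select_training_subset(scored, n=200)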
>> ... ah, I see, and this gives you Bayes-plus-net results, from which you
>> can determine the other results by just filtering certain rules out of
>> the mass-check results. Neat.
>
> Yeah, previous mass-check runs required 3 because we let auto-learn do its
> thing and that required scores to be set, and bayes depended on net rules,
> etc, etc.
Slight theoretical reduction in accuracy; huge reduction in time
spent. Probably a good trade-off, since nobody uses *exactly* the
environment we're training against anyway. (It might actually help
reduce our overfitting problems a bit ;) )
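On the filtering point, I'd guess something like this is all it takes to
derive the no-net results from the full run, though I'm guessing at the
log format here; the line layout and the NET_RULES set below are
illustrative stand-ins, and rescoring from the surviving hit lists would
still be a separate step:

    # Illustrative set of net-test rule names, not the actual list.
    NET_RULES = {"RCVD_IN_XBL", "RCVD_IN_SBL"}

    def strip_net_hits(line):
        """Assume a simplified 'verdict score path tests=R1,R2,...'
        line, with the rule list as the final field; drop net rules
        so one run yields both with-net and no-net hit lists."""
        head, sep, tests = line.partition("tests=")
        if not sep:
            return line  # no rule list on this line; pass through
        kept = [r for r in tests.split(",") if r not in NET_RULES]
        return head + sep + ",".join(kept)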
--
`I lost interest in "blade servers" when I found they didn't throw knives
at people who weren't supposed to be in your machine room.'
--- Anthony de Boer