I tested Apache Commons vs. Cocoon. They use two different build
systems, with different message formats. I believe the repetitive
build messages have the effect of spamming terms, both in the subject
lines and in the bodies. In fact, the subject lines are probably bigger
offenders than the bodies. But we shall see.
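
For what it's worth, here is a rough Python sketch of the kind of
negative (reject) rule I have in mind. It is not what MailProcessor
does today. The subject patterns and the names is_build_report and
filter_messages are made up for illustration; the real build-report
subject formats would have to be checked against the list archives.

```python
import re

# Hypothetical reject patterns. The real build-report subject formats
# would have to be taken from the actual list archives.
REJECT_SUBJECT_PATTERNS = [
    re.compile(r"\bbuild (failed|failure|successful|fixed)\b", re.IGNORECASE),
    re.compile(r"^\[(continuum|gump|jenkins|hudson)\]", re.IGNORECASE),
]


def is_build_report(subject):
    """Return True if the subject matches any negative (reject) rule."""
    subject = subject.strip()
    return any(p.search(subject) for p in REJECT_SUBJECT_PATTERNS)


def filter_messages(messages):
    """Keep only messages whose subject does not look like a build report.

    Each message is assumed to be a dict with a 'subject' key.
    """
    return [m for m in messages if not is_build_report(m.get("subject", ""))]
```

Rejecting on the subject line alone is the cheapest check, and it
targets the part of the message I suspect is the bigger offender.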

On Wed, Jan 4, 2012 at 6:29 PM, Ted Dunning <[email protected]> wrote:
> On Wed, Jan 4, 2012 at 4:59 PM, Lance Norskog <[email protected]> wrote:
>
>> The last step before posting was to test it on SGD :) My results on
>> ASF mails (two labels) are around 80%, but both failure boxes get
>> about 20% of the messages. This seems more realistic.
>>
>
> Was this 80% on time-separated test data?  Or the training data?
>
>
>> There is another leakage/spam problem in the dev mails: build reports.
>>
>
> Why are these a problem?  Too easy?
>
> They are emails sent to the group and should be reasonable to classify
> unless they inflate the accuracy.
>
>
>> The MailProcessor has positive regex rules to find header entries &
>> subject lines. It does not do negative regex rules to reject a
>> message; this is the right way to nuke (the first) build message.
>>
>
> Yes.  They should be easy to nuke.  But I am not sure why.
>
>
>> Is it worthwhile to clamp the training data so that there are similar
>> numbers of documents for each label? Or does Naive Bayes work well
>> with a bell curve?
>>
>
> Shouldn't matter much for any of our classifiers.
>
> The only strong reason to do this is to speed up training but this data set
> is pretty small.
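
To make the clamping question concrete, this is roughly what I meant:
downsample every label to the size of the rarest one. It is only a
sketch. The function name clamp_per_label and the (label, text) pair
layout are my own assumptions, not anything in Mahout.

```python
import random
from collections import defaultdict


def clamp_per_label(docs, seed=0):
    """Downsample so every label has as many documents as the rarest label.

    docs is assumed to be a list of (label, text) pairs.
    """
    by_label = defaultdict(list)
    for label, text in docs:
        by_label[label].append((label, text))
    if not by_label:
        return []
    # The cap is the document count of the smallest label.
    cap = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    clamped = []
    for label in sorted(by_label):
        clamped.extend(rng.sample(by_label[label], cap))
    rng.shuffle(clamped)
    return clamped
```

Given the answer above, this would only be worth doing to cut
training time, since it throws away documents from the larger label.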



-- 
Lance Norskog
[email protected]
