The last step before posting was to test it on SGD :) My results on
ASF mails (two labels) are around 80%, but both failure boxes (the
off-diagonal cells of the confusion matrix) get about 20% of the
messages. This seems more realistic.

There is another leakage/spam problem in the dev mails: build reports.
The MailProcessor has positive regex rules to find header entries and
subject lines. It has no negative regex rules to reject a message,
which would be the right way to nuke (the first) build message.
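
A rough sketch of what a negative rule could look like, in Python for
brevity. This is not the MailProcessor API; the patterns, the function
names and the dict-with-"subject" message layout are all made up for
illustration, and real build-report subjects may look different.

```python
import re

# Hypothetical reject patterns; tune these against the actual build mails.
REJECT_SUBJECT = [
    re.compile(r"\bbuild (failed|failure|successful|fixed)\b", re.IGNORECASE),
    re.compile(r"^\[(gump|continuum|jenkins|hudson)\]", re.IGNORECASE),
]


def is_rejected(subject):
    """Return True if any negative rule matches the subject line."""
    return any(p.search(subject) for p in REJECT_SUBJECT)


def filter_messages(messages):
    """Drop every message whose subject matches a negative rule.

    `messages` is an iterable of dicts with a "subject" key (an assumption).
    """
    return [m for m in messages if not is_rejected(m.get("subject", ""))]
```

The positive rules would still run afterwards on whatever survives the
filter.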

Is it worthwhile to clamp the training data so that there are similar
numbers of documents for each label? Or does Naive Bayes work well
with a bell curve?
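
If clamping turns out to be worthwhile, a minimal sketch of it is to
downsample every label to the size of the rarest one. Python again, not
Mahout code; the (label, text) pair layout and the function name are
assumptions.

```python
import random
from collections import defaultdict


def clamp_per_label(docs, seed=0):
    """Downsample so every label keeps as many docs as the rarest label.

    `docs` is a list of (label, text) pairs (an assumed layout).
    """
    by_label = defaultdict(list)
    for label, text in docs:
        by_label[label].append((label, text))
    if not by_label:
        return []
    limit = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], limit))
    rng.shuffle(balanced)  # avoid feeding the learner one label at a time
    return balanced
```

This throws away data from the larger labels, so it is only a baseline
to compare against training on the full, unbalanced set.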

On Wed, Jan 4, 2012 at 2:29 PM, Ted Dunning <[email protected]> wrote:
> Stripping quoted text is very important.
>
> Otherwise, you get a failure mode where the cross-validation in the
> CrossFoldLearner gives you an unrealistically optimistic view of things.
>  This happens because successive documents look too much alike.
>
> The result is that performance appears to get good (to the CFL) so the
> evolutionary process clamps down on the learning rate way too soon.  You
> get bad results on held out data because of this.
>
> On Wed, Jan 4, 2012 at 2:24 PM, Lance Norskog <[email protected]> wrote:
>
>> Sorry, cocoon v.s. commons.
>>
>> On Wed, Jan 4, 2012 at 2:24 PM, Lance Norskog <[email protected]> wrote:
>> > I have a separate solution: strip the quoted text. Quoted text in the
>> > emails spams the term vectors; just plain TF-IDF is not enough to
>> > combat this. Lucene has a lot of tools besides TF-IDF.
>> >
>> > I have a patch, gotta start the JIRA. Also added more measurements to
>> > the confusion matrix. I want to get a good measurement of the
>> > performance on each producer and consumer, not just a global ratio.
>> > 'testnb' gives 80% but one of the false boxes has a 1. This is bogus.
>> > (I'm using your complete corpus of commons v.s. cocoon, classifying
>> > dev v.s. user.)
>> >
>> > On Wed, Jan 4, 2012 at 6:57 AM, Grant Ingersoll (Updated) (JIRA)
>> > <[email protected]> wrote:
>> >>
>> >>     [
>> https://issues.apache.org/jira/browse/MAHOUT-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>> >>
>> >> Grant Ingersoll updated MAHOUT-939:
>> >> -----------------------------------
>> >>
>> >>    Attachment: MAHOUT-939.patch
>> >>
>> >> Here's a start on this.  Added some more construction options to the
>> AdaptiveLogisticRegression class.  Still testing what values to use in
>> TrainASFEmail, but thought I would put this up for now.
>> >>
>> >>> ASF Email SGD Examples don't produce good results
>> >>> -------------------------------------------------
>> >>>
>> >>>                 Key: MAHOUT-939
>> >>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-939
>> >>>             Project: Mahout
>> >>>          Issue Type: Bug
>> >>>    Affects Versions: 0.6
>> >>>            Reporter: Grant Ingersoll
>> >>>            Assignee: Grant Ingersoll
>> >>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>> >>>             Fix For: 0.7
>> >>>
>> >>>         Attachments: MAHOUT-939.patch
>> >>>
>> >>>
>> >>> The SGD examples for the ASF email don't work all that well currently
>> in terms of quality.  Also, need to determine how much memory is required
>> for vectors of cardinality size 100K.
>> >>
>> >> --
>> >> This message is automatically generated by JIRA.
>> >> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> >> For more information on JIRA, see:
>> http://www.atlassian.com/software/jira
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Lance Norskog
>> > [email protected]
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>



-- 
Lance Norskog
[email protected]
