I have a separate solution: strip the quoted text. Quoted text in the
emails spams the term vectors, and plain TF-IDF alone is not enough to
combat this. Lucene has a lot of tools besides TF-IDF.
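
To make the stripping idea concrete, here is a minimal sketch in Python.
It is not the patch and not Mahout or Lucene code; strip_quoted is a
hypothetical helper. It assumes quoted lines start with ">" and that the
attribution looks like "On ... wrote:", possibly wrapped onto a second
line. Real mail has more quoting styles than this.

```python
def strip_quoted(body):
    """Drop '>'-quoted lines and the 'On ... wrote:' attribution.

    Hypothetical helper for illustration; assumes '>' quoting only.
    """
    lines = body.splitlines()
    kept = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # Quoted text at any depth: '>', '>>', '> >', ...
        if line.lstrip().startswith(">"):
            i += 1
            continue
        if line.startswith("On "):
            # Attribution on a single line.
            if line.rstrip().endswith("wrote:"):
                i += 1
                continue
            # Attribution wrapped onto the next line.
            if i + 1 < len(lines) and lines[i + 1].rstrip().endswith("wrote:"):
                i += 2
                continue
        kept.append(line)
        i += 1
    return "\n".join(kept).strip()
```

Only the text that survives this filter would go on to tokenizing and
term-vector building, so the quoted copies no longer inflate term counts.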

I have a patch; I still need to open the JIRA issue. I also added more
measurements to the confusion matrix. I want a good measurement of the
performance on each producer and consumer, not just a global ratio.
'testnb' gives 80%, but one of the false boxes has a 1. This is bogus.
(I'm using your complete corpus of commons vs. cocoon, classifying
dev vs. user.)
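
To show why a global ratio can hide a bad class, here is a small Python
sketch of per-class precision and recall from a confusion matrix. The
counts are made up for illustration and are not the 'testnb' output; the
function names are hypothetical and not Mahout's API.

```python
def accuracy(matrix):
    """Global ratio: correct predictions over all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)}.

    matrix[i][j] = number of items whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        actual = sum(matrix[k])                    # row sum
        predicted = sum(row[k] for row in matrix)  # column sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Made-up counts, for illustration only.
labels = ["dev", "user"]
matrix = [[79, 1],   # true dev:  79 predicted dev, 1 predicted user
          [19, 1]]   # true user: 19 predicted dev, 1 predicted user

overall = accuracy(matrix)
by_class = per_class_metrics(matrix, labels)
```

With these invented counts the global accuracy is 80%, yet recall on
"user" is only 1 out of 20, which is the kind of per-class failure a
single ratio does not show.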

On Wed, Jan 4, 2012 at 6:57 AM, Grant Ingersoll (Updated) (JIRA)
<[email protected]> wrote:
>
>     [ https://issues.apache.org/jira/browse/MAHOUT-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Grant Ingersoll updated MAHOUT-939:
> -----------------------------------
>
>    Attachment: MAHOUT-939.patch
>
> Here's a start on this.  Added some more construction options to the 
> AdaptiveLogisticRegression class.  Still testing what values to use in 
> TrainASFEmail, but thought I would put this up for now.
>
>> ASF Email SGD Examples don't produce good results
>> -------------------------------------------------
>>
>>                 Key: MAHOUT-939
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-939
>>             Project: Mahout
>>          Issue Type: Bug
>>    Affects Versions: 0.6
>>            Reporter: Grant Ingersoll
>>            Assignee: Grant Ingersoll
>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>>             Fix For: 0.7
>>
>>         Attachments: MAHOUT-939.patch
>>
>>
>> The SGD examples for the ASF email don't work all that well currently in 
>> terms of quality.  Also, need to determine how much memory is required for 
>> vectors of cardinality size 100K.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA 
> administrators: 
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>



-- 
Lance Norskog
[email protected]
