[ 
https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1545.
----------------------------------------

       Resolution: Later
    Fix Version/s: 1.0

Closing this as it is a reminder for things to do in the future.

> Creating holdout sets with seq2sparse and split
> -----------------------------------------------
>
>                 Key: MAHOUT-1545
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1545
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification, CLI, Examples
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> The current method for vectorizing data, running seq2sparse and then "split",
> allows a large amount of information to leak between the training sets and
> the test sets, especially in the case of TF-IDF transformations. The IDF
> transform gives the training set a lot of information about the holdout set
> if it is calculated before the two are split.
> Given the current seq2sparse implementation's status as Legacy and the
> relatively minor advantages it might give, I'm not sure whether it's worth
> adding something like a "split" option to
> SparseVectorsFromSequenceFiles.java. But I know that I saw a new
> implementation being discussed, and I think it would be worth it to
> have an option like this built in.
> I think that this issue may have been raised before, but I wanted to bring it
> up again in light of the current move away from MapReduce and the new
> implementations of Mahout tools that will be coming along.
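
The leakage described in the issue can be illustrated with a small,
self-contained sketch. It is not Mahout code and does not use seq2sparse or
split: the function names, the toy corpus, and the smoothed IDF formula
log((1 + N) / (1 + df)) + 1 are all assumptions chosen for illustration. It
compares IDF weights fitted on the full corpus (vectorize, then split) with
IDF weights fitted on the training documents only (split, then vectorize).

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def idf_weights(docs):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1.

    This formula is an assumption for the sketch, not Mahout's weighting.
    """
    n = len(docs)
    return {
        term: math.log((1 + n) / (1 + count)) + 1.0
        for term, count in document_frequencies(docs).items()
    }


def tfidf(tokens, idf):
    """Weight raw term counts by IDF; terms unseen during fitting are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


# Hypothetical toy corpus; the last document plays the role of the holdout set.
corpus = [
    ["spam", "offer", "free"],
    ["meeting", "agenda", "free"],
    ["offer", "meeting", "notes"],
    ["rare", "free", "offer"],
]
train, test = corpus[:3], corpus[3:]

# Vectorize, then split: IDF is fitted on training and holdout documents.
leaky_idf = idf_weights(corpus)
# Split, then vectorize: IDF is fitted on the training documents only.
clean_idf = idf_weights(train)

leaky_train_vectors = [tfidf(doc, leaky_idf) for doc in train]
clean_train_vectors = [tfidf(doc, clean_idf) for doc in train]

# Holdout documents are weighted with statistics learned from training only.
test_vectors = [tfidf(doc, clean_idf) for doc in test]
```

In the leaky variant the training vectors change when the holdout document
changes, and the dictionary contains a term ("rare") that occurs only in the
holdout set. Fitting on the training split alone avoids both effects, at the
cost of dropping holdout terms that were never seen during training.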



--
This message was sent by Atlassian JIRA
(v6.2#6252)
