[ https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Schelter resolved MAHOUT-1545.
----------------------------------------
       Resolution: Later
    Fix Version/s: 1.0

Closing this as it is a reminder for things to do in the future.

> Creating holdout sets with seq2sparse and split
> -----------------------------------------------
>
>                 Key: MAHOUT-1545
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1545
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification, CLI, Examples
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> The current method for vectorizing data using seq2sparse and then "split"
> allows a large amount of information to spill over from the training sets
> to the test sets, especially in the case of TF-IDF transformations. The IDF
> transform provides a lot of information on the holdout set to the training
> set if it is calculated before splitting them up.
>
> Given the current seq2sparse implementation's status as legacy and the
> relatively minor advantages that it might give, I'm not sure whether it's
> worth adding something like a "split" option to
> SparseVectorsFromSequenceFiles.java. But I know that I saw a new
> implementation being discussed, and I think that it would be worth it to
> have an option like this built in.
>
> I think that this issue may have been raised before, but I wanted to bring
> it up again in light of the current move away from MapReduce and the new
> implementations of Mahout tools that will be coming along.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
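
To make the leakage described in the issue concrete, here is a minimal sketch in plain Python. It is not Mahout code: the helper names (document_frequencies, idf_weights, tfidf), the toy corpus, and the smoothed IDF formula log((1 + N) / (1 + df)) + 1 are all assumptions chosen for illustration, and Mahout's own weighting may differ. It compares IDF fitted on the whole corpus before splitting with IDF fitted on the training documents only.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df


def idf_weights(docs):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1.

    This is one common variant, assumed here for illustration only.
    """
    n = len(docs)
    return {
        term: math.log((1 + n) / (1 + count)) + 1.0
        for term, count in document_frequencies(docs).items()
    }


def tfidf(doc, idf):
    """Weight raw term counts by IDF; terms unseen when fitting IDF are dropped."""
    tf = Counter(doc.split())
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


# Hypothetical toy corpus: the first two documents are the training set,
# the last two are the holdout set.
corpus = [
    "mahout builds classifiers",
    "mahout vectorizes text",
    "holdout sets measure accuracy",
    "holdout text stays unseen",
]
train, test = corpus[:2], corpus[2:]

leaky_idf = idf_weights(corpus)  # fitted before the split: sees holdout documents
clean_idf = idf_weights(train)   # fitted after the split: training documents only

# Vectorize the same training document under both weightings.
leaky_vec = tfidf(train[1], leaky_idf)
clean_vec = tfidf(train[1], clean_idf)

for term in sorted(clean_vec):
    print(f"{term:12s} leaky={leaky_vec[term]:.3f} clean={clean_vec[term]:.3f}")
```

With the clean weights, "text" and "vectorizes" each occur in one training document and so receive the same weight. With the leaky weights, "text" is weighted lower than "vectorizes" only because it also occurs in a holdout document, so the training vectors already encode holdout statistics. Fitting the weights on the training split and reusing them for the test split avoids this.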