Dear all,

I'm trying to build a simple toy example with Mahout: preparing a collection of text documents for sentiment classification. The labels are P(ositive) and N(egative), and I have a train and a test set, so four datasets in total: train_P, train_N, test_P, test_N. After converting them to SequenceFiles with key "/dataset/docid" and the document text as value, I want to vectorize the training set and the test set over the same dictionary and IDF values, while keeping the training and test sets separate in the output.
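For concreteness, the conversion and vectorization steps I have in mind look roughly like this (a sketch only: directory names are placeholders, and the exact flags may differ between Mahout versions):

```shell
# Convert each raw text directory to a SequenceFile; seqdirectory uses the
# file path relative to the input root as the key, which should give keys
# of the form /dataset/docid when the four datasets are subdirectories.
mahout seqdirectory -i raw -o seq -ow

# Vectorize with tf-idf weighting over one shared dictionary.
# (This is the step where I would like to reuse an existing
# dictionary/IDF for the test set, if that were supported.)
mahout seq2sparse -i seq -o vectors -wt tfidf -lnorm -nv -ow
```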
The best way to do this would be to re-use the dictionary and IDF values from the training run. Is that possible? I'm guessing it is not implemented, judging from the unresolved discussion on this mailing list on Apr 17th 2011 and the unanswered Stack Overflow question at http://stackoverflow.com/questions/20885406/can-the-mahout-seq2sparse-command-use-the-previous-generated-dictionary In that case the dictionary/IDF would only pick up words from the training data, which is perfectly fine.

The second option I see is to give all four datasets as input to a single seq2sparse command, obtaining one tf-idf-vectorized dataset, and then split it afterwards with a small Java program that reads through the whole concatenated dataset and writes out a train and a test dataset, remapping the labels so that train_P and test_P become P, and train_N and test_N become N.

This situation arises whenever you train a model and want to use it to classify new/unseen documents, so option 1 is clearly the best.

Thanks for any guidance on this,
Tom Sercu
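PS: for option 2, the core of the split program I have in mind would be the key-remapping logic below. It is only a sketch under my own assumptions (the class name KeyRemapper is made up, and keys are assumed to look like "/train_P/doc42"); the surrounding loop over Hadoop's SequenceFile.Reader/Writer is omitted.

```java
// Hypothetical helper for option 2: given a SequenceFile key like
// "/train_P/doc42", decide which output split the vector belongs to
// and collapse the four dataset names down to the two labels P and N.
// The actual SequenceFile read/write loop is not shown here.
public class KeyRemapper {

    // Returns "train" or "test", assuming keys of the form "/<dataset>/<docid>".
    public static String split(String key) {
        String dataset = key.split("/")[1];          // e.g. "train_P"
        return dataset.startsWith("train") ? "train" : "test";
    }

    // Collapses train_P/test_P to "P" and train_N/test_N to "N".
    public static String label(String key) {
        String dataset = key.split("/")[1];
        return dataset.endsWith("_P") ? "P" : "N";
    }

    public static void main(String[] args) {
        // prints "train P" and "test N"
        System.out.println(split("/train_P/doc42") + " " + label("/train_P/doc42"));
        System.out.println(split("/test_N/doc7") + " " + label("/test_N/doc7"));
    }
}
```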