I'm using Mahout 0.7 on a pseudo-distributed Hadoop installation for testing purposes.
A lot of what I'm doing is being guided by Mahout in Action, which I know deals with 0.5, but as far as I can tell, nothing major has changed with seq2sparse.

I'm having a problem with the tfidf vectors generated by seq2sparse. No matter what I set "-x" (max document frequency percentage) to, I end up with the same number of terms in my dictionary, and vectors of the same size. Shouldn't I be getting smaller tfidf vectors as my -x value decreases?

I found one posting about Mahout 0.6 where -x was being parsed as an absolute number of documents rather than a percentage of documents. That was supposed to have been fixed in 0.7, but I tried using it that way too, just to see if it would help. No change in the number of terms I'm getting.

My data set is 4850 Wikipedia articles from:

    http://dumps.wikimedia.org/enwiki/20110803/

The exact file is:

    pages-articles1.xml.bz2

The XML file was turned into a seqfile with:

    mahout seqwiki -all -i <path to xml file> -o <path to output directory>

My calls to seq2sparse look like this:

    mahout seq2sparse -i <seq directory> -o <out dir> -ow -wt tfidf -x 4800 -nv

Here are the values I've tried, and the number of terms I've ended up with:

| -x value | # of terms |
| 4800     | 256623     |
| 4600     | 256623     |
| 2500     | 256623     |
| 99       | 256623     |
| 90       | 256623     |
| 25       | 256623     |
| 5        | 256623     |

Any ideas on what I'm doing wrong?

Thanks for the help.
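
P.S. To make the behavior I expect from -x concrete, here is a toy Python sketch of max-document-frequency pruning. It only reflects my assumption of what a percentage cutoff should do; it is not Mahout's actual code, and the function name prune_by_max_df and the sample documents are made up for illustration.

```python
def prune_by_max_df(docs, max_df_percent):
    """Return the sorted vocabulary, dropping every term that appears in
    more than max_df_percent percent of the documents.

    Hypothetical helper: an illustration of the expected effect of a
    max-DF percentage cutoff, not how seq2sparse is implemented.
    """
    n_docs = len(docs)

    # Document frequency: the number of documents containing each term.
    df = {}
    for doc in docs:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1

    # A term survives only if its document frequency is within the cutoff.
    limit = n_docs * max_df_percent / 100.0
    return sorted(term for term, count in df.items() if count <= limit)


# Made-up sample corpus, whitespace-tokenized.
docs = [
    "the cat sat",
    "the dog ran",
    "the cat ran",
    "the bird flew",
]

for x in (100, 50, 25):
    print(x, len(prune_by_max_df(docs, x)))
```

On this toy corpus the dictionary shrinks as the cutoff drops (7 terms at 100, 6 at 50, 4 at 25), which is the kind of change I expected to see in my term counts.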
