seq2sparse seems to be ignoring the value of my “-x” parameter

Matt Molek Tue, 25 Sep 2012 10:25:37 -0700

I'm using Mahout 0.7 on a pseudo-distributed Hadoop installation for
testing purposes.


Much of what I'm doing follows Mahout in Action, which I know covers
0.5, but as far as I can tell, nothing major has changed in seq2sparse
since then.

I'm having a problem with the tfidf vectors generated by seq2sparse.
No matter what I set "-x" (max document frequency percentage) to, I
end up with the same number of terms in my dictionary, and vectors of
the same size. Shouldn't I be getting smaller tfidf vectors as my -x
value decreases?
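
To make that expectation concrete, here is a minimal Python sketch of the
pruning I assumed -x performs. This is not Mahout's code: the function name
prune_by_max_df, the toy documents, and the "keep a term if its count is at
most the limit" comparison are all my own assumptions for illustration.

```python
from collections import Counter

def prune_by_max_df(docs, max_df_percent):
    """Keep only terms that occur in at most max_df_percent of the documents.

    Hypothetical helper for illustration; not part of Mahout.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: the percentage is converted to a document count like this.
    limit = max_df_percent / 100.0 * n_docs
    return {term for term, count in df.items() if count <= limit}

# Toy corpus: "the" is in every document, "cat" and "ran" in half of them.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

for x in (100, 75, 50, 25):
    print(x, len(prune_by_max_df(docs, x)))
```

On this toy corpus the dictionary shrinks from 7 terms to 6 to 4 as the
percentage drops, which is the behaviour I expected to see from seq2sparse.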

I found one posting about Mahout 0.6 where -x was being parsed as an
absolute number of documents rather than a percentage of documents.
That was supposed to have been fixed in 0.7, but I tried using it
that way too, just to see if it would help. It made no difference to
the number of terms I get. The values I've tried and the resulting
term counts are listed below.

My data set is 4850 Wikipedia articles from:
http://dumps.wikimedia.org/enwiki/20110803/

The exact file is: pages-articles1.xml.bz2

I converted the XML file to a sequence file with:

mahout seqwiki -all -i <path to xml file> -o <path to output directory>

My calls to seq2sparse look like this:

mahout seq2sparse -i <seq directory> -o <out dir> -ow -wt tfidf -x 4800 -nv

My results:

| -x value | # of terms |
| 4800     | 256623     |
| 4600     | 256623     |
| 2500     | 256623     |
| 99       | 256623     |
| 90       | 256623     |
| 25       | 256623     |
| 5        | 256623     |
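
For reference, this is the document-count cutoff I would expect each of those
values to translate to under the two readings of -x (percentage versus
absolute count), given my 4850 documents. It is only a sketch: df_cutoff is a
made-up name, and the integer rounding and the cap at the corpus size are my
assumptions, not something taken from the Mahout source.

```python
N_DOCS = 4850  # number of articles in my data set
X_VALUES = [4800, 4600, 2500, 99, 90, 25, 5]  # the -x values I tried

def df_cutoff(x, n_docs, as_percentage):
    """Largest document frequency a term could have and still be kept.

    Hypothetical helper; the rounding and the cap at n_docs are assumptions.
    """
    if as_percentage:
        return min(n_docs, x * n_docs // 100)
    return min(n_docs, x)

for x in X_VALUES:
    print(x, df_cutoff(x, N_DOCS, True), df_cutoff(x, N_DOCS, False))
```

Read as a percentage, the three values above 100 would prune nothing, but
99, 90, 25 and 5 give cutoffs of 4801, 4365, 1212 and 242 documents. Read as
an absolute count, every value gives a different cutoff. Either way I would
not expect all seven runs to produce the same dictionary size.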

Any ideas on what I'm doing wrong? Thanks for the help.
