Hi, I'm trying to write a map-reduce program that converts text documents into a format suitable for Mahout's clustering algorithms. From what I can gather, the output should be a sequence file whose key is a long integer document index and whose value is a sparse vector of TF (or TF-IDF) weights. The sparse vector also has a name that identifies the document.
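To make the record layout concrete, here is a minimal sketch of what I think each key/value pair looks like. It is plain Java with no Mahout or Hadoop classes: NamedSparseVector, termFrequencies and indexOf are hypothetical names I made up for illustration, and the single in-memory dictionary is a simplification that would not carry over directly to a distributed job. I have not checked this against Mahout's actual vector classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class TfRecordSketch {

    /** Hypothetical stand-in for a named sparse vector: term index -> TF count. */
    static final class NamedSparseVector {
        final String name;                                    // identifies the document
        final Map<Integer, Double> weights = new TreeMap<>(); // only non-zero entries are stored

        NamedSparseVector(String name) {
            this.name = name;
        }
    }

    /** Returns the index of a term, assigning the next free index to unseen terms. */
    static int indexOf(Map<String, Integer> dictionary, String term) {
        Integer idx = dictionary.get(term);
        if (idx == null) {
            idx = dictionary.size();
            dictionary.put(term, idx);
        }
        return idx;
    }

    /** Builds raw term-frequency counts for one document. */
    static NamedSparseVector termFrequencies(String docName, String text,
                                             Map<String, Integer> dictionary) {
        NamedSparseVector vector = new NamedSparseVector(docName);
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            vector.weights.merge(indexOf(dictionary, token), 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        Map<Long, NamedSparseVector> records = new LinkedHashMap<>();

        // Assumption: the long key is just a running counter with no other meaning.
        long key = 0;
        records.put(key++, termFrequencies("doc-a.txt", "the cat sat on the mat", dictionary));
        records.put(key++, termFrequencies("doc-b.txt", "the dog sat", dictionary));

        // One line per record: key, vector name, sparse term counts.
        for (Map.Entry<Long, NamedSparseVector> e : records.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue().name + "\t" + e.getValue().weights);
        }
    }
}
```

In this sketch the key is an arbitrary counter and the document identity lives only in the vector's name, which is the arrangement I am hoping is acceptable.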
Does the value of the long integer document index matter? I would rather not have to set it to something meaningful. Do the numbers have to be unique or contiguous? And does the name of the sparse vector matter? I noticed that it is set to a string in LuceneIterable.
