Sequence file format for Kmeans, LDA, etc.

Gregory Lawrence Thu, 12 Nov 2009 17:58:24 -0800

Hi,

I'm trying to write a map-reduce program that will convert text documents into 
a format suitable for Mahout's clustering algorithms. From what I can gather, 
the output should be a sequence file whose key is a long integer document 
index and whose value is a sparse vector of TF (or TF-IDF) weights. 
This sparse vector also has a name that identifies the document.
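
To make the layout I have in mind concrete, here is a rough sketch in plain 
Java. It deliberately uses no Hadoop or Mahout classes, so it only illustrates 
the (document index, sparse TF vector) pairs and does not write a real 
sequence file. The names TfSketch, indexOf and termFrequencies are made up for 
illustration, and the tokenization (lowercase, split on non-word characters) 
is just an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class TfSketch {

    // Hypothetical helper: give each distinct term a dense integer index,
    // assigning the next free index the first time a term is seen.
    static int indexOf(Map<String, Integer> dictionary, String term) {
        Integer idx = dictionary.get(term);
        if (idx == null) {
            idx = dictionary.size();
            dictionary.put(term, idx);
        }
        return idx;
    }

    // Build a sparse TF vector (term index -> raw count) for one document.
    // Only terms that actually occur in the document get an entry.
    static Map<Integer, Double> termFrequencies(String text,
                                                Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            int idx = indexOf(dictionary, token);
            vector.merge(idx, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        String[] docs = {"the cat sat on the mat", "the dog sat"};

        // The key here is just the position of the document in the array;
        // whether it has to be unique or contiguous is exactly my question.
        for (long docId = 0; docId < docs.length; docId++) {
            Map<Integer, Double> tf = termFrequencies(docs[(int) docId], dictionary);
            System.out.println(docId + "\t" + tf);
        }
    }
}
```

In the real job I assume the key would become a LongWritable and the value one 
of Mahout's sparse vector classes, but I have not confirmed the exact types.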


Does the long integer document index matter? I would rather avoid having to set 
this to something meaningful. Do the numbers have to be unique or contiguous? 
Does the name of the sparse vector matter? I noticed that it is being set as a 
string in LuceneIterable.
