Hi all,

After a while I've gotten a clustering I'm more or less happy with, from a corpus 
of text articles.


I did the standard thing: took a directory of text files -> sequence file -> 
TF-IDF vectors -> canopies -> k-means.


Now I want to take a second set of documents and see how they would fit into the 
existing clusters. The idea is to take each document, transform it into a feature 
vector (TF-IDF), and then see how close it is to the centroids (likely going 
through canopies first, for speed).
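
Roughly what I have in mind for the scoring step is below. This is just a sketch: 
closest() is my own helper name, it assumes the centroids have already been loaded 
into memory as Mahout Vectors (how you read them out of the final clusters 
directory depends on the Mahout version), and it assumes cosine distance, so 
substitute whichever DistanceMeasure was actually passed to canopy/k-means.

    import java.util.List;

    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.math.Vector;

    public class NearestCentroid {

      /** Returns the index of the centroid closest to the document vector. */
      public static int closest(Vector doc, List<Vector> centroids) {
        // Assumption: clustering was run with cosine distance; use the
        // same DistanceMeasure here as in the canopy/k-means run.
        DistanceMeasure measure = new CosineDistanceMeasure();
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
          double d = measure.distance(centroids.get(i), doc);
          if (d < bestDistance) {
            bestDistance = d;
            best = i;
          }
        }
        return best;
      }
    }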
 

However, I'm having a hard time turning the new documents into vectors. I could 
just run seqdirectory and then seq2sparse on them, but that would build a new 
dictionary and new document frequencies from the second set alone, so the term 
indices and IDF weights would differ and the resulting feature vectors just 
wouldn't be comparable. What would be the way to consistently reuse the same 
dictionary and get meaningful feature vectors for different sets of documents?
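
What I've been sketching so far is below, though I'm not sure it's the intended 
way. It reads back the dictionary.file-0 (term -> index) and df-count 
(index -> document frequency) outputs that seq2sparse wrote for the original 
corpus, and re-applies Mahout's TFIDF weight to each new document. The paths and 
the numDocs argument are placeholders you'd fill in, and the whitespace tokenizer 
is only a stand-in for whichever Lucene Analyzer seq2sparse was run with.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.TFIDF;

    public class ReuseDictionary {

      /** Loads the term -> index dictionary seq2sparse wrote (dictionary.file-0). */
      static Map<String, Integer> readDictionary(FileSystem fs, Path path,
                                                 Configuration conf) throws Exception {
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text term = new Text();
        IntWritable index = new IntWritable();
        while (reader.next(term, index)) {
          dictionary.put(term.toString(), index.get());
        }
        reader.close();
        return dictionary;
      }

      /** Loads the index -> document frequency counts (the df-count part files). */
      static Map<Integer, Long> readDocFrequencies(FileSystem fs, Path path,
                                                   Configuration conf) throws Exception {
        Map<Integer, Long> df = new HashMap<Integer, Long>();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable index = new IntWritable();
        LongWritable count = new LongWritable();
        while (reader.next(index, count)) {
          df.put(index.get(), count.get());
        }
        reader.close();
        return df;
      }

      /** Vectorizes one new document against the ORIGINAL dictionary and df counts. */
      static Vector vectorize(String text, Map<String, Integer> dictionary,
                              Map<Integer, Long> df, int numDocsInOriginalCorpus) {
        // Count term frequencies; terms absent from the old dictionary are
        // dropped, which keeps the vector in the same feature space as before.
        // NOTE: placeholder tokenizer -- use the same Analyzer as seq2sparse.
        Map<Integer, Integer> termFreqs = new HashMap<Integer, Integer>();
        for (String token : text.toLowerCase().split("\\W+")) {
          Integer index = dictionary.get(token);
          if (index != null) {
            Integer c = termFreqs.get(index);
            termFreqs.put(index, c == null ? 1 : c + 1);
          }
        }
        Vector vector = new RandomAccessSparseVector(dictionary.size());
        TFIDF weight = new TFIDF();  // the same weighting seq2sparse applies
        for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
          long termDf = df.containsKey(e.getKey()) ? df.get(e.getKey()) : 1L;
          // The length argument is ignored by Mahout's TFIDF weight, so 0 is fine.
          vector.set(e.getKey(),
              weight.calculate(e.getValue(), (int) termDf, 0, numDocsInOriginalCorpus));
        }
        return vector;
      }
    }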


