Hi all,
After a while I've gotten a clustering I'm more or less happy with from a corpus of text articles. I did the standard thing: took a directory of text files -> seqdirectory -> seq2sparse (TF-IDF) -> canopy -> k-means.

Now I want to take a second set of documents and see how they would fit into the existing clusters. The idea is to transform each new document into a TF-IDF feature vector and then see how close it is to the centroids (likely via canopies, for speed). What I'm having a hard time with is turning the new documents into vectors. I could just run seqdirectory and seq2sparse again, but that would build a new dictionary and compute new IDF weights, so the resulting vectors wouldn't live in the same feature space as the centroids and wouldn't be usable. What would be the way to consistently reuse the same dictionary (and document frequencies) and get meaningful feature vectors for a different set of documents?
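To make the question concrete, here is a rough, untested sketch of what I imagine the vectorization step would have to look like. I'm assuming the dictionary from the original run sits at vectors/dictionary.file-0 (a SequenceFile of Text -> IntWritable) and the document frequencies under vectors/df-count (IntWritable -> LongWritable), that seq2sparse tokenized with Lucene's StandardAnalyzer (the default, as far as I can tell), and that org.apache.mahout.vectorizer.TFIDF is the right weighting class to reproduce the weights; the class name NewDocVectorizer and the paths are just mine:

  // Untested sketch -- assumes Mahout 0.x, Hadoop 1.x and Lucene 3.6 on the
  // classpath; paths are what I believe seq2sparse writes by default.
  import java.io.StringReader;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.util.Version;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.TFIDF;

  public class NewDocVectorizer {

    // Load term -> index from the dictionary written by the original seq2sparse run.
    static Map<String, Integer> readDictionary(Configuration conf, Path path) throws Exception {
      Map<String, Integer> dict = new HashMap<String, Integer>();
      SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        dict.put(term.toString(), index.get());
      }
      reader.close();
      return dict;
    }

    // Load index -> document frequency from df-count; if I read the code right,
    // the special key -1 holds the total number of documents in the corpus.
    static Map<Integer, Long> readDfCounts(Configuration conf, Path path) throws Exception {
      Map<Integer, Long> df = new HashMap<Integer, Long>();
      SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
      IntWritable index = new IntWritable();
      LongWritable count = new LongWritable();
      while (reader.next(index, count)) {
        df.put(index.get(), count.get());
      }
      reader.close();
      return df;
    }

    // Turn one new document into a TF-IDF vector in the *original* feature space.
    static Vector vectorize(String text, Map<String, Integer> dictionary,
                            Map<Integer, Long> dfCounts, int numDocs) throws Exception {
      Map<Integer, Integer> termFreqs = new HashMap<Integer, Integer>();
      StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
      TokenStream ts = analyzer.tokenStream("text", new StringReader(text));
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      int docLength = 0;
      while (ts.incrementToken()) {
        docLength++;
        Integer index = dictionary.get(termAtt.toString());
        if (index != null) { // terms unseen in the training corpus are simply dropped
          Integer n = termFreqs.get(index);
          termFreqs.put(index, n == null ? 1 : n + 1);
        }
      }
      ts.end();
      ts.close();
      analyzer.close();

      // Weight with the *training-time* document frequencies so the vector is
      // comparable to the existing centroids.
      TFIDF tfidf = new TFIDF();
      Vector vector = new RandomAccessSparseVector(dictionary.size());
      for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
        Long df = dfCounts.get(e.getKey());
        long dfValue = df == null ? 1L : df.longValue();
        vector.setQuick(e.getKey(),
            tfidf.calculate(e.getValue(), (int) dfValue, docLength, numDocs));
      }
      return vector;
    }
  }

I'd then compare the resulting vector against the final k-means centroids using the same distance measure I clustered with. Does this look like the sane route, or is there built-in support for it that I'm missing?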
