> I was basically thinking of using lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
>
> I could then run this data through k-means or SOM algorithms for
> calculating clusters

First of all, I think it would already be great if there were some way to simply store document vectors during indexing, so that you could later call something like IndexSearcher.docTerms(int i) to retrieve either a BitSet or an array of floats, weighted so that frequent terms get lower values.

One difficulty I see here is that terms don't seem to have any unique identifiers, so I guess you'd have to manage those yourself...

--
Eric Jain
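
A minimal sketch of the bookkeeping described above, in plain Java with no Lucene dependency. Everything in it is an assumption made for illustration: the class TermVectorSketch, its methods, and the weighting (raw term count times ln(N/df) + 1) are hypothetical, and docTerms here only mirrors the wished-for method name; it is not an existing Lucene call. It assigns each distinct term an integer identifier on first sight and returns a float array per document in which terms that occur in many documents get lower weights.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: keeps its own term ids and
// per-document term counts so that weighted vectors can be built later.
final class TermVectorSketch {
    private final Map<String, Integer> termIds = new HashMap<>();
    // docFreq.get(id) = number of documents that contain the term with that id
    private final List<Integer> docFreq = new ArrayList<>();
    // one map per document: term id -> raw count in that document
    private final List<Map<Integer, Integer>> docs = new ArrayList<>();

    // Adds an already tokenized document and returns its document number.
    int addDocument(String... tokens) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            Integer id = termIds.get(token);
            if (id == null) {              // first time this term is seen
                id = termIds.size();
                termIds.put(token, id);
                docFreq.add(0);
            }
            counts.merge(id, 1, Integer::sum);
        }
        for (int id : counts.keySet()) {   // each term counts once per document
            docFreq.set(id, docFreq.get(id) + 1);
        }
        docs.add(counts);
        return docs.size() - 1;
    }

    // Returns the id assigned to a term, or -1 if the term is unknown.
    int termId(String term) {
        Integer id = termIds.get(term);
        return id == null ? -1 : id;
    }

    // Dense weighted vector for one document, indexed by term id.
    float[] docTerms(int doc) {
        float[] vector = new float[termIds.size()];
        for (Map.Entry<Integer, Integer> e : docs.get(doc).entrySet()) {
            int id = e.getKey();
            double idf = Math.log((double) docs.size() / docFreq.get(id)) + 1.0;
            vector[id] = (float) (e.getValue() * idf);
        }
        return vector;
    }

    // Cosine similarity of two equal-length vectors; 0 if either is all zeros.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```

Vectors produced this way could be handed to a k-means or SOM implementation, with cosine (or any custom measure) as the distance. Note that the weights depend on the collection size, so vectors should be generated after all documents have been added.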