I'm building a TDM (term-document matrix) from my Lucene index. As part of this, it would be useful to have the per-document term weights (the TF*IDF weights) if they are already available. Naturally I can compute them myself, but I suspect they are lurking behind an API I haven't discovered yet. Is there an API for getting them?
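In case they aren't exposed, here is the computation I'd fall back on. It's a minimal sketch, assuming the classic weighting (tf = sqrt(freq), idf = 1 + ln(N / (df + 1)), which is my understanding of what Lucene's DefaultSimilarity does) -- the document frequencies in main() are made-up toy numbers, not from my index:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the TF*IDF weight I'd compute myself if Lucene doesn't
// expose it. The formulas below are an assumption modeled on my reading
// of DefaultSimilarity, not a confirmed Lucene API.
class TfIdfSketch {

    // Sub-linear damping of the raw in-document term frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // docFreq = number of documents containing the term,
    // numDocs = size of the collection (1200 papers in my case).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double weight(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // Toy row of the TDM: term -> in-document frequency.
        Map<String, Integer> doc = new LinkedHashMap<>();
        doc.put("clustering", 4);
        doc.put("lucene", 1);
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            // Pretend "clustering" appears in 30 docs, "lucene" in 200.
            int df = e.getKey().equals("clustering") ? 30 : 200;
            System.out.printf("%s -> %.4f%n",
                e.getKey(), weight(e.getValue(), df, 1200));
        }
    }
}
```

Rarer terms dominate, which is what I want for label candidates: a term in 30 of 1200 docs outweighs one with the same in-document frequency that appears in 200.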

I'm doing this as a first step toward discovering a good set of clustering labels. My collection is 1200 research papers, all of which have good metadata: titles, authors, abstracts, keyphrases, and so on.

One source on how to do this is the thesis of Stanislaw Osinski, and others like it:
http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
and the Carrot2 project, which uses similar techniques:
http://www.cs.put.poznan.pl/dweiss/carrot/
My problem is simple: I need a fairly clear discussion of exactly how to generate the labels and how to assign documents to them. The thesis is quite good, but I'm not sure I can reduce it to practice in the 2-3 days I have to evaluate it! Lucene has made the TDM easy to calculate, but I basically don't know what to do next!

Can anyone comment on whether or not this will work and, if so, suggest a quick way to get a demo on the air? For example, I don't seem to be able to ask Carrot2 to do a Google "site" search. If I could, I could simply aim Carrot2 at my collection with a very general search and see what clusters it discovers. This may be a gross misuse of Carrot2's clustering anyway, so it could easily be a blind alley.

Or is there a different stunt with Lucene that might work? For example, use Lucene to cluster the docs using a batch search where the queries are Library of Congress descriptions! Batch searching is *really fast* in Lucene -- I've been able to search the data collection against each distinct keyphrase in seconds!
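To make the stunt concrete, here's roughly what I have in mind: run every candidate label as a query and hang each document on its best-scoring label. This sketch uses a toy term-overlap count as a stand-in for Lucene's real query score (in practice each keyphrase would go through IndexSearcher), and the example labels and terms are invented:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of "batch search as clustering": score every document against
// every label query, assign each document to its best-scoring label.
// overlap() is a toy substitute for a real Lucene relevance score.
class BatchLabelSketch {

    // Toy score: how many query terms occur in the document's term list.
    static int overlap(List<String> query, List<String> docTerms) {
        int hits = 0;
        for (String q : query) {
            if (docTerms.contains(q)) hits++;
        }
        return hits;
    }

    // Index of the label whose query best matches the document,
    // or -1 if no label matches at all.
    static int bestLabel(List<List<String>> queries, List<String> docTerms) {
        int best = -1;
        int bestScore = 0;
        for (int i = 0; i < queries.size(); i++) {
            int s = overlap(queries.get(i), docTerms);
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<String>> labels = Arrays.asList(
            Arrays.asList("information", "retrieval"),
            Arrays.asList("machine", "learning"));
        List<String> paper = Arrays.asList("learning", "kernels", "machine");
        System.out.println(bestLabel(labels, paper)); // prints 1
    }
}
```

The open question is whether a flat "best label wins" assignment like this gives sensible clusters, or whether documents need to land under several labels with a score threshold instead.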

Owen


