http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
Only caveat is I found that carrot2 tends to not scale beyond 200 or so docs, though this probably depends on length of docs & the # of different tokens.
I was able to use the above to integ w/ a lucene search results page in just an hour or so.
Owen Densmore wrote:
I'm building a TDM (Term Document Matrix) from my lucene index. As part of this, it would be useful to have the document term weights (the TF*IDF-weight) if they are already available. Naturally I can compute them, but I suspect they are lurking behind an API I've not discovered yet. Is there an API for getting them?
I'm doing this as a first step in discovering a good set of clustering labels. My data collection is 1200 research papers, all of which have good meta data: titles, authors, abstracts, keyphrases and so on.
One source for how to do this is the thesis of Stanislaw Osinski and others like it:
http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
And the Carrot2 project which uses similar techniques.
http://www.cs.put.poznan.pl/dweiss/carrot/
My problem is simple: I need a fairly clear discussion on exactly how to generate the labels, and to assign documents to them. The thesis is quite good, but I'm not sure I can reduce it to practice in the 2-3 days I have to evaluate it! Lucene has made the TDM easy to calculate, but I basically don't know what to do next!
Can anyone comment on whether or not this will work, and if so, suggest a quick way to get a demo on the air? For example, I don't seem to be able to ask Carrot2 to do a Google "site" search. If I could, I could simply aim Carrot2 at my collection with a very general search and see what clusters it discovers. This may be a gross misuse of Carrot2's clustering anyway, so could easily be a blind alley.
Or is there a different stunt with Lucene that might work? For example, use Lucene to cluster the docs using a batch search where the queries are Library of Congress descriptions! Batch searching is *really fast* in Lucene -- I've been able to search the data collection against each distinct keyphrase in seconds!
Owen
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]