Re: Term Weights and Clustering

David Spencer Wed, 23 Feb 2005 10:43:26 -0800

I'm a little confused on exactly, exactly what you want but if your goal is to cluster your papers w/ carrot2 then I found these links helpful:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Only caveat is I found that carrot2 tends to not scale beyond 200 or so docs, though this probably depends on length of docs & the # of different tokens.

I was able to use the above to integ w/ a lucene search results page in just an hour or so.


Owen Densmore wrote:

I'm building a TDM (Term Document Matrix) from my lucene index. As part of this, it would be useful to have the document term weights (the TF*IDF-weight) if they are already available. Naturally I can compute them, but I suspect they are lurking behind an API I've not discovered yet. Is there an API for getting them?

I'm doing this as a first step in discovering a good set of clustering labels. My data collection is 1200 research papers, all of which have good meta data: titles, authors, abstracts, keyphrases and so on.

One source for how to do this is the thesis of Stanislaw Osinski and others like it: http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm And the Carrot2 project which uses similar techniques. http://www.cs.put.poznan.pl/dweiss/carrot/

My problem is simple: I need a fairly clear discussion on exactly how to generate the labels, and to assign documents to them. The thesis is quite good, but I'm not sure I can reduce it to practice in the 2-3 days I have to evaluate it! Lucene has made the TDM easy to calculate, but I basically don't know what to do next!

Can anyone comment on whether or not this will work, and if so, suggest a quick way to get a demo on the air? For example, I don't seem to be able to ask Carrot2 to do a Google "site" search. If I could, I could simply aim Carrot2 at my collection with a very general search and see what clusters it discovers. This may be a gross misuse of Carrot2's clustering anyway, so could easily be a blind alley.

Or is there a different stunt with Lucene that might work? For example, use Lucene to cluster the docs using a batch search where the queries are Library of Congress descriptions! Batch searching is *really fast* in Lucene -- I've been able to search the data collection against each distinct keyphrase in seconds!
Owen
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Term Weights and Clustering

Reply via email to