On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:
A couple of questions about term frequencies and stemming:
- What's the best way to get the most common unstemmed form of a
Porter-stemmed word from the index? For example given the stem
'walk', find that 'walking' is the most common full word in the index.
Are both in the index? I would think this is going to take some
application-specific logic, since Lucene doesn't inherently track
the relation between a stem and the original words it came from.
You might be able to string something together using some of the
regular expression/wildcard queries against an unstemmed field, but
it is going to take some work on your part.
Another approach might be to put some mechanisms in place during
analysis that track this information.
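
As a rough illustration of that second approach, here is a minimal
sketch in plain Python (no Lucene calls) that counts the surface forms
seen for each stem while tokens pass through analysis. The names
SurfaceFormTracker and toy_stem are made up for this example, and
toy_stem is only a crude stand-in for a real Porter stemmer; you would
plug in whatever stemmer your analyzer actually uses.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class SurfaceFormTracker:
    """Records how often each unstemmed word maps to each stem."""

    def __init__(self, stem=toy_stem):
        self.stem = stem
        self.forms = defaultdict(Counter)  # stem -> Counter of original words

    def analyze(self, text):
        """Tokenize on whitespace, record each word, return the stems."""
        stems = []
        for token in text.lower().split():
            s = self.stem(token)
            self.forms[s][token] += 1
            stems.append(s)
        return stems

    def most_common_form(self, stem):
        """Return the most frequent original word for a stem, or None."""
        counts = self.forms.get(stem)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


tracker = SurfaceFormTracker()
tracker.analyze("walking walked walking walks")
tracker.analyze("she was walking home")
print(tracker.most_common_form("walk"))  # prints: walking
```

The counts live outside the index here, so they would have to be
persisted alongside it (or rebuilt on reindex) to stay in sync.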
- Is there a way to get a list of all the terms in the index (or
maybe just the top n) ordered by descending frequency of usage? I
imagine it's related to docFreq, but can't see how to get a list of
terms in all documents.
Have a look at Luke if you just want the info as part of a UI. Also,
I _believe_ Solr has added a LukeRequestHandler (see
http://wiki.apache.org/solr/LukeRequestHandler). I'm not sure if it
does everything you are looking for, but it might be a place to start.
You might ask your question on the Solr mailing list.
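
If you end up doing it by hand, the ranking step itself is simple once
you have (term, docFreq) pairs. The sketch below is plain Python and
assumes you have already collected those pairs, for example by walking
the term enumeration returned by IndexReader.terms() and reading
docFreq() for each entry. The function name top_terms and the sample
numbers are made up for illustration.

```python
import heapq


def top_terms(term_freqs, n=10):
    """Return the n (term, doc_freq) pairs with the highest doc_freq.

    term_freqs is any iterable of (term, doc_freq) pairs, so the terms
    can be streamed from the index without sorting the whole list.
    """
    return heapq.nlargest(n, term_freqs, key=lambda pair: pair[1])


# Made-up sample data standing in for terms read from an index.
sample = [("walk", 42), ("lucen", 17), ("index", 99), ("stem", 5)]
print(top_terms(sample, n=2))  # prints: [('index', 99), ('walk', 42)]
```

Note that docFreq counts documents containing the term, not total
occurrences, so it only approximates "frequency of usage".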
I'm using PyLucene and Solr, so if there are easy solutions in
either of those that would be ideal.
Thanks,
alf.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ