On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:

A couple of questions about term frequencies and stemming:

- What's the best way to get the most common unstemmed form of a Porter-stemmed word from the index? For example given the stem 'walk', find that 'walking' is the most common full word in the index.

Are both in the index? I would think this is going to take some application-specific logic, since Lucene doesn't inherently track the relation between a stem and its original forms. You might be able to string something together using some of the regular expression/wildcard queries, but it is going to take some work on your part.

Another approach might be to put some mechanisms in place during analysis that track this information.
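As a rough illustration of tracking this during analysis, here is a minimal pure-Python sketch. It is not Lucene or PyLucene code: `toy_stem` is a hypothetical stand-in for a real Porter stemmer, and `SurfaceFormTracker` is an invented name. The idea is only that the analysis step counts each unstemmed form under its stem as tokens pass through.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class SurfaceFormTracker:
    """Counts the unstemmed forms seen for each stem during analysis."""

    def __init__(self, stem=toy_stem):
        self.stem = stem
        self.forms = defaultdict(Counter)  # stem -> Counter of original forms

    def analyze(self, text):
        """Tokenize on whitespace, record each form under its stem, return the stems."""
        stems = []
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")
            if not token:
                continue
            stemmed = self.stem(token)
            self.forms[stemmed][token] += 1
            stems.append(stemmed)
        return stems

    def most_common_form(self, stem):
        """Return the most frequent unstemmed form for a stem, or None if unseen."""
        counts = self.forms.get(stem)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


tracker = SurfaceFormTracker()
tracker.analyze("She was walking while he walked. Walking is fun, so they keep walking.")
print(tracker.most_common_form("walk"))
```

In a real setup the counts would have to be persisted alongside the index (or in a separate store), since they live outside anything Lucene tracks itself.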


- Is there a way to get a list of all the terms in the index (or maybe just the top n) ordered by descending frequency of usage? I imagine it's related to docFreq, but can't see how to get a list of terms in all documents.

Have a look at Luke if you just want the info as part of a UI. Also, I _believe_ Solr has added a LukeRequestHandler (see http://wiki.apache.org/solr/LukeRequestHandler), not sure if it does everything you are looking for, but it might be a place to start. You might ask your question on the Solr mailing list.
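If you end up doing this in your own code rather than through Luke, the ranking step itself is simple once you can enumerate terms with their document frequencies. The sketch below is plain Python and assumes you already have (term, docFreq) pairs from somewhere; as far as I know the Lucene Java API exposes these through the term enumeration returned by IndexReader.terms(), but how that surfaces in PyLucene is not shown here. The name `top_terms` and the sample data are made up for illustration.

```python
import heapq


def top_terms(term_doc_freqs, n=10):
    """Return the n (term, doc_freq) pairs with the highest doc_freq, descending.

    term_doc_freqs is any iterable of (term, doc_freq) pairs, so the full term
    list never has to be sorted or held in memory at once.
    """
    return heapq.nlargest(n, term_doc_freqs, key=lambda pair: pair[1])


# Made-up sample data standing in for an enumeration over the index's terms.
sample = [("walk", 42), ("run", 17), ("lucene", 99), ("index", 63), ("stem", 5)]

for term, doc_freq in top_terms(sample, n=3):
    print(term, doc_freq)
```

Using a bounded heap keeps memory proportional to n rather than to the number of distinct terms, which matters on a large index.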


I'm using PyLucene and Solr, so if there are easy solutions in either of those that would be ideal.

Thanks,
alf.



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


