Re: Stemmed terms/common terms

Grant Ingersoll Thu, 16 Aug 2007 09:07:01 -0700


On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:

A couple of questions about term frequencies and stemming:
- What's the best way to get the most common unstemmed form of a Porter-stemmed word from the index? For example, given the stem 'walk', find that 'walking' is the most common full word in the index.

Are both in the index? I would think this is going to take some application-specific logic, since Lucene doesn't inherently track these relations. You might be able to string something together using some of the regular expression/wildcard queries, but it is going to take some work on your part.

Another approach might be to put some mechanisms in place during analysis that track this information.
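
A rough sketch of what tracking this during analysis could look like, in plain Python rather than actual PyLucene analyzer code. The names `StemTracker` and `toy_stem` are made up for illustration, and `toy_stem` is only a crude stand-in for a real Porter stemmer. The idea is just to count each surface form under its stem as tokens pass through:

```python
import re
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class StemTracker:
    """Records which unstemmed forms were seen for each stem during analysis."""

    def __init__(self, stem=toy_stem):
        self.stem = stem
        # stem -> Counter of the original (unstemmed) tokens that produced it
        self.forms = defaultdict(Counter)

    def analyze(self, text):
        """Tokenize and stem the text, counting each surface form under its stem."""
        stems = []
        for token in re.findall(r"[a-z]+", text.lower()):
            stemmed = self.stem(token)
            self.forms[stemmed][token] += 1
            stems.append(stemmed)
        return stems

    def most_common_form(self, stem):
        """Return the most frequent unstemmed form for a stem, or None if unseen."""
        counts = self.forms.get(stem)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


tracker = StemTracker()
tracker.analyze("Walking is fine. She walked, he walks, they were walking.")
print(tracker.most_common_form("walk"))
```

In a real setup the counts would need to be persisted alongside the index (or in a separate store), since they live outside Lucene.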

- Is there a way to get a list of all the terms in the index (or maybe just the top n) ordered by descending frequency of usage? I imagine it's related to docFreq, but can't see how to get a list of terms in all documents.

Have a look at Luke if you just want the info as part of a UI. Also, I _believe_ Solr has added a LukeRequestHandler (see http://wiki.apache.org/solr/LukeRequestHandler), not sure if it does everything you are looking for, but it might be a place to start. You might ask your question on the Solr mailing list.
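
If you end up doing this in code instead, the ranking step is small once you have (term, docFreq) pairs. Below is a minimal sketch in plain Python. The function name `top_terms` and the sample data are made up for illustration. In Lucene the pairs would, as far as I know, come from walking the term enumeration returned by `IndexReader.terms()` and reading each term's text and `docFreq()`; that part is not shown here.

```python
import heapq


def top_terms(term_freqs, n=10):
    """Return the n (term, doc_freq) pairs with the highest doc_freq, descending.

    term_freqs is any iterable of (term, doc_freq) pairs, so the whole term
    list never has to be sorted or held in memory at once.
    """
    return heapq.nlargest(n, term_freqs, key=lambda pair: pair[1])


# Made-up frequencies standing in for values read from an index.
sample = [("walk", 12), ("the", 40), ("index", 7), ("lucene", 25)]
print(top_terms(sample, n=2))
```

Using `heapq.nlargest` keeps only n candidates at a time, which matters when the index has a large number of distinct terms.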

I'm using PyLucene and Solr, so if there are easy solutions in either of those that would be ideal.
Thanks,
alf.



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


