Re: Extract terms not by reader, but by documents

Grant Ingersoll Tue, 04 Sep 2007 19:05:03 -0700

Not sure if I am understanding what you are trying to do. I think you are trying to find out which terms occurred in a particular document, correct?

I also am not sure about your first example. My understanding of extractTerms is that it just gives you back the set of all terms that occur in the _query_, not necessarily those that matched in the document, although it has this effect for things like WildcardQuery and others that get expanded using TermEnum, since they are expanded based on what is in the index. I think this is best seen by the implementation of extractTerms() in TermQuery.java, in which it just adds the term from the query into the set. Likewise for BooleanQuery, which loops over the clauses, extracts the terms from each clause, and adds them to the set. Thus, if you had a boolean query of all term queries, you would get back the set of all the terms.
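
A minimal sketch of that behaviour, using simplified stand-in classes rather than the real Lucene ones. The class ExtractTermsSketch, its nested Query, TermQuery and BooleanQuery types, and the termsOf helper are all hypothetical and only model the idea that extractTerms collects the terms of the query without consulting any document:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for illustration only; these are NOT Lucene's classes.
final class ExtractTermsSketch {

    /** Hypothetical minimal query interface exposing only extractTerms(). */
    interface Query {
        void extractTerms(Set<String> terms);
    }

    /** Single-term query: adds its own term, whatever the index contains. */
    static final class TermQuery implements Query {
        private final String term;

        TermQuery(String term) {
            this.term = term;
        }

        @Override
        public void extractTerms(Set<String> terms) {
            terms.add(term);
        }
    }

    /** Boolean query: loops over its clauses and delegates to each one. */
    static final class BooleanQuery implements Query {
        private final List<Query> clauses = new ArrayList<>();

        BooleanQuery add(Query clause) {
            clauses.add(clause);
            return this;
        }

        @Override
        public void extractTerms(Set<String> terms) {
            for (Query clause : clauses) {
                clause.extractTerms(terms);
            }
        }
    }

    /** Collects the terms of a query in first-seen order. */
    static Set<String> termsOf(Query query) {
        Set<String> terms = new LinkedHashSet<>();
        query.extractTerms(terms);
        return terms;
    }

    public static void main(String[] args) {
        Query query = new BooleanQuery()
                .add(new TermQuery("lucene"))
                .add(new TermQuery("highlight"));
        // Both terms come back even though no document was ever inspected.
        System.out.println(termsOf(query)); // prints [lucene, highlight]
    }
}
```

No reader or document id appears anywhere in the sketch, which is why the result says nothing about which terms a given document actually contains.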

As for the problem it sounds like you are interested in, you could use SpanQuery functionality with some post-processing analysis, or try using Term Vectors and the new (unreleased) TermVectorMapper (TVM) functionality (or possibly a combination of both). In this case, you will need to write your own implementation of the TVM that takes in the query so it knows what terms to identify. If you go the latter route, know that it is new functionality and probably doesn't have a whole lot of users yet, so there may still be issues with it. See the nightly build or nightly javadocs for info on these.
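
The post-processing step can be pictured roughly as follows. This is a sketch in plain Java collections, not the TermVectorMapper API: it assumes you already have the query's terms and the list of terms stored for one document (for example, read from that document's term vector), and the class DocumentTermFilterSketch and method matchedInDocument are hypothetical names:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain collections stand in for Lucene's term vector data.
final class DocumentTermFilterSketch {

    /**
     * Keeps only the query terms that also occur in a single document.
     * documentTerms is assumed to be the term list obtained for that document.
     */
    static Set<String> matchedInDocument(Set<String> queryTerms, List<String> documentTerms) {
        Set<String> matched = new LinkedHashSet<>(queryTerms);
        // Copy into a HashSet so each membership check is constant time.
        matched.retainAll(new HashSet<>(documentTerms));
        return matched;
    }

    public static void main(String[] args) {
        Set<String> queryTerms = new LinkedHashSet<>(Arrays.asList("lucene", "highlight", "vector"));
        List<String> documentTerms = Arrays.asList("term", "vector", "lucene", "index");
        // "highlight" is in the query but not in this document, so it is dropped.
        System.out.println(matchedInDocument(queryTerms, documentTerms)); // prints [lucene, vector]
    }
}
```

A real TVM implementation would do this filtering while the term vector is being read instead of materialising the whole term list first.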

The other question that might be helpful is: what custom highlighting are you doing that isn't covered by the contrib/highlighter? Perhaps you have some suggestions that are generic enough to help improve it? Just a thought.


Hope this helps,
Grant

On Sep 4, 2007, at 5:01 PM, Rafael Rossini wrote:

Hi all,

    In some custom highlighting, I often write code like this:

       Set<Term> matchedTerms = new HashSet<Term>();
       query.rewrite(reader).extractTerms(matchedTerms);
With this code, the Term set gets populated with the terms the query matches in your whole index. Is it possible to do this with a document instead of the reader?
Something like
query.rewrite(documentId).extractTerms(matchedTerms) ?

[]s
     Rossini


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


