Re: Extract terms not by reader, but by documents

Karl Wettin Wed, 05 Sep 2007 11:56:57 -0700

Rafael, are you looking for IndexReader.getTermFreqVector?


--
karl

5 sep 2007 kl. 16.48 skrev Rafael Rossini:

Thank´s for the reply Grant, let me try to explain exactly what I´dlike to

do. Take the 2 docs:

Doc1: "Microsoft is a nice software company, and Xbox seems to be anice

product too."

Doc2: "Nintendo and Sony have been in the game industry for a longtime, but

now, Microsoft is trying to enter with Xbox"

Now If I have a query like this, "(Nintendo AND Sony AND Microsoft) OR

(Xbox)" and perform a query.rewrite(reader).extractTerms(set), myset isgoing to have all the terms (Nintendo, Sony, Microsoft, Xbox)right? But

when I iterate the docs, I wanted to do something like

query.rewrite(doc).extracTerms(set), [I know this method does notexist, is

just and example of the funcionality].

So, for Doc1, my set would be populate with (Xbox) only, and forDoc2 the

set would be populated with (Nintendo, Sony, Microsoft, Xbox).
Is this possible? Is it clear now what I´m trying to achieve?

[]s
     Rossini

On 9/4/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


Not sure if I am understanding what you are trying to do.  I think
you are trying to find out which terms occurred in a particular
document, correct?

I also am not sure about your first example.  My understanding of
extractTerms is that it just gives you back the set of all terms that
occur in the _query_, not necessarily those that matched in the
document, although it has this effect for things like WildcardQuery
and others that get expanded using TermEnum since they are expanded
based on what is in the index.  I think this is best seen by the
implementation of extractTerms() in TermQuery.java in which it just
adds the term from the query into the set.  Likewise for BooleanQuery
which loops over the clauses and extracts the terms from each clause
and adds them to the set.  Thus, if you had a boolean query of all
term queries, you would get back the set of all the terms.

As for the problem it sounds like you are interested in, you could
use SpanQuery functionality with some post processing analysis or try
using Term Vectors and the new (unreleased) TermVectorMapper (TVM)
functionality (or possibly a combination of both).  In this case, you
will need to write your own implementation of the TVM that takes in
the query so it knows what terms to identify. If you go the latter
route, know that it is new functionality and probably doesn't have a
whole lot of users yet, so there may still be issues with it.  See
the nightly build or nightly javadocs for info on these.

The other question that might be helpful, is what custom highlighting
are you doing that isn't covered by the contrib/highlighter?  Perhaps
you have some suggestions that are generic enough to help improve
it?  Just a thought.

Hope this helps,
Grant

On Sep 4, 2007, at 5:01 PM, Rafael Rossini wrote:

Hi all,

    In some custom highlighting, I often write a code like this:

       Set<Term> matchedTerms = new HashSet<Term>();
       query.rewrite(reader).extractTerms(matchedTerms);

    With this code the Term Set gets populated by the matched query
in your
whole index. Is it possible to this with a document instead of the
reader?
Something like
query.rewrite(documentId).extractTerms(matchedTerms) ?

[]s
     Rossini


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Extract terms not by reader, but by documents

Reply via email to