Re: Phrase indexing and searching with Lucene

Erick Erickson Thu, 19 Feb 2009 05:54:54 -0800

It looks to me like what you're trying to do is akin to document similarity,
which I haven't had to delve into. But it's been discussed on the user list a
few times, so perhaps your best bet would be to search the mail archives for
that topic.


Best
Erick

On Thu, Feb 19, 2009 at 3:14 AM, Nada Mimouni <
[email protected]> wrote:

>
> Hello,
>
> Thank you Erick for this detailed answer, that makes things clearer in my
> mind.
>
> >I'm still not clear why the built-in phrase query syntax won't work.
>
> I have written a set of Java classes (using the Lucene classes) to index a
> collection of documents and search it with a set of queries.
> To test my system, I use a corpus that consists of a collection of queries
> (n queries) and documents (m documents).
> I started by creating one index for all the queries and another one for all
> the documents. Then I run the search to match the queries index against the
> documents index.
> I use a TREC evaluation tool to generate a file that lists all hits
> (matches) between a queryID and a documentID, with their scores.
>
> In this first step, I index only terms, so the search process (as I
> have it now) looks only for term matches between the query terms and the
> document terms.
> Now I want to get better results (better matching) by adding phrases to
> the terms.
>
> I am not sure whether it makes a difference if I:
>            1) index both phrases and terms (erick, erickson, thinks, small,
>               thoughts, erick erickson, erickson thinks, small thoughts,
>               erickson thoughts) and then search for both, or
>            2) keep the indexing process as it is (erick, erickson, thinks,
>               small, thoughts) and then search for phrases (PhraseQuery:
>               erick erickson, erickson thinks, small thoughts, erickson
>               thoughts) as well as terms.
> Any idea?
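
One way to compare the two options is to look at what option 1 would actually put in the index. The following is only a sketch in plain Java with no Lucene classes; `ShingleSketch` and `termsAndBigrams` are made-up names for illustration, and the whitespace tokenizer is an assumption, not what a Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ShingleSketch {
    /**
     * Returns the single words of the text followed by every pair of
     * adjacent words (word bigrams). Hypothetical helper, not a Lucene API.
     */
    static List<String> termsAndBigrams(String text) {
        // Assumption: lower-case and split on whitespace only.
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> out = new ArrayList<>(Arrays.asList(words));
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }
}
```

Calling `ShingleSketch.termsAndBigrams("Erick Erickson thinks small thoughts")` yields the five words plus "erick erickson", "erickson thinks", "thinks small" and "small thoughts". Note that "erickson thoughts" from the list above is not produced, because the two words are not adjacent: it would need either a custom phrase extractor at index time, or a phrase search with slop over the plain term index. Lucene's contrib analyzers include a ShingleFilter that, I believe, does this kind of n-gram generation on a token stream.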
>
>
> >Some examples of what you put in your index and what searches
> >you expect to return results for your example AND searches you do
> >NOT want to hit that document would be a great help.
>
> input:
>
> *Query*
> 898    Why is the sun bright?
>
> *Documents*
> 7568  Star, large celestial body composed of gravitationally contained hot
> gases emitting electromagnetic radiation, especially light, as a result of
> nuclear reactions inside the star. The sun is a star.
> 7567  The sun has a magnitude of -26.7, inasmuch as it is about 10 billion
> times as bright as Sirius in the earth's sky.
>
> output:
>
> qID     dID     score
> 898     7568     0,13 (not relevant)
> 898     7567     1     (relevant)
>
>
> In this example, Lucene matches document 7567 as relevant to the query
> (since it contains all the query terms); however, "bright" there is
> relative to Sirius (what we need is to get "sun bright").
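
To make the example concrete, here is a rough sketch that measures how far apart two terms are in a text. It is plain Java with no Lucene classes; `ProximitySketch`, `tokens` and `minDistance` are hypothetical names, and the crude tokenizer (split on anything that is not a letter or digit) is an assumption that will not give the same positions as a Lucene analyzer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ProximitySketch {
    /** Lower-cases the text and splits on anything that is not a-z or 0-9. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
    }

    /** Smallest position gap between terms a and b, or -1 if either is absent. */
    static int minDistance(String text, String a, String b) {
        List<String> t = tokens(text);
        int best = -1;
        for (int i = 0; i < t.size(); i++) {
            if (!t.get(i).equals(a)) continue;
            for (int j = 0; j < t.size(); j++) {
                if (!t.get(j).equals(b)) continue;
                int d = Math.abs(i - j);
                if (best < 0 || d < best) best = d;
            }
        }
        return best;
    }
}
```

With this tokenizer, `minDistance` for "sun" and "bright" returns -1 on document 7568 (no "bright" at all) and a gap of more than a dozen positions on document 7567. So an exact phrase "sun bright" would match neither document, and a slop wide enough to reach 7567 still would not tell whether "bright" describes the sun or Sirius.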
>
>
>
>
> Best
> Nada
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:[email protected]]
> Sent: Wed 2/18/2009 3:24 PM
> To: [email protected]
> Subject: Re: Phrase indexing and searching with Lucene
>
> I'm still not clear why the built-in phrase query syntax won't work. If I
> index the following terms (erick, erickson, thinks, small, thoughts)
> in a single field, then searching for "erick erickson" (as a phrase query,
> i.e. with double quotes when sent through a query parser or constructing
> a PhraseQuery yourself) will generate a hit but "erick thinks" won't
> generate a hit (unless you specify slop).
>
> "thinks small thoughts" would also generate a hit
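
The hits and misses above follow from matching on term positions. A minimal sketch of that idea in plain Java follows; it uses no Lucene classes, `PhraseSketch` and `matches` are hypothetical names, and the slop handling is a simplification (in-order gaps only), whereas Lucene's slop also allows some reordering.

```java
import java.util.List;

class PhraseSketch {
    /**
     * True if the phrase words occur in order in the tokens, with at most
     * `slop` extra positions in total between consecutive phrase words.
     * Assumes a non-empty phrase. Simplified: no reordering is allowed.
     */
    static boolean matches(List<String> tokens, List<String> phrase, int slop) {
        for (int start = 0; start < tokens.size(); start++) {
            if (tokens.get(start).equals(phrase.get(0))
                    && matchFrom(tokens, phrase, 1, start, slop)) {
                return true;
            }
        }
        return false;
    }

    // Tries to place phrase word k after position prev, spending slop on gaps.
    private static boolean matchFrom(List<String> tokens, List<String> phrase,
                                     int k, int prev, int slopLeft) {
        if (k == phrase.size()) {
            return true;
        }
        for (int pos = prev + 1;
                pos < tokens.size() && pos - prev - 1 <= slopLeft; pos++) {
            int gap = pos - prev - 1;
            if (tokens.get(pos).equals(phrase.get(k))
                    && matchFrom(tokens, phrase, k + 1, pos, slopLeft - gap)) {
                return true;
            }
        }
        return false;
    }
}
```

Over the tokens (erick, erickson, thinks, small, thoughts), the phrase (erick, erickson) matches with slop 0, (erick, thinks) fails with slop 0 but matches with slop 1, and (erickson, thoughts) needs slop 2.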
>
> If you're saying that you only want to match on *all* the tokens, i.e.
> the only way to get a hit on the above would be to search for
> "erick erickson thinks small thoughts", then you can create a
> field that's NOT_ANALYZED (UN_TOKENIZED before Lucene 2.4). If you do this, though, beware
> that you have to do things like lower-case terms yourself when
> indexing.
>
> I have no idea what IndexTermGenerator is or what it does, but I'm
> assuming that it just generates single words.
>
> Some examples of what you put in your index and what searches
> you expect to return results for your example AND searches you do
> NOT want to hit that document would be a great help.
>
> As far as searching for both, constructing a BooleanQuery with regular
> TermQuerys and PhraseQuerys would work if you're constructing
> your queries programmatically, or just using a Lucene query
> like +termfield:word +phrasefield:"erick erickson thinks" would
> work. Or, if you just require that the phrase exists you could do
> it all in one field like
> +field:word +field:"erick erickson thinks"
>
>
>
> Best
> Erick
>
>
> On Wed, Feb 18, 2009 at 8:42 AM, Nada Mimouni <
> [email protected]> wrote:
>
> >
> >
> > Thank you Erick.
> >
> > I first need to index phrases; the built-in phrase processing (with double
> > quotes) only comes in at the search step.
> > Is there any difference between :
> >            1) start by indexing phrases and then make a phrase search
> >            2) index terms and then search for phrases
> >
> >
> > To make things clearer:
> >
> > What I am doing now:
> >  - In the indexing step: I am using "IndexTermGenerator" to generate
> >    term-based indexes, one index for all the queries I have and another
> >    one for the documents (term means single word).
> >  - In the search step: Lucene matches terms in the queries index with
> >    terms in the documents index.
> >
> > What I need to do:
> >  - Index phrases ("multi" words) in addition to terms (single words)
> >  - Search for both : phrases and terms
> >
> >
> > Is there any idea on how to proceed?
> >
> > Regards
> > Nada
> >
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:[email protected]]
> > Sent: Wed 2/18/2009 2:10 PM
> > To: [email protected]
> > Subject: Re: Phrase indexing and searching with Lucene
> >
> > Have you tried the built-in phrase processing with double quotes? e.g.
> > "this is a phrase"?
> >
> > See the Term section at
> > http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
> >
> > Best
> > Erick
> >
> > On Wed, Feb 18, 2009 at 5:57 AM, Nada Mimouni <
> > [email protected]> wrote:
> >
> > >
> > >
> > > Hello everybody,
> > >
> > > I use Lucene to index and search into text documents.
> > > At present, I just index and search for single words. I want to extend
> > > this to phrases (or nGrams).
> > >
> > > Could anyone please give me details on how to index phrases and then make
> > > a phrase search?
> > >
> > > Thank you very much in advance for your help.
> > >
> > > Nada Mimouni
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >
> >
>
>
