It looks to me like what you're trying to do is akin to document similarity, which I haven't had to delve into. But it's been discussed on the user list a few times, so perhaps your best bet would be to search the mail archives for that topic.
Best Erick On Thu, Feb 19, 2009 at 3:14 AM, Nada Mimouni < mimo...@tk.informatik.tu-darmstadt.de> wrote: > > Hello, > > Thank you Erick for this detailed answer, that makes things clearer in my > mind. > > >I'm still not clear why the built-in phrase query syntax won't work. > > I have programmed a set of java classes (I use Lucene classes) to index and > search into a collection of documents for a set of queries. > To test my system, I use a corpus which consists in a collection of queries > (n queries) and documents (m documents). > I started by creating one index for all queries and another one for all > documents. Then I make the search to match between the queries index and > documents index. > I use a trec evaluation tool to generate a file that gives all hits > (matches) between the queryID and documentID with different scores. > > In this first step, I just index terms, therefore the search process (as I > have it now) looks only for term matches between the query terms and the > documents terms. > Now I want to get better results (better matching) by adding phrases to > terms. > > I don't know exactly whether it makes a difference if I index phrases and > terms (erick, erickson, thinks, small, thoughts, erick erickson, erickson > thinks, small thoughts, erickson thoughts) and then search for both, or just > keep the indexing process as it is (erick, erickson, thinks, small, > thoughts) and then make a search for phrases (PhraseQuery : erick erickson, > erickson thinks, small thoughts, erickson thoughts) and terms. > Any idea? > > > >Some examples of what you put in your index and what searches > >you expect to return results for your example AND searches you do > >NOT want to hit that document would be a great help. > > input: > > *Query* > 898 Why is the sun bright? > > *Documents* > 7568 Star, large celestial body composed of gravitationally contained hot > gases emitting electromagnetic radiation, especially light, as a result of > nuclear reactions inside the star. The sun is a star. > 7567 The sun has a magnitude of -26.7, inasmuch as it is about 10 billion > times as bright as Sirius in the earth's sky. > > output: > > qID dID score > 898 7568 0,13 (not relevant) > 898 7567 1 (relevant) > > > In this example, Lucene matches document 7567 to be relevant to he query > (since it contains all query terms), however bright here is relative to > Sirius (what we need is to get "sun bright"). > > > > > Best > Nada > > > -----Original Message----- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Wed 2/18/2009 3:24 PM > To: java-user@lucene.apache.org > Subject: Re: Phrase indexing and searching with Lucene > > I'm still not clear why the built-in phrase query syntax won't work. If I > index the following terms (erick, erickson, thinks, small, thoughts) > in a single field, then searching for "erick erickson" (as a phrase query, > i.e. with double quotes when sent through a query parser or constructing > a PhraseQuery yourself) will generate a hit but "erick thinks" won't > generate a hit (unless you specify slop). > > "thinks small thoughts" would also generate a hit > > If you're saying that you only want to match on *all* the tokens, i.e. > the only way to get a hit on the above would be to search for > "erick erickson thinks small thoughts", then you can create a > field that's UN_ANALYZED. If you do this, though, beware > that you have to do things like lower-case terms yourself when > indexing. > > I have no idea what IndexTermGenerator is or what it does, but I'm > assuming that it just generates single words. > > Some examples of what you put in your index and what searches > you expect to return results for your example AND searches you do > NOT want to hit that document would be a great help. > > As far as searching for both, constructing a BooleanQuery with regular > TermQuerys and PhraseQuerys would work if you're constructing > your queries programmatically, or just using a Lucene query > like +termfield:word +phrasefield:"erick erickson thinks" would > work. Or, if you just require that the phrase exists you could do > it all in one field like > +field:word +field:"erick erickson thinks" > > > > Best > Erick > > > On Wed, Feb 18, 2009 at 8:42 AM, Nada Mimouni < > mimo...@tk.informatik.tu-darmstadt.de> wrote: > > > > > > > Thank you Erick. > > > > I need first to index phrases, the built-in phrase processing (with > double > > quotes) comes in the search step. > > Is there any difference between : > > 1) start by indexing phrases and then make a phrase search > > 2) index terms and then search for phrases > > > > > > To make things clearer: > > > > What I am doing now: > > - In the indexing step: I am using "IndexTermGenerator" to generate > term > > based indexes, one index for all queries I have and another one for > > documents (term means single word). > > - In the search step : Lucene matches terms in queries index with terms > in > > documents index. > > > > What I need to do: > > - Index phrases ("multi" words) in addition to terms (single words) > > - Search for both : phrases and terms > > > > > > Is there any idea on how to proceed? > > > > Regards > > Nada > > > > > > -----Original Message----- > > From: Erick Erickson [mailto:erickerick...@gmail.com] > > Sent: Wed 2/18/2009 2:10 PM > > To: java-user@lucene.apache.org > > Subject: Re: Phrase indexing and searching with Lucene > > > > Have you tried the built-in phrase processing with double quotes? e.g. > > "this is a phrase"? > > > > See the Term section at > > http://lucene.apache.org/java/2_4_0/queryparsersyntax.html > > > > Best > > Erick > > > > On Wed, Feb 18, 2009 at 5:57 AM, Nada Mimouni < > > mimo...@tk.informatik.tu-darmstadt.de> wrote: > > > > > > > > > > > Hello everybody, > > > > > > I use Lucene to index and search into text documents. > > > At present, I just index and search for single words. I want to extend > > this > > > to phrases (or nGrams). > > > > > > Could anyone please give me details on how to index phrases and then > make > > a > > > phrase search? > > > > > > Thank you very much in advance for your help. > > > > > > Nada Mimouni > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org >