: Document 1: : "united states is .... United airlines operates in 50 states. United : states government....." : : Document 2: : "united states is .... United airlines operates in 50 states. United : some other word states" : : If you consider the tf-idf weight of individual terms "united" and : "states" they would have exact score in both the documents. But the term : "united states" should have higher weight in Document 1. The higher : weight can be achieved through bi-grams which would include the word : "united states". My guess is, lucene will retrieve both documents with : equal score.
I believe yyour intuition is correct, but did you try what i suggested below? .. i don't believe you have to index all of hte possible n-grams, just construct a boolean query containing the n-grams, and boost the larger n-grams (by some factor only experimenting will tell you). you should also try out SpanNearQuery ... I believe it scores thingsw higher if they appear closer together -- but i'm not confident of my memory in htat case. Seriously, try what i suggest below ... experimentation is the best way to answer questions like this... : -----Original Message----- : Sent: Monday, July 18, 2005 5:11 PM : To: java-user@lucene.apache.org : Subject: RE: n-gram indexing : : : Your intuition is right, but i can't think of any reason why you need to : add the n-grams at indexing time -- or why using phrase queries would be : a bad thing in this case. When you get a multiword query, construct the : n-grams of the query words as multiple phrase queries and search for a : BooleanQuery of all those phrases (with bigger boosts given to longer : phrases) : : Using your example of "united states of america" generate a query that : looks something like... : : united states america : "united states"^10 "states america"^10 : "united states america"^100 -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]