I was wondering how Lucene's phrase query would work with n-gram indexing. There are two possible schemes for position increments when adding n-grams to the index. For example, consider the n-grams (up to tri-grams) of "united states of america".

Scenario 1:

  Index position   Tokens
  0                "united"
  1                "states", "united states"
  2                "of", "states of", "united states of"
  3                "america", "of america", "states of america"

Scenario 2:

  Index position   Tokens
  0                "united", "united states", "united states of"
  1                "states", "states of", "states of america"
  2                "of", "of america"
  3                "america"

Does Lucene's phrase query performance degrade (or does it fail to match) if I choose one scenario over the other?

Thanks,
Rajesh Munavalli
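To make the two layouts concrete, here is a minimal sketch in plain Java (no Lucene dependency) that reproduces both tables. The class and method names are hypothetical and chosen only for illustration: Scenario 1 files each n-gram under the position of its last word, Scenario 2 under the position of its first word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds position -> tokens lists.
class NGramPositions {

    // Scenario 1: each n-gram is stored at the position of its LAST word.
    static List<List<String>> byLastWord(String[] words, int maxN) {
        List<List<String>> positions = new ArrayList<>();
        for (int end = 0; end < words.length; end++) {
            List<String> tokens = new ArrayList<>();
            for (int n = 1; n <= maxN && end - n + 1 >= 0; n++) {
                tokens.add(String.join(" ", Arrays.copyOfRange(words, end - n + 1, end + 1)));
            }
            positions.add(tokens);
        }
        return positions;
    }

    // Scenario 2: each n-gram is stored at the position of its FIRST word.
    static List<List<String>> byFirstWord(String[] words, int maxN) {
        List<List<String>> positions = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            List<String> tokens = new ArrayList<>();
            for (int n = 1; n <= maxN && start + n <= words.length; n++) {
                tokens.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
            }
            positions.add(tokens);
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] words = "united states of america".split(" ");
        System.out.println("Scenario 1: " + byLastWord(words, 3));
        System.out.println("Scenario 2: " + byFirstWord(words, 3));
    }
}
```

Both layouts hold the same set of tokens; they differ only in which position each multi-word token is attached to.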
-----Original Message-----
From: Sebastian Marius Kirsch [mailto:[EMAIL PROTECTED]]
Sent: Sunday, July 24, 2005 12:45 PM
To: java-user@lucene.apache.org
Subject: Re: n-gram indexing

Hi Rajeev,

I wrote a filter for generating n-grams a while back; I intended to use it for statistics, but I guess you can also use it for search. I also thought of the "boosting effect" you describe when I implemented it, though I never actually tested whether it works that way.

It's in the Lucene bugzilla section:

http://issues.apache.org/bugzilla/show_bug.cgi?id=35456

Let me know if you have any problems with it.

If Lucene's phrase queries honour position increments correctly, they should work out of the box with my code. (My code adds the unigram first with a position increment of 1, and then all n-grams starting with that unigram with an increment of 0.) Just make sure you use the same Analyzer for the queries as you do for the indexing, so that n-grams are added to the query as well.

As for the suggestion to use a very sloppy phrase or span query instead: I expect the n-gram approach to be faster, as it would only use TermQuery/BooleanQuery. You basically trade index size for search speed.

If you get this to work with phrase queries, please add a test case demonstrating it.

I saw something about n-gram queries in Nutch, as far as I remember; does anyone know whether Nutch uses n-grams to speed up phrase queries?

Regards,
Sebastian

On Mon, Jul 18, 2005 at 02:27:28PM -0700, Rajesh Munavalli wrote:
> At what point do I add n-grams? Does the order in which I add n-grams
> affect exact phrase queries later? My questions are
>
> (1) Should I add all the 1-grams followed by 2-grams followed by
> 3-grams..etc sentence by sentence OR
>
> (2) Add all the 1 grams of entire document first before starting
> 2-grams for the entire document?

--
Sebastian Kirsch <[EMAIL PROTECTED]> [http://www.sebastian-kirsch.org/]

NOTE: New email address! Please update your address book.
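The increment scheme described in the message above (unigram with an increment of 1, then every longer n-gram starting at that word with an increment of 0) can be sketched in plain Java as follows. This is not the code attached to the bugzilla entry; the `Tok` class and method names are hypothetical, and the sketch assumes the usual convention that the running position starts at -1, so that a first increment of 1 lands on position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: models a token stream as (text, position increment) pairs.
class IncrementSketch {

    // Hypothetical token record: the token text plus its position increment.
    static final class Tok {
        final String text;
        final int increment;
        Tok(String text, int increment) {
            this.text = text;
            this.increment = increment;
        }
    }

    // Emit the unigram with increment 1, then the n-grams starting there with increment 0.
    static List<Tok> emit(String[] words, int maxN) {
        List<Tok> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            for (int n = 1; n <= maxN && start + n <= words.length; n++) {
                String text = String.join(" ", Arrays.copyOfRange(words, start, start + n));
                out.add(new Tok(text, n == 1 ? 1 : 0));
            }
        }
        return out;
    }

    // Accumulate increments into absolute positions (assumed to start at -1).
    static int[] absolutePositions(List<Tok> toks) {
        int[] positions = new int[toks.size()];
        int current = -1;
        for (int i = 0; i < toks.size(); i++) {
            current += toks.get(i).increment;
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Tok> toks = emit("united states of america".split(" "), 3);
        int[] pos = absolutePositions(toks);
        for (int i = 0; i < pos.length; i++) {
            System.out.println(pos[i] + "\t" + toks.get(i).text);
        }
    }
}
```

Under these assumptions every n-gram ends up at the position of its first word, which corresponds to Scenario 2 in the question at the top of this thread.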
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]