Hi, Sorry if you get this mail second time. Having some trouble with mail box.
Is there a query in Lucene which matches sub phrases ? For example if the document text is "new york existing homes *3 bed 2 bath*homes 3 miles from city center 2 rooms" and if user enters "Brooklyn homes with *3 bed *rooms and swimming pools", I would like to recognize the fact the the document contained a sub prefix of the user query and give it more score compared to a document which contained all the terms, but not in phrases, for example " new york *2 bed 3 bath"* (should not match). Also note that I do not want the *3 *and *2 *in *3 *miles and *2* rooms to affect the query when I have a sub phrase match. This is something of a middle ground between pure 'boolean OR' query and an 'exact phrase query'. Here the more the number of sub phrases and their lengths, the higher the score compared to individual term matches. I was redirected to Shingle filter which is a token filter that spits out n-grams. But it does not seem to be best solution since one does not know in advance what n in n-grams should be. Also it means one has to get all these bi grams and then construct a boolean OR query which is not very efficient either. Anyone has tried any alternate approaches ? Appreciate your thoughts. The use case is for user entered queries when one does interpret or parse them, but to ensure that the better sub phrases user enters, the better results he gets. Thanks Preetam