> From: Lee Mallabone [mailto:[EMAIL PROTECTED]]
>
> > The index does not store the byte-position of words in the
> > original document.
>
> Does that rule out the potential to implement proximity operators?
> I need to implement NEAR (and then SAME for paragraph searches), but
> I'm a novice in terms of search engine implementations. Am I likely
> to be out of my depth attempting that right now with Lucene?

Lucene does not directly support paragraph-based searching. It does
support proximity searches: exact phrases, and matches within N words
of each other (slop). Please see the documentation for PhraseQuery,
especially the setSlop(int) method:

  http://jakarta.apache.org/lucene/api/org/apache/lucene/search/PhraseQuery.html

Phrase slop is thus essentially WITHIN. The QueryParser class does not
yet have a syntax to specify slop.

> > Perhaps we should add a utility method such as:
> >
> >   public static Set getHitTokens(Set queryTerms, Reader text, Analyzer a)
>
> This looks good, but what about the (future) case where you have
> complex (possibly nested) proximity searches and only want to
> highlight the relevant tokens when they appear near each other?

As you point out, the method I suggest would highlight isolated
occurrences of terms from query phrases in hit documents, even where
those terms do not occur as part of a phrase. (Note that for the
document to be a hit, the terms must also occur together in a phrase
somewhere, and possibly quite frequently for a high-scoring hit.)
Google and most other search engines implement term highlighting this
way, and I think it is acceptable.

One could of course write a TokenStream-based query evaluator that
correctly interprets phrasal restrictions when highlighting.
Personally, I do not think it is worth the effort, so I am not
volunteering to do it myself.

Doug

_______________________________________________
Lucene-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/lucene-dev
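
The getHitTokens utility proposed in the message above could be sketched
roughly as follows. This is only an illustration under stated
assumptions, not Lucene code: the HitTokens and Hit classes are
hypothetical, a simple regex tokenizer with lowercasing stands in for
the Analyzer argument, and the method returns a List with character
offsets rather than the Set in the proposed signature, so that repeated
occurrences can each be highlighted.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the proposed utility; not part of Lucene.
class HitTokens {

    // One matched token: the normalized term plus its character offsets
    // (start inclusive, end exclusive) in the original text.
    static final class Hit {
        final String term;
        final int start;
        final int end;

        Hit(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + " [" + start + "," + end + ")";
        }
    }

    // Assumption: tokens are runs of letters and digits, lowercased.
    // A real implementation would delegate this to an Analyzer.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Returns every token in the text whose normalized form is a query
    // term. Each term is matched in isolation, so occurrences outside
    // any phrase are reported too, as described in the message.
    static List<Hit> getHitTokens(Set<String> queryTerms, Reader text) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = text.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        List<Hit> hits = new ArrayList<>();
        Matcher m = TOKEN.matcher(sb);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (queryTerms.contains(token)) {
                hits.add(new Hit(token, m.start(), m.end()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("search", "engine"));
        String doc = "A search engine indexes text; the engine ranks hits.";
        for (Hit hit : getHitTokens(terms, new StringReader(doc))) {
            System.out.println(hit);
        }
    }
}
```

For the phrase query "search engine", this sketch reports the second,
isolated "engine" as a hit as well. That is the trade-off discussed in
the message: simple per-term highlighting, with no attempt to honor
phrase or slop restrictions.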
