Marvin Humphrey wrote on 3/21/10 3:07 PM:
> On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
>> Marvin, please have a look when you have a chance, and let me know what needs
>> changing.
>
> The current implementation has a limitation I think is probably pretty
> important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.
>

As you noted earlier in this thread, there is no consensus about what a
proximity query is. :)

I did consider the fact that proximity might imply that order does not matter.
But I came down here: if I want order to matter, and the ProximityScorer
ignores order as you're suggesting, then I have no options. I can't limit my
search to 'a NEAR b'. If instead we leave the ProximityScorer as is, then this:

  (a NEAR b) OR (b NEAR a)

does what you're describing.

Consider too:

  (a NEAR b NEAR c)

which might be written as:

  "a b c"~10

What order should I consider there? 'a' within 10 positions of 'b' and 'c'? or
'b' within 10 positions of 'a' and 'c'? or... You see how the possibilities
multiply.

I think simpler is better here: if you want order to not matter, then OR
together the various orders you might be interested in. In fact, I may offer
that as an option in the Search::Query::Parser, which could then do the ORing
programmatically. Likewise, if we choose to support the "a b"~N syntax in the
KS QueryParser, we could do something similar.

I note that one of the Lucene classes you mentioned earlier[0] makes inOrder an
option. The Lucene PhraseScorer's slop feature, however, does seem to respect
order with no option otherwise.

[0] http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/spans/SpanNearQuery.java

>
> Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
> but "proximity" is better) instead of "near" for the name of that parameter.
> Or alternately, "slop", but I understand why you went with nearness instead.

I like 'proximity' for consistency's sake. And yes, 'near' is not quite right.
How about 'within'? Or 'vicinity'?

>
>> In the end it was a one-line difference in the SI_winnow_anchors
>> implementation to get the near/slop feature working. I left the original
>> implementation intact and put a switch/case wrapper around it to leave the
>> optimization (if any) intact for phrases (near==1).
>
> This doesn't technically need changing, but to cut down on the duplicated
> code, the switch on self->near should theoretically happen here:

ah yes, that's much better.

-- 
Peter Karman  .  http://peknet.com/  .  [email protected]
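
To make the "OR together the various orders" idea concrete, here is a minimal
sketch in Python. It is not KinoSearch or Lucene code. The function names
(index_positions, ordered_near, unordered_near) are hypothetical, and the
matching rule is an assumption chosen for illustration: each term must occur
after the previous one, at most `within` positions later.

```python
from itertools import permutations


def index_positions(tokens):
    """Map each token to the sorted list of positions where it occurs."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    return positions


def ordered_near(positions, terms, within):
    """True if the terms occur in the given order, each one after the
    previous term and at most `within` positions later (assumed rule)."""

    def extend(prev, rest):
        if not rest:
            return True
        for p in positions.get(rest[0], ()):
            if prev < p <= prev + within and extend(p, rest[1:]):
                return True
        return False

    first, *rest = terms
    return any(extend(p, rest) for p in positions.get(first, ()))


def unordered_near(positions, terms, within):
    """OR together every ordering of the terms, as a query parser might."""
    return any(
        ordered_near(positions, list(perm), within)
        for perm in permutations(terms)
    )


doc = index_positions(["b", "x", "a"])
a_near_b = ordered_near(doc, ["a", "b"], 2)    # 'a' never precedes 'b' here
b_near_a = ordered_near(doc, ["b", "a"], 2)    # 'b' at 0, 'a' at 2
either = unordered_near(doc, ["a", "b"], 2)    # (a NEAR b) OR (b NEAR a)
```

With an order-respecting scorer, the unordered case is just a wrapper, while
the reverse (recovering ordered matching from an order-ignoring scorer) is not
possible. Note that the number of orderings grows factorially with the number
of terms, so this is only practical for short queries.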
