Re: ProximityQuery

Marvin Humphrey Sun, 21 Mar 2010 13:07:38 -0700

On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
> Marvin, please have a look when you have a chance, and let me know what needs
> changing.


The current implementation has a limitation I think is probably pretty
important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.

I committed a TODO test illustrating the problem.  I think to fix it you're
going to need to...

   * Change the arrays from unsigned to signed so that subtraction doesn't
     produce any surprises.
   * Work with absolute values, so that (2 - 3) and (3 - 2) both produce a
     proximity of 1.
   * Rework both SPIN_CANDIDATES and SPIN_ANCHORS so that they bring the 
     elements in question "within range" instead of testing for ">=".  In
     other words, if "near" is 1 and the current anchor is 5, SPIN_CANDIDATES
     should stop at 4 rather than keep going until the candidate is at least 5
     as it does now.

It's going to be tricky business to get right.  You'll probably need a battery
of tests to cover the edge cases.

Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
but "proximity" is better) instead of "near" for the name of that parameter.
Or alternately, "slop", but I understand why you went with nearness instead.

> In the end it was a one-line difference in the SI_winnow_anchors 
> implementation
> to get the near/slop feature working. I left the original implementation 
> intact
> and put a switch/case wrapper around it to leave the optimization (if any)
> intact for phrases (near==1).

This doesn't technically need changing, but to cut down on the duplicated
code, the switch on self->near should theoretically happen here:

--- ../core/KSx/Search/ProximityScorer.c    (revision 5936)
+++ ../core/KSx/Search/ProximityScorer.c    (working copy)
@@ -352,8 +352,14 @@
         
         // Splice out anchors that don't match the next term.  Bail out if
         // we've eliminated all possible anchors.
-        anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
-            candidates_start, candidates_end, i, self->near);
+        if (self->near == 1) { // optimized case
+            anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
+                candidates_start, candidates_end, i, 1);
+        }
+        else { // punt case
+            anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
+                candidates_start, candidates_end, i, self->near);
+        }
         if (!anchors_remaining) { return 0.0f; }
 
         // Adjust end for number of anchors that remain. 


Note the compile-time constant "1" being passed to the static inline function
SI_winnow_anchors.  That's sufficient for an optimizing compiler to look into
the body of SI_winnow_anchors and replace all instances of the variable with
the compile-time constant, potentially finding optimizations for that one
inline use.  At least in theory, it's not necessary to create
SI_winnow_anchors1 and SI_winnow_anchorsN.

It's difficult to verify that the compiler exploited the intended
optimization, though, because you need to look at the assembler.  Hard to
write a test case for that. 

Marvin Humphrey

Re: ProximityQuery

Reply via email to