On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
> Marvin, please have a look when you have a chance, and let me know what needs
> changing.
The current implementation has a limitation I think is probably pretty
important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.
I committed a TODO test illustrating the problem. I think to fix it you're
going to need to...
* Change the arrays from unsigned to signed so that subtraction doesn't
produce any surprises.
* Work with absolute values, so that (2 - 3) and (3 - 2) both produce a
proximity of 1.
* Rework both SPIN_CANDIDATES and SPIN_ANCHORS so that they bring the
elements in question "within range" instead of testing for ">=". In
other words, if "near" is 1 and the current anchor is 5, SPIN_CANDIDATES
should stop at 4 rather than keep going until the candidate is at least 5
as it does now.
It's going to be tricky business to get right. You'll probably need a battery
of tests to cover the edge cases.
Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
but "proximity" is better) instead of "near" for the name of that parameter.
Or alternately, "slop", but I understand why you went with nearness instead.
> In the end it was a one-line difference in the SI_winnow_anchors
> implementation
> to get the near/slop feature working. I left the original implementation
> intact
> and put a switch/case wrapper around it to leave the optimization (if any)
> intact for phrases (near==1).
This doesn't technically need changing, but to cut down on the duplicated
code, the switch on self->near should theoretically happen here:
--- ../core/KSx/Search/ProximityScorer.c (revision 5936)
+++ ../core/KSx/Search/ProximityScorer.c (working copy)
@@ -352,8 +352,14 @@
// Splice out anchors that don't match the next term. Bail out if
// we've eliminated all possible anchors.
- anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
- candidates_start, candidates_end, i, self->near);
+ if (self->near == 1) { // optimized case
+ anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
+ candidates_start, candidates_end, i, 1);
+ }
+ else { // punt case
+ anchors_remaining = SI_winnow_anchors(anchors_start, anchors_end,
+ candidates_start, candidates_end, i, self->near);
+ }
if (!anchors_remaining) { return 0.0f; }
// Adjust end for number of anchors that remain.
Note the compile-time constant "1" being passed to the static inline function
SI_winnow_anchors. That's sufficient for an optimizing compiler to look into
the body of SI_winnow_anchors and replace all instances of the variable with
the compile-time constant, potentially finding optimizations for that one
inline use. At least in theory, it's not necessary to create
SI_winnow_anchors1 and SI_winnow_anchorsN.
It's difficult to verify that the compiler exploited the intended
optimization, though, because you need to look at the assembler. Hard to
write a test case for that.
Marvin Humphrey