[PATCH] D44720: [clangd] Simplify fuzzy matcher (sequence alignment) by removing some condition checks.

Sam McCall via Phabricator via cfe-commits Thu, 29 Mar 2018 11:11:28 -0700

sammccall added inline comments.


================
Comment at: clangd/FuzzyMatch.cpp:96
     return None;
   return ScoreScale * std::min(PerfectBonus * PatN, std::max<int>(0, Best));
 }
----------------
MaskRay wrote:
> sammccall wrote:
> > MaskRay wrote:
> > > I also don't understand why it clamps the value to zero here. Negative 
> > > values are also meaningful to me. Given that perfectBonus is only 3 it is 
> > > very easy to get a negative value here.
> > An important part of the contract of `match()` is that it returns a value 
> > in `[0,1]`.
> > We rely on this range to combine this with other scoring signals - we 
> > multiply this by a quality signal in code completion.
> > (Currently the quality signal is just derived from Sema, but the global 
> > index will provide the number of uses).
> > 
> > It would be possible to use a different squash function here, but I found 
> > max(kFloor,x) worked well for the examples I looked at - anything <= some 
> > floor value was "not really a useful match at all", and most of the 
> > variance below the floor seemed to be noise to me.
> > (Then I tuned the bonuses/penalties so the floor was at zero)
> We could try other criteria in the future. I believe the current one can be 
> improved because negative scores may be returned but the scoring shouldn't 
> return 0 for all the cases.
Sure, we can try other things, and to gather more data.
(To be clear though - with the data I *did* look at, including the scores <0 
did not add more information, only noise)


================
Comment at: clangd/FuzzyMatch.cpp:230
 void FuzzyMatcher::buildGraph() {
+  Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss};
   for (int W = 0; W < WordN; ++W) {
----------------
MaskRay wrote:
> sammccall wrote:
> > why this change?
> > this has also been moved from the cheaper constructor to the more expensive 
> > per-match call. (also the diagonal assignment added in the next loop)
> > 
> > Also, shouldn't [0][0][Match] be AwfulScore?
> > 
> "The more expensive per-match call" is just two value assignments.
> 
> I have removed the expensive table initialization in the constructor.
> 
> [0][0][*] can be any value.
> "The more expensive per-match call" is just two value assignments.
Oops, sorry - by "more expensive" I mean "called thousands of times more often".

> I have removed the expensive table initialization in the constructor.
I don't want to be rude, but I asked why you changed this, and you didn't 
answer. Unless there's a strong reason, I'd prefer to revert this change, as I 
find this harder to reason about.
(Roughly: in the old version of the code, any data that didn't need to change 
for the life of the object was initialized in the constructor. That way I 
didn't need to worry what was performance-critical and what wasn't - match() 
only did what was strictly necessary).

> [0][0][*] can be any value.
Can you please explain why?


================
Comment at: clangd/FuzzyMatch.cpp:325
+  int W = PatN;
+  for (int I = PatN; ++I <= WordN; )
+    if (Scores[PatN][I][Match].Score > Scores[PatN][W][Match].Score)
----------------
MaskRay wrote:
> sammccall wrote:
> > nit: I -> P, move increment to the increment expression of the for loop?
> I -> P.
> 
> > move increment to the increment expression of the for loop?
> 
> Not sure about the coding standard here, but if you insist I'll have to 
> change it as you are the reviewer. If the loop variable was an iterator, `for 
> (It I = std::next(...); I != E; ++I)` would be uglier than `for (It I = ...; 
> ++I != E; )`
Uglier is subjective, but side-effects in the condition of a for-loop is 
sufficiently unusual and surprising that I'd prefer to avoid it in both cases.


================
Comment at: clangd/FuzzyMatch.cpp:340
+  A[WordN] = Miss;
+  for (int I = WordN, P = PatN; I > 0; I--) {
+    if (I == W)
----------------
MaskRay wrote:
> sammccall wrote:
> > sammccall wrote:
> > > sammccall wrote:
> > > > W is the right name in this file for a variable iterating over word 
> > > > indices, please don't change this.
> > > > The new variable above could be EndW or so?
> > > As far as I can see, this loop is setting `A[W+1:...] = Miss` and 
> > > populating `A[0...W]` with the exsting logic.
> > > I think this would be clearer as two loops, currently there's a lot of 
> > > conditionals around Last that obscure what's actually happening.
> > You've shifted P (and the old W, new I) by 1. This does reduce the number 
> > of +1 and -1 in this function, but it's inconsistent with how these are 
> > used elsewhere: P should be the index into Pat of the character that we're 
> > considering.
> I don't understand the rationale not to use the shifted indices. The code 
> actually use `Scores[P][W][*]` to mean the optimal match of the first `P` 
> characters of the pattern with the first `W` characters of the word, not the 
> position of the character.
> 
> On the other hand, C++ reverse iterators use the shifted one for `for (I = 
> rend(); I != rbegin(); ++I)`. The shifted one makes ending condition check 
> easier.
> I don't understand the rationale not to use the shifted indices
The rationale is entirely consistency with the surrounding code. The 
consistency helps avoid off-by-one errors when similar loops have different 
conventions.

In this file, when looping over word or pattern dimensions, P and W 
respectively are used for loop variables, and can be interpreted as indices 
into Pat/Word.
Here the interpretation would be "did we match or miss character Word[W]"


================
Comment at: clangd/FuzzyMatch.cpp:93
+  for (int I = PatN; I <= WordN; I++)
+    Best = std::max(Best, Scores[PatN][I][Match].Score);
   if (isAwful(Best))
----------------
MaskRay wrote:
> sammccall wrote:
> > MaskRay wrote:
> > > sammccall wrote:
> > > > this looks like a behavior change - why?
> > > This is a behavior change. Instead of choosing between `Match/Miss` in 
> > > the last position, we enumerate the last matching position in `Word`.
> > > 
> > > This saves `if (P < PatN - 1) {` check in the main loop at the cost of a 
> > > for loop here (use sites of ending values)
> > Ah, I see - the case where we match only part of the word is handled up 
> > here now.
> > (I think you mean this is not a behavior change? The result is the same 
> > AFAICS)
> > 
> > That does make more sense, but it's pretty subtle.
> > Can you add a comment like
> >  `// The pattern doesn't have to match the whole word (but the whole 
> > pattern must match).`
> Added
> ```
>   // Find the optimal prefix of Word to match Pattern.
> ```
> 
> I meant this is a behavior change but it makes the first row and the rest 
> rows of the score table more consistent.
That comment really doesn't capture what's significant about this line - it's 
the *policy*, rather than the mechanism, that needs highlighting here.

(Re: behavior change - I *think* there's no inputs for which we produce a 
different match result/score because of this patch, right?)


================
Comment at: clangd/FuzzyMatch.h:61
   bool allowMatch(int P, int W) const;
-  int skipPenalty(int W, Action Last) const;
-  int matchBonus(int P, int W, Action Last) const;
+  int missScore(int W, Action Last) const;
+  int matchScore(int P, int W, Action Last) const;
----------------
MaskRay wrote:
> MaskRay wrote:
> > sammccall wrote:
> > > FWIW, I don't think this is an improvement - I think the clarity of 
> > > purpose in names is more useful than having consistent signs in this case.
> > Keep `matchBonus` but rename `skipPenalty` to `missPenalty` ?
> Also note in the original scheme, the skip score does not need to be 
> negative. Because Scores[PatN][WordN][] is used and each path takes the same 
> number of positions (WordN). We can add an offset to all positional scores to 
> make all of them non-negative. In the new scheme it does make sense to make 
> them negative, as each path has now different number of positions.
missPenalty SGTM.

(I don't see any particular reason to avoid negative numbers here - it has a 
natural interpretation: a positive increment means the match is better than if 
that action wasn't taken, negative means it's worse, etc)


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D44720



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D44720: [clangd] Simplify fuzzy matcher (sequence alignment) by removing some condition checks.

Reply via email to