sammccall marked an inline comment as done.
sammccall added a comment.

**TL;DR**: i'm no longer convinced of my conclusions for short-query, look at 
the proposal below.

In https://reviews.llvm.org/D52808#1254899, @ioeric wrote:

> > Generate more short-query trigrams, e.g. "AbcDefGhi" now yields "d" and 
> > "ag".
>
> I am concerned about the impact on the size of posting lists (we can measure) 
> and retrieval quality by adding more incomplete trigrams.


There is indeed an impact here. Before: 22309480, After: 22531144.
About 1/3 of this is posting lists, so 1% to overall, 3% on posting lists.
I haven't measured quality (don't have a good way to do that for dex 
currently). It's true that the new results we're admitting are probably worse 
as they don't start with the right letter.

>> This is effectively required by LSP, having "ag" not match but "agh" match 
>> will lead to glitches due to client-side filtering.
> 
> It seems hard to make index behave the same as LSP clients, considering all 
> the ranking signals we have; there can be other reasons that cause `having 
> "ag" not match but "agh" match`.

No, there's only one such reason: we truncated the result list (the symbol 
matched, but it wasn't one of the best N).
This reason is accounted for by LSP: client-side filtering only occurs when 
there was no truncation.

> And it seems reasonable to assume that you would get better symbols as you 
> type more characters (i.e. stronger relevance signal).

The problem is LSP clients are free to assume that the result list is complete 
(unless marked as incomplete) and therefore will never retrieve the better 
symbols.

>> Drop leading-punctuation short-query trigrams. Nice idea, but current 
>> implementation is broken (competes with non-punctuation short query 
>> trigrams).
> 
> Could you elaborate how this is broken? We should probably fix it instead of 
> removing it. `__` not matching `__some_macro` sounds like a regression.

`fuzzyFind` considers `_ptr` to match `unique_ptr`, so it needs to consider `_` 
to match it too.

We will still match `__some_macro`, but the posting lists will match 
~everything and then postfilter. It may make sense to do optimizations or tweak 
overall trigram generation for this case, but the way it's currently done seems 
unprincipled and a little broken.

---

OK, so after some thought on the tradeoffs here, what about this alternative 
design:

- for query length <3, we support really restrictive matches: exact prefix, or 
first char + next head char. We
- we get around the LSP restrictions by always marking such result sets as 
incomplete


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D52808



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to