Re: FST codec for infix queries. No luck so far.

Michael Sokolov Tue, 26 Apr 2022 12:46:04 -0700

I'm not sure in which scenario n-grams (edge n-grams) would not be an
option. Another thing to try might be something like BPE (byte pair
encoding). In this encoding, you train a set of tokens from a
vocabulary based on frequency of occurrence, and agglomerate them
iteratively until the vocabulary reaches a size you like. You tend
to end up with commonly occurring subwords (morphemes) that might
be good indexing choices for this sort of thing.
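
To make the BPE idea concrete, here is a minimal training sketch in
Python. It is only an illustration, not Lucene code: the function names
(train_bpe, merge_pair, segment) and the {word: frequency} input format
are assumptions made up for this example, and ties between equally
frequent pairs are broken arbitrarily (here: lexicographically).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    # Every word starts out as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; the pair itself is the (arbitrary) tiebreak.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split `word` into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

The tokens returned by segment() would then be what gets indexed, in
place of (or next to) fixed-size n-grams. The number of merges controls
the vocabulary size.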


On Tue, Apr 26, 2022 at 9:07 AM Michael McCandless
<luc...@mikemccandless.com> wrote:
>
> One small datapoint: Amazon's customer facing product search now includes 
> some infix suggestions (using Lucene's AnalyzingInfixSuggester), but only in 
> fallback cases when the prefix suggesters didn't find compelling options.
>
> And I think Netflix's suggester used to be primarily infix, but now when I 
> tested it, I get no suggestions at all, only live search results, which I 
> like less :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 26, 2022 at 8:13 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>
>> Hi Mikhail,
>>
>> I don't have any spectacular suggestions, just a few observations
>> stemming from experience.
>>
>> 1) While the problem is intellectually interesting, I rarely found
>> anybody who'd be comfortable with using infix suggestions - people are
>> very used to "completions" happening on a prefix of one or multiple
>> words (see my note below, though).
>>
>> 2) Wouldn't it be better/more efficient to maintain an FST/index of
>> word suffix(es) -> complete word instead of offsets within the block?
>> This can be combined with term frequency to limit the number of
>> suggested words to just certain categories (or most frequent terms),
>> which would make the FST smaller still.
>>
>> 3) I'd never try to store infixes shorter than 2 or 3 characters (you
>> said you did this - "I even limited suffixes length to reduce their
>> number"). This requires folks to type in longer input but prevents FST
>> bloat and in general leads to higher-quality suggestions (since
>> very short infixes would match so many terms).
>>
>> > Otherwise, with many smaller segments fully scanning term dictionaries is 
>> > comparable to seeking suffixes FST and scanning certain blocks.
>>
>> Yeah, I'd expect the automaton here to be huge. The complexity of the
>> vocabulary and number of characters in the language will also play a
>> key role.
>>
>> 4) IntelliJ IDEA has this kind of "search everywhere" functionality
>> which greps for infixes (it is really nice). I recall looking at the
>> (open source) engine to see how it was done and my conclusion from
>> glancing over the code was that it's a fixed, coarse, n-gram based
>> index of consecutive letters pointing at potential matches, which are
>> then revalidated against the query. So you have a super-simple index,
>> with a very fast lookup and the cost of verifying and finding exact
>> matches is shifted to once you have a candidate list. While this
>> doesn't help with Lucene indexes, perhaps it's a sign that for this
>> particular task a different index/search paradigm is needed?
>>
>>
>> Dawid
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
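
Dawid's points 2) and 3) above (suffix -> complete word, a minimum
suffix length, and a term-frequency cutoff) can be sketched roughly as
follows. This is a toy stand-in, not an FST and not Lucene code: a
sorted list with binary search plays the role of the FST, and the names
build_suffix_index and infix_lookup, as well as the min_len and min_freq
parameters, are assumptions invented for this example.

```python
import bisect


def build_suffix_index(term_freqs, min_len=3, min_freq=1):
    """Return a sorted list of (suffix, word) pairs.

    Suffixes shorter than `min_len` are not stored, and words with a
    frequency below `min_freq` are skipped entirely.
    """
    entries = set()
    for word, freq in term_freqs.items():
        if freq < min_freq:
            continue
        for start in range(len(word) - min_len + 1):
            entries.add((word[start:], word))
    return sorted(entries)


def infix_lookup(index, infix):
    """Find all words that have a stored suffix starting with `infix`."""
    # (infix,) sorts before every (infix..., word) pair, so this lands on
    # the first stored suffix that is >= infix.
    pos = bisect.bisect_left(index, (infix,))
    words = set()
    for suffix, word in index[pos:]:
        if not suffix.startswith(infix):
            break  # matching suffixes are contiguous in sorted order
        words.add(word)
    return sorted(words)
```

An infix of a word is always a prefix of one of its suffixes, so a
prefix lookup over suffixes is enough. Raising min_len or min_freq
shrinks the index at the cost of recall, which is the trade-off
described in 3).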

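The approach described in 4) above (a coarse n-gram index that yields
candidates, which are then revalidated against the query) can be
sketched as below. This is not the IntelliJ implementation and not
Lucene code, just a minimal illustration of the idea. The class name
NGramInfixIndex, the choice of n=3, and the fallback to a full scan for
queries shorter than n are all assumptions made for this example.

```python
from collections import defaultdict


class NGramInfixIndex:
    """Toy n-gram index: cheap candidate lookup, then exact verification."""

    def __init__(self, terms, n=3):
        self.n = n
        self.terms = list(terms)
        # n-gram -> ids of the terms that contain it
        self.postings = defaultdict(set)
        for term_id, term in enumerate(self.terms):
            for gram in self._grams(term):
                self.postings[gram].add(term_id)

    def _grams(self, text):
        """All distinct n-grams of consecutive characters in `text`."""
        return {text[i:i + self.n] for i in range(len(text) - self.n + 1)}

    def search(self, query):
        grams = self._grams(query)
        if grams:
            # A term can only match if it contains every n-gram of the query.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        else:
            # Query shorter than n: no n-grams to look up, so scan everything.
            candidates = range(len(self.terms))
        # Candidates may be false positives (right n-grams, wrong order or
        # position), so each one is verified with a plain substring check.
        return sorted(self.terms[i] for i in candidates if query in self.terms[i])
```

The index itself stays small and simple. The cost of finding exact
matches is paid only on the candidate list, which is usually much
shorter than the full term dictionary.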
