I'm not sure under which scenario n-grams (edge n-grams) would not be an
option? Another thing to try might be BPE (byte pair encoding). With
this encoding, you train a vocabulary of tokens based on frequency of
occurrence, iteratively merging the most frequent adjacent pairs until
the vocabulary reaches a size you like. You tend to end up with
commonly occurring subwords (morphemes) that could plausibly be good
indexing choices for this sort of thing.
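
For concreteness, here's a rough sketch of the BPE training loop in
plain Java (the class name and the tiny corpus are made up, and real
implementations also record the merge order so the tokenizer can be
replayed at index time):

import java.util.*;

class BpeSketch {
  // Train a BPE vocabulary from a list of words (duplicates = frequency).
  static Set<String> train(List<String> corpus, int vocabSize) {
    // Each word starts out as a sequence of single-character tokens.
    List<List<String>> words = new ArrayList<>();
    Set<String> vocab = new TreeSet<>();
    for (String w : corpus) {
      List<String> toks = new ArrayList<>();
      for (char c : w.toCharArray()) {
        toks.add(String.valueOf(c));
        vocab.add(String.valueOf(c));
      }
      words.add(toks);
    }
    while (vocab.size() < vocabSize) {
      // Count every adjacent token pair across the corpus.
      Map<List<String>, Integer> pairs = new HashMap<>();
      for (List<String> toks : words) {
        for (int i = 0; i + 1 < toks.size(); i++) {
          pairs.merge(List.of(toks.get(i), toks.get(i + 1)), 1, Integer::sum);
        }
      }
      if (pairs.isEmpty()) break; // nothing left to merge
      // Agglomerate: the most frequent pair becomes a new vocabulary token.
      List<String> best =
          Collections.max(pairs.entrySet(), Map.Entry.comparingByValue()).getKey();
      String merged = best.get(0) + best.get(1);
      vocab.add(merged);
      for (List<String> toks : words) {
        for (int i = 0; i + 1 < toks.size(); ) {
          if (toks.get(i).equals(best.get(0))
              && toks.get(i + 1).equals(best.get(1))) {
            toks.set(i, merged);
            toks.remove(i + 1);
          } else {
            i++;
          }
        }
      }
    }
    return vocab; // frequent subwords ("morphemes") accumulate here
  }

  public static void main(String[] args) {
    // "low"/"lower"/"lowest"/"slower": expect merges like "lo", "low", "er".
    System.out.println(train(List.of("low", "low", "lower", "lowest", "slower"), 12));
  }
}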

On Tue, Apr 26, 2022 at 9:07 AM Michael McCandless
<luc...@mikemccandless.com> wrote:
>
> One small datapoint: Amazon's customer-facing product search now includes
> some infix suggestions (using Lucene's AnalyzingInfixSuggester), but only as
> a fallback when the prefix suggesters didn't find compelling options.
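>
> A hedged sketch of that fallback wiring against Lucene's Lookup API (not
> Amazon's actual code, and the class/method names are invented): run the
> prefix suggester first, and only consult the infix suggester when the
> prefix pass comes back empty.
>
> import java.io.IOException;
> import java.util.List;
> import org.apache.lucene.search.suggest.Lookup;
> import org.apache.lucene.search.suggest.Lookup.LookupResult;
>
> class FallbackSuggest {
>   // Prefix suggester first; infix (e.g. AnalyzingInfixSuggester) only as
>   // the fallback when the prefix pass finds nothing compelling.
>   static List<LookupResult> suggest(Lookup prefix, Lookup infix,
>                                     String query, int num) throws IOException {
>     List<LookupResult> hits = prefix.lookup(query, false, num);
>     return hits.isEmpty() ? infix.lookup(query, false, num) : hits;
>   }
> }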
>
> And I think Netflix's suggester used to be primarily infix, but when I
> tested it just now, I got no suggestions at all, only live search results,
> which I like less :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 26, 2022 at 8:13 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>
>> Hi Mikhail,
>>
>> I don't have any spectacular suggestions, but here are a few things
>> stemming from experience.
>>
>> 1) While the problem is intellectually interesting, I've rarely found
>> anybody who'd be comfortable using infix suggestions - people are very
>> used to "completions" happening on a prefix of one or multiple words
>> (see my note below, though).
>>
>> 2) Wouldn't it be better/more efficient to maintain an FST/index of
>> word suffix(es) -> complete word instead of offsets within the block?
>> This can be combined with term frequency to limit the suggested words
>> to just certain categories (or the most frequent terms), which would
>> make the FST smaller still.
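>>
>> A rough sketch of what I mean, hedged: this assumes the pre-9.0
>> org.apache.lucene.util.fst.Builder API (newer versions use FSTCompiler),
>> buildSuffixFst is just an illustrative name, and the byte-sorted input
>> requirement is only satisfied here for ASCII terms. Every suffix of
>> length >= 3 (see point 3) maps to the ordinal of the most frequent word
>> that ends with it, which also resolves duplicate-suffix collisions:
>>
>> import java.util.List;
>> import java.util.Map;
>> import java.util.TreeMap;
>> import org.apache.lucene.util.BytesRef;
>> import org.apache.lucene.util.IntsRefBuilder;
>> import org.apache.lucene.util.fst.Builder;
>> import org.apache.lucene.util.fst.FST;
>> import org.apache.lucene.util.fst.PositiveIntOutputs;
>> import org.apache.lucene.util.fst.Util;
>>
>> class SuffixFstSketch {
>>   // words.get(i) has frequency freqs[i]; returns suffix -> word ordinal.
>>   static FST<Long> buildSuffixFst(List<String> words, long[] freqs)
>>       throws Exception {
>>     // FST inputs must be unique and sorted, so resolve collisions first:
>>     // for each suffix, keep only the most frequent word carrying it.
>>     TreeMap<String, Integer> best = new TreeMap<>();
>>     for (int ord = 0; ord < words.size(); ord++) {
>>       String w = words.get(ord);
>>       for (int i = 0; i <= w.length() - 3; i++) { // no suffixes < 3 chars
>>         Integer prev = best.get(w.substring(i));
>>         if (prev == null || freqs[ord] > freqs[prev]) {
>>           best.put(w.substring(i), ord);
>>         }
>>       }
>>     }
>>     Builder<Long> b =
>>         new Builder<>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
>>     IntsRefBuilder scratch = new IntsRefBuilder();
>>     for (Map.Entry<String, Integer> e : best.entrySet()) {
>>       b.add(Util.toIntsRef(new BytesRef(e.getKey()), scratch),
>>             e.getValue().longValue());
>>     }
>>     return b.finish();
>>   }
>> }
>>
>> An exact suffix lookup is then Util.get(fst, new BytesRef("ding")); a
>> real suggester would instead walk the FST along the typed characters
>> and enumerate the completions from there.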
>>
>> 3) I'd never try to store infixes shorter than 2-3 characters (you
>> said you did this - "I even limited suffixes length to reduce their
>> number"). This requires folks to type longer input, but it prevents FST
>> bloat and in general leads to higher-quality suggestions (since
>> otherwise there'd be far too many of them).
>>
>> > Otherwise, with many smaller segments fully scanning term dictionaries is 
>> > comparable to seeking suffixes FST and scanning certain blocks.
>>
>> Yeah, I'd expect the automaton here to be huge. The complexity of the
>> vocabulary and the size of the language's alphabet will also play a
>> key role.
>>
>> 4) IntelliJ IDEA has this kind of "search everywhere" functionality
>> which greps for infixes (it is really nice). I recall looking at the
>> (open source) engine to see how it was done, and my conclusion from
>> glancing over the code was that it's a fixed, coarse, n-gram-based
>> index of consecutive letters pointing at potential matches, which are
>> then revalidated against the query. So you have a super-simple index
>> with a very fast lookup, and the cost of verifying and finding exact
>> matches is deferred until you have a candidate list. While this
>> doesn't help with Lucene indexes directly, perhaps it's a sign that
>> for this particular task a different index/search paradigm is needed?
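>>
>> A toy version of that scheme in plain Java (hypothetical class name),
>> just to make the shape concrete: a fixed trigram -> candidates map, a
>> cheap intersection over the query's trigrams, then exact revalidation:
>>
>> import java.util.*;
>>
>> class TrigramGrep {
>>   private final Map<String, Set<String>> index = new HashMap<>();
>>
>>   void add(String entry) {
>>     String s = entry.toLowerCase(Locale.ROOT);
>>     for (int i = 0; i + 3 <= s.length(); i++) {
>>       index.computeIfAbsent(s.substring(i, i + 3), k -> new HashSet<>())
>>            .add(entry);
>>     }
>>   }
>>
>>   List<String> search(String query) {
>>     String q = query.toLowerCase(Locale.ROOT);
>>     if (q.length() < 3) return List.of(); // too short for the coarse index
>>     // Coarse phase: intersect the candidate sets of each query trigram.
>>     Set<String> candidates = null;
>>     for (int i = 0; i + 3 <= q.length(); i++) {
>>       Set<String> hits = index.getOrDefault(q.substring(i, i + 3), Set.of());
>>       candidates = (candidates == null) ? new HashSet<>(hits) : candidates;
>>       candidates.retainAll(hits);
>>     }
>>     // Verification phase: revalidate each candidate against the query.
>>     List<String> result = new ArrayList<>();
>>     for (String c : candidates) {
>>       if (c.toLowerCase(Locale.ROOT).contains(q)) result.add(c);
>>     }
>>     return result;
>>   }
>> }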
>>
>>
>> Dawid
>>
