You could create an alternate field in which you index the reversed
tokens, using a reverse token filter appended to the main field's
analysis chain.
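
For example, something like this (just a sketch, assuming the
ReverseStringFilter that ships in the analysis-common module; the
whitespace tokenizer is a stand-in for whatever the main field uses):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;
  import org.apache.lucene.analysis.reverse.ReverseStringFilter;

  // Analyzer for the alternate field: same tokenization as the main
  // field, with each token reversed before it is indexed,
  // e.g. "lucene" -> "enecul".
  public final class ReversedFieldAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new WhitespaceTokenizer();
      TokenStream sink = new ReverseStringFilter(source);
      return new TokenStreamComponents(source, sink);
    }
  }

At index time the same text would feed both fields.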

That leaves the question of how you manage that in a service context,
and what kind of query makes use of the two fields?

Otherwise you have to make a lower-level change of some sort, and I'm
not sure where you would do that cleanly.

On Wed, Jul 10, 2019, 7:40 AM Juan Caicedo <juan.caic...@cavorite.com>
wrote:

> I didn’t integrate it directly with a Lucene index. I used the FST class
> (and related utilities) to build a stand-alone spellchecker.
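>
> The construction was roughly as follows (a sketch against the Lucene
> 8.x API, where the FST builder class is still called Builder;
> sortedTerms is a placeholder for the real, pre-sorted term source):
>
>   import org.apache.lucene.util.BytesRef;
>   import org.apache.lucene.util.IntsRefBuilder;
>   import org.apache.lucene.util.fst.Builder;
>   import org.apache.lucene.util.fst.FST;
>   import org.apache.lucene.util.fst.PositiveIntOutputs;
>   import org.apache.lucene.util.fst.Util;
>
>   // Build an FST mapping each dictionary term to its ordinal.
>   String[] sortedTerms = {"apple", "banana", "cherry"}; // pre-sorted
>   PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
>   Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
>   IntsRefBuilder scratch = new IntsRefBuilder();
>   long ord = 0;
>   for (String term : sortedTerms) { // terms must arrive in sorted order
>     builder.add(Util.toIntsRef(new BytesRef(term), scratch), ord++);
>   }
>   FST<Long> dictionary = builder.finish();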
>
> However, the FST for the term dictionary is modified at index time only.
> At query time, we need to create a variation of the Levenshtein
> automaton and match it against the FST accordingly.
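>
> On the query side, the starting point is the standard automaton
> (a sketch; the intersection with the FST is elided, since that's the
> part I still need to adapt):
>
>   import org.apache.lucene.util.automaton.Automaton;
>   import org.apache.lucene.util.automaton.LevenshteinAutomata;
>
>   // Accepts every string within maxEdits (at most 2) of the term.
>   String queryTerm = "example";
>   int maxEdits = 2;
>   LevenshteinAutomata lev =
>       new LevenshteinAutomata(queryTerm, /* withTranspositions */ true);
>   Automaton accept = lev.toAutomaton(maxEdits);
>   // The variation for this technique also builds the automaton of the
>   // reversed term, to be matched against the backward FST.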
>
> I need to take a closer look at the Lucene code to see how we can
> integrate it with the existing modules. I’ll do that this week.
>
> On Tue, Jul 9, 2019 at 03:18 Michael McCandless <luc...@mikemccandless.com>
> wrote:
>
>> This sounds interesting!
>>
>> Did your patch build the FST at index time, so at search time all that
>> was needed was to load the FST from the index directory?
>>
>> Is the FST per-segment?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Jul 9, 2019 at 1:43 AM Juan Caicedo <juan.caic...@cavorite.com>
>> wrote:
>>
>>> Hi Michael,
>>>
>>> I guess that I should have added more details :-)
>>>
>>> The main benefit of the technique is fast approximate search in
>>> large dictionaries, that is, finding all the entries that are within
>>> 1 or 2 edit steps of the query. It's most useful for longer terms
>>> (starting from ~7 characters, IIRC), but in general the technique
>>> can be applied to queries of any length. The main requirement is
>>> that we build the dictionary as an FST, using the backwards
>>> (reversed) keys of the original dictionary.
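>>>
>>> In code that extra requirement is small, something like the
>>> following (originalTerms is a placeholder; note the reversed keys
>>> must be re-sorted, since reversal doesn't preserve lexicographic
>>> order):
>>>
>>>   import java.util.ArrayList;
>>>   import java.util.Collections;
>>>   import java.util.List;
>>>
>>>   // Derive the backward dictionary: reverse every key, then re-sort
>>>   // so the keys can feed the same FST construction as the forward
>>>   // dictionary.
>>>   List<String> backwardKeys = new ArrayList<>();
>>>   for (String term : originalTerms) {
>>>     backwardKeys.add(new StringBuilder(term).reverse().toString());
>>>   }
>>>   Collections.sort(backwardKeys);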
>>>
>>> I initially used the technique to implement a stand-alone
>>> spellchecker, but I think that it can also be used to optimize
>>> Lucene's fuzzy queries (e.g. for the spelling/suggest module).
>>> However, I'll need to look at how it can be integrated with the part
>>> that creates the dictionary.
>>>
>>> I'll take a look at the code this week and I'll try to publish it in
>>> a public repository so that we can discuss it in more concrete
>>> detail.
>>>
>>> On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <msoko...@gmail.com>
>>> wrote:
>>> >
>>> > Juan, that sounds intriguing.
>>> >
>>> > I skimmed the paper trying to understand possible applications of the
>>> > technique. It sounds like efficient approximate (ie with some edits)
>>> > substring search is the main idea? I don't believe such a query exists
>>> > today in Lucene (nor any Suggester as far as I know). It sounds as if
>>> > this would be useful for searching within large strings, eg DNA
>>> > sequences or something like that, and maybe less applicable to typical
>>> > "full text" (ie tokenized) search where the strings being searched are
>>> > relatively shorter - does that sound right?
>>> >
>>> > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <juan.caic...@cavorite.com> wrote:
>>> > >
>>> > > Hello,
>>> > >
>>> > > I've been working on a project for extending LevenshteinAutomata and
>>> > > I'd like to know if it would be useful to add it to Lucene.
>>> > >
>>> > > I've implemented the 'backwards dictionary' technique (see [1],
>>> > > section 6) for speeding up approximate search. This technique allows
>>> > > us to narrow down the search and, therefore, reduce the running time
>>> > > (at the expense of using more memory).
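>>> > >
>>> > > As a toy illustration of why this narrows the search (plain Java
>>> > > collections here, not the actual patch): for edit distance 1, a
>>> > > match must keep at least one half of the query intact, so two
>>> > > prefix lookups replace a scan of the whole dictionary.
>>> > >
>>> > >   import java.util.NavigableSet;
>>> > >   import java.util.SortedSet;
>>> > >
>>> > >   // forwardDict/backwardDict are NavigableSet<String> holding
>>> > >   // the original and the reversed keys; pattern is the query.
>>> > >   String p1 = pattern.substring(0, pattern.length() / 2);
>>> > >   String p2 = pattern.substring(pattern.length() / 2);
>>> > >   // Words starting with p1 (the single error is in the 2nd half).
>>> > >   SortedSet<String> front = forwardDict.subSet(p1, p1 + '\uffff');
>>> > >   // Words ending with p2, i.e. reversed words starting with the
>>> > >   // reversed p2 (the single error is in the 1st half).
>>> > >   String p2rev = new StringBuilder(p2).reverse().toString();
>>> > >   SortedSet<String> back = backwardDict.subSet(p2rev, p2rev + '\uffff');
>>> > >   // Candidates from both sets are then verified with a full
>>> > >   // edit-distance check.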
>>> > >
>>> > > I implemented it quite some time ago using an older version of Lucene,
>>> > > so I need to revisit the code. However, the implementation was
>>> > > relatively simple and it didn't require major changes to the core
>>> > > classes. I can share the code in a public repository and iterate on
>>> > > it while I make it compatible with the newer Lucene APIs, add
>>> > > benchmarks, and write more unit tests.
>>> > >
>>> > > Ideally, I'd like to contribute this to Lucene, either as part of
>>> > > core, the suggest module, or a different module.
>>> > >
>>> > > What do you think?
>>> > >
>>> > > [1] https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
