Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Robert Muir Tue, 06 Jan 2009 13:07:17 -0800

a deletion neighborhood can be pretty large (for example robert is something
like robert obert rbert robrt robet ...)
so if you have a 100 million docs with 1 billion words, but only 100k unique
terms, it definitely would be wasteful to have 1 billion deletion
neighborhoods when you only need 100k.


On Tue, Jan 6, 2009 at 4:02 PM, robert engels <[email protected]> wrote:

> Why not just create a new field for this? That is, if you have FieldA,
> create field FieldAFuzzy and put the various permutations there.
>
> The fuzzy scorer/parser can be changed to automatically use the XXXXFuzzy
> field when required.
>
> You could also store positions, and allow that the first term is the
> "closest", next is the second closest, etc. to add support for a slop
> factor.
>
> This is similar to the same way fast phonetic searches can be implemented.
>
> If you do it this way, you don't have any of the synchronization issues
> between the index and the external "fuzzy" index.
>
>
>
> On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote:
>
>
>>    [ 
>> https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314
>> ]
>>
>> Robert Muir commented on LUCENE-1513:
>> -------------------------------------
>>
>> otis, discussion was on java-user.
>>
>> again, I apologize for the messy code. as mentioned there, my setup is
>> very specific to exactly what I am doing and in no way is this code ready.
>> But since i'm currently pretty busy with other things at work I just wanted
>> to put something up here for anyone else interested.
>>
>> theres the issues you mentioned, and also some i mentioned on java-user.
>> for example how to handle updates to indexes that introduce new terms (they
>> must be added to auxiliary index), or even if auxiliary index is the best
>> approach.
>>
>> the general idea is that instead of enumerating terms to find terms, the
>> deletion neighborhood as described in the paper is used instead. this way
>> search time is not linear based on number of terms. yes you are correct that
>> it only can guarantee edit distances of K which is determined at index time.
>> perhaps this should be configurable, but i hardcoded k=1 for simplicity. i
>> think its something like 80% of typos...
>>
>> as i mentioned on the list another idea is you could implement FastSS (not
>> the wC variant) with deletion positions maybe by using payloads. This would
>> require more space but eliminate the candidate verification step. maybe it
>> would be nice to have some of their other algorithms such as block-based,etc
>> available also.
>>
>>
>>
>>  fastss fuzzyquery
>>> -----------------
>>>
>>>                Key: LUCENE-1513
>>>                URL: https://issues.apache.org/jira/browse/LUCENE-1513
>>>            Project: Lucene - Java
>>>         Issue Type: New Feature
>>>         Components: contrib/*
>>>           Reporter: Robert Muir
>>>           Priority: Minor
>>>        Attachments: fastSSfuzzy.zip
>>>
>>>
>>> code for doing fuzzyqueries with fastssWC algorithm.
>>> FuzzyIndexer: given a lucene field, it enumerates all terms and creates
>>> an auxiliary offline index for fuzzy queries.
>>> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary
>>> index to retrieve a candidate list. this list is then verified with
>>> levenstein algorithm.
>>> sorry but the code is a bit messy... what I'm actually using is very
>>> different from this so its pretty much untested. but at least you can see
>>> whats going on or fix it up.
>>>
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Reply via email to