a deletion neighborhood can be pretty large (for example robert is something like robert obert rbert robrt robet ...) so if you have a 100 million docs with 1 billion words, but only 100k unique terms, it definitely would be wasteful to have 1 billion deletion neighborhoods when you only need 100k.
On Tue, Jan 6, 2009 at 4:02 PM, robert engels <[email protected]> wrote: > Why not just create a new field for this? That is, if you have FieldA, > create field FieldAFuzzy and put the various permutations there. > > The fuzzy scorer/parser can be changed to automatically use the XXXXFuzzy > field when required. > > You could also store positions, and allow that the first term is the > "closest", next is the second closest, etc. to add support for a slop > factor. > > This is similar to the same way fast phonetic searches can be implemented. > > If you do it this way, you don't have any of the synchronization issues > between the index and the external "fuzzy" index. > > > > On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote: > > >> [ >> https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314 >> ] >> >> Robert Muir commented on LUCENE-1513: >> ------------------------------------- >> >> otis, discussion was on java-user. >> >> again, I apologize for the messy code. as mentioned there, my setup is >> very specific to exactly what I am doing and in no way is this code ready. >> But since i'm currently pretty busy with other things at work I just wanted >> to put something up here for anyone else interested. >> >> theres the issues you mentioned, and also some i mentioned on java-user. >> for example how to handle updates to indexes that introduce new terms (they >> must be added to auxiliary index), or even if auxiliary index is the best >> approach. >> >> the general idea is that instead of enumerating terms to find terms, the >> deletion neighborhood as described in the paper is used instead. this way >> search time is not linear based on number of terms. yes you are correct that >> it only can guarantee edit distances of K which is determined at index time. >> perhaps this should be configurable, but i hardcoded k=1 for simplicity. i >> think its something like 80% of typos... >> >> as i mentioned on the list another idea is you could implement FastSS (not >> the wC variant) with deletion positions maybe by using payloads. This would >> require more space but eliminate the candidate verification step. maybe it >> would be nice to have some of their other algorithms such as block-based,etc >> available also. >> >> >> >> fastss fuzzyquery >>> ----------------- >>> >>> Key: LUCENE-1513 >>> URL: https://issues.apache.org/jira/browse/LUCENE-1513 >>> Project: Lucene - Java >>> Issue Type: New Feature >>> Components: contrib/* >>> Reporter: Robert Muir >>> Priority: Minor >>> Attachments: fastSSfuzzy.zip >>> >>> >>> code for doing fuzzyqueries with fastssWC algorithm. >>> FuzzyIndexer: given a lucene field, it enumerates all terms and creates >>> an auxiliary offline index for fuzzy queries. >>> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary >>> index to retrieve a candidate list. this list is then verified with >>> levenstein algorithm. >>> sorry but the code is a bit messy... what I'm actually using is very >>> different from this so its pretty much untested. but at least you can see >>> whats going on or fix it up. >>> >> >> -- >> This message is automatically generated by JIRA. >> - >> You can reply to this email to add a comment to the issue online. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Robert Muir [email protected]
