robert theres only one problem i see: i don't see how you can do a single search since fastssWC returns some false positives (with k=1 it will still return some things with ED of 2). maybe if you store the deletion position information as a payload (thus using original fastss where there are no false positives) it would work though. i looked at storing position information but it appeared like it might be complex and the api was (is) still marked experimental so i didn't go that route.
i also agree lucene index might not be the best possible data structure... just convenient thats all. i used it because i store other things related to the term besides deletion neighborhoods for my fuzzy matching. i guess i'll also mention that i do think storage size should be a big consideration. you really don't need this kind of stuff unless you are searching pretty big indexes in the first place (for <= few million docs the default fuzzy is probably just fine for a lot of people). for me, the whole thing was about turning 30second queries into 1 second queries by removing a linear algorithm, i didn't really optimize much beyond that because i was just very happy to have reasonable performance.. On Tue, Jan 6, 2009 at 6:26 PM, robert engels <[email protected]> wrote: > I understand now. > The index in my case would definitely be MUCH larger, but I think it would > perform better, as you only need to do a single search - for obert (if you > assume it was a misspelling). > > In your case you would eventually do an OR search in the lucene index for > all possible matches (robert, roberta, roberto, ...) which could be much > larger with some commonly prefixed/postfixed words). > > Classic performance vs. size trade-off. In your case where it is not for > misspellings, the performance difference might be worthwhile. > > Still, in your case, I am not sure using a Lucene index as the external > index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit > permutation) with a a key on both would allow easy search and update. If > implemented as a service, some intelligent caching of common misspellings > could really improve the performance. > > On Jan 6, 2009, at 4:29 PM, Robert Muir wrote: > > > > On Tue, Jan 6, 2009 at 5:15 PM, robert engels <[email protected]>wrote: > >> It is definitely going to increase the index size, but not any more than >> than the external one would (if my understanding is correct). >> The nice thing is that you don't have to try and keep documents numbers in >> sync - it will be automatic. >> >> Maybe I don't understand what your external index is storing. Given that >> the document contains 'robert' but the user enters' obert', what is the >> process to find the matching documents? >> > > heres a simple example. neighborhood stored for robert is 'robert obert > rbert roert ...' this is indexed in a tokenized field. > > at query time user typoes robert and enters 'tobert'. again neighborhood is > generated 'tobert obert tbert ...' > the system does a query on tobert OR obert OR tbert ... and robert is > returned because 'obert' is present in both neighborhoods. > in this example, by storing k=1 deletions you guarantee to satisfy all edit > distance matches <= 1 without linear scan. > you get some false positives too with this approach, thats why what comes > back is only a CANDIDATE and true edit distance must be used to verify. this > might be tricky to do with your method, i don't know. > > > > >> >> Is the external index essentially a constant list, that given obert, the >> source words COULD BE robert, tobert, reobert etc., and it contains no >> document information so: >> > > no. see above, you generate all possible 1-character deletions of the index > term and store them, then at query time you generate all possible > 1-character deletions of the query term. basically, LUCENE and LUBENE are 1 > character different, but they are the same (LUENE) if you delete 1 character > from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just > store LUENE. > >> >> given the source word X, and an edit distance k, you ask the external >> dictionary for possible indexed words, and it returns the list, and then use >> search lucene using each of those words? >> >> If the above is the case, it certainly seems you could generate this list >> in real-time rather efficiently with no IO (unless the external index only >> stores words which HAVE BEEN indexed). >> >> I think the confusion may be because I understand Otis's comments, but >> they don't seem to match what you are stating. >> >> Essentially performing any term match requires efficient >> searching/matching of the term index. If this is efficient enough, I don't >> think either process is needed - just an improved real-time fuzzy >> possibilities word generator. >> >> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote: >> >> i see, your idea would definitely simplify some things. >> >> What about the index size difference between this approach and using >> separate index? Would this separate field increase index size? >> >> I guess my line of thinking is if you have 10 docs with robert, with >> separate index you just have robert, and its deletion neighborhood one time. >> with this approach you have the same thing, but at least you must have >> document numbers and the other inverted index stuff with each neighborhood >> term. would this be a significant change to size and/or performance? and >> since the documents have multiple terms there is additional positional >> information for slop factor for each neighborhood term... >> >> i think its worth investigating, maybe performance would actually be >> better, just curious. i think i boxed myself in to auxiliary index because >> of some other irrelevant thigns i am doing. >> >> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <[email protected]>wrote: >> >>> I don't think that is the case. You will have single deletion >>> neighborhood. The number of unique terms in the field is going to be the >>> union of the deletion dictionaries of each source term. >>> For example, given the following documents A which have field 'X' with >>> value best, and document B with value jest (and k == 1). >>> >>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes >>> >>> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes) >>> >>> I don't think the storage requirement is any greater doing it this way. >>> >>> >>> 3.2.1 Indexing >>> For all words in a dictionary, and a given number of edit operations k, >>> FastSS >>> generates all variant spellings recursively and save them as tuples of >>> type >>> vā² ā Ud (v, k) ā (v, x) where v is a dictionary word and x a list of >>> deletion >>> positions. >>> >>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for >>> n >>> dictionary words of length m with k mismatches. >>> >>> >>> 3.2.2 Retrieval >>> For a query p and edit distance k, first generate the neighborhood Ud (p, >>> k). >>> Then compare the words in the neighborhood with the index, and find >>> matching candidates. Compare deletion positions for each candidate with >>> the deletion positions in U(p, k), using Theorem 4. >>> >>> >>> >> >> >> -- >> Robert Muir >> [email protected] >> >> >> > > > -- > Robert Muir > [email protected] > > > -- Robert Muir [email protected]
