On Tue, Jan 6, 2009 at 5:15 PM, robert engels <reng...@ix.netcom.com> wrote:
> It is definitely going to increase the index size, but not any more than > than the external one would (if my understanding is correct). > The nice thing is that you don't have to try and keep documents numbers in > sync - it will be automatic. > > Maybe I don't understand what your external index is storing. Given that > the document contains 'robert' but the user enters' obert', what is the > process to find the matching documents? > heres a simple example. neighborhood stored for robert is 'robert obert rbert roert ...' this is indexed in a tokenized field. at query time user typoes robert and enters 'tobert'. again neighborhood is generated 'tobert obert tbert ...' the system does a query on tobert OR obert OR tbert ... and robert is returned because 'obert' is present in both neighborhoods. in this example, by storing k=1 deletions you guarantee to satisfy all edit distance matches <= 1 without linear scan. you get some false positives too with this approach, thats why what comes back is only a CANDIDATE and true edit distance must be used to verify. this might be tricky to do with your method, i don't know. > > Is the external index essentially a constant list, that given obert, the > source words COULD BE robert, tobert, reobert etc., and it contains no > document information so: > no. see above, you generate all possible 1-character deletions of the index term and store them, then at query time you generate all possible 1-character deletions of the query term. basically, LUCENE and LUBENE are 1 character different, but they are the same (LUENE) if you delete 1 character from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just store LUENE. > > given the source word X, and an edit distance k, you ask the external > dictionary for possible indexed words, and it returns the list, and then use > search lucene using each of those words? > > If the above is the case, it certainly seems you could generate this list > in real-time rather efficiently with no IO (unless the external index only > stores words which HAVE BEEN indexed). > > I think the confusion may be because I understand Otis's comments, but they > don't seem to match what you are stating. > > Essentially performing any term match requires efficient searching/matching > of the term index. If this is efficient enough, I don't think either process > is needed - just an improved real-time fuzzy possibilities word generator. > > On Jan 6, 2009, at 3:58 PM, Robert Muir wrote: > > i see, your idea would definitely simplify some things. > > What about the index size difference between this approach and using > separate index? Would this separate field increase index size? > > I guess my line of thinking is if you have 10 docs with robert, with > separate index you just have robert, and its deletion neighborhood one time. > with this approach you have the same thing, but at least you must have > document numbers and the other inverted index stuff with each neighborhood > term. would this be a significant change to size and/or performance? and > since the documents have multiple terms there is additional positional > information for slop factor for each neighborhood term... > > i think its worth investigating, maybe performance would actually be > better, just curious. i think i boxed myself in to auxiliary index because > of some other irrelevant thigns i am doing. > > On Tue, Jan 6, 2009 at 4:42 PM, robert engels <reng...@ix.netcom.com>wrote: > >> I don't think that is the case. You will have single deletion >> neighborhood. The number of unique terms in the field is going to be the >> union of the deletion dictionaries of each source term. >> For example, given the following documents A which have field 'X' with >> value best, and document B with value jest (and k == 1). >> >> A will generate est bst, bet, bes, B will generate est, jest, jst, jes >> >> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes) >> >> I don't think the storage requirement is any greater doing it this way. >> >> >> 3.2.1 Indexing >> For all words in a dictionary, and a given number of edit operations k, >> FastSS >> generates all variant spellings recursively and save them as tuples of >> type >> vā² ā Ud (v, k) ā (v, x) where v is a dictionary word and x a list of >> deletion >> positions. >> >> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for >> n >> dictionary words of length m with k mismatches. >> >> >> 3.2.2 Retrieval >> For a query p and edit distance k, first generate the neighborhood Ud (p, >> k). >> Then compare the words in the neighborhood with the index, and find >> matching candidates. Compare deletion positions for each candidate with >> the deletion positions in U(p, k), using Theorem 4. >> >> >> > > > -- > Robert Muir > rcm...@gmail.com > > > -- Robert Muir rcm...@gmail.com