Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

robert engels Tue, 06 Jan 2009 14:15:09 -0800

It is definitely going to increase the index size, but not any more than the external one would (if my understanding is correct).

The nice thing is that you don't have to try to keep document numbers in sync - it will be automatic.

Maybe I don't understand what your external index is storing. Given that the document contains 'robert' but the user enters 'obert', what is the process to find the matching documents?

Is the external index essentially a constant list that, given obert, says the source words COULD BE robert, tobert, reobert etc., and contains no document information? So:

given the source word X and an edit distance k, you ask the external dictionary for possible indexed words, it returns the list, and you then search lucene using each of those words?

If the above is the case, it certainly seems you could generate this list in real-time rather efficiently with no IO (unless the external index only stores words which HAVE BEEN indexed).
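
A rough sketch of what generating that list in real time could look like, assuming plain Levenshtein edits over a lowercase ASCII alphabet. The names edits1 and candidates are made up for illustration, and indexed_terms stands in for whatever the term dictionary would supply:

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings within one deletion, substitution or insertion of word.

    Hypothetical helper; the alphabet is an assumption, not something Lucene defines.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | substitutions | inserts


def candidates(word, k, indexed_terms):
    """Generate every string within k edits of word, keep those that are indexed."""
    seen = {word}
    frontier = {word}
    for _ in range(k):
        # Expand one more edit step, skipping strings already generated.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return sorted(seen & set(indexed_terms))


terms = ["robert", "tobert", "reobert", "bert"]
print(candidates("obert", 1, terms))
```

No IO is involved in the generation itself, but the number of generated strings grows quickly with k and with the alphabet size, so the filter against the indexed terms still has to be cheap.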

I think the confusion may be because I understand Otis's comments, but they don't seem to match what you are stating.

Essentially, performing any term match requires efficient searching/matching of the term index. If this is efficient enough, I don't think either process is needed - just an improved real-time generator of possible fuzzy words.


On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:

i see, your idea would definitely simplify some things.
What about the index size difference between this approach and using a separate index? Would this separate field increase index size?
I guess my line of thinking is: if you have 10 docs with robert, with a separate index you just have robert and its deletion neighborhood one time. with this approach you have the same thing, but at a minimum you must also have document numbers and the other inverted index stuff with each neighborhood term. would this be a significant change to size and/or performance? and since the documents have multiple terms, there is additional positional information for slop factor for each neighborhood term...
i think it's worth investigating, maybe performance would actually be better, just curious. i think i boxed myself in to an auxiliary index because of some other irrelevant things i am doing.
On Tue, Jan 6, 2009 at 4:42 PM, robert engels <[email protected]> wrote:
I don't think that is the case. You will have a single deletion neighborhood. The number of unique terms in the field is going to be the union of the deletion dictionaries of each source term.
For example, take document A, which has field 'X' with value best, and document B with value jest (and k == 1).
A will generate est, bst, bet, bes; B will generate est, jst, jet, jes,
so field FieldXFuzzy contains (est:AB, bst:A, bet:A, bes:A, jst:B, jet:B, jes:B).
I don't think the storage requirement is any greater doing it this way.
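
To make the FieldXFuzzy example concrete, here is a small sketch that builds such a field as an in-memory postings map. It assumes each variant has between 1 and k characters deleted (whether the unmodified word is also stored is left open), and deletion_variants and build_fuzzy_field are illustrative names, not Lucene API:

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(word, k):
    """Strings obtained by deleting between 1 and k characters from word."""
    variants = set()
    for n in range(1, min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), n):
            variants.add("".join(ch for i, ch in enumerate(word) if i not in positions))
    return variants


def build_fuzzy_field(docs, k):
    """Map each deletion variant to the sorted ids of documents that produce it."""
    postings = defaultdict(set)
    for doc_id, value in docs.items():
        for variant in deletion_variants(value, k):
            postings[variant].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


field_x_fuzzy = build_fuzzy_field({"A": "best", "B": "jest"}, k=1)
for term in sorted(field_x_fuzzy):
    print(term, field_x_fuzzy[term])
```

The shared variant est ends up as a single term whose posting list holds both A and B, which is the "union of the deletion dictionaries" point: the term is stored once and only the postings grow.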
3.2.1 Indexing
For all words in a dictionary, and a given number of edit operations k, FastSS generates all variant spellings recursively and saves them as tuples of type v' ∈ Ud(v, k) → (v, x), where v is a dictionary word and x a list of deletion positions.
Theorem 5. Index uses O(n m^(k+1)) space, as it stores all the variants for n dictionary words of length m with k mismatches.
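
The indexing step above can be sketched roughly as follows. This is not the paper's code: it enumerates deletion positions with itertools.combinations instead of recursing, it assumes the neighborhood includes the word itself (zero deletions), and neighborhood and build_index are made-up names:

```python
from collections import defaultdict
from itertools import combinations


def neighborhood(word, k):
    """Yield (variant, deleted_positions) for every way of deleting 0..k characters."""
    for n in range(min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), n):
            variant = "".join(ch for i, ch in enumerate(word) if i not in positions)
            yield variant, positions


def build_index(dictionary, k):
    """Map each variant v' to a list of (v, x): the source word and its deletion positions."""
    index = defaultdict(list)
    for word in dictionary:
        for variant, positions in neighborhood(word, k):
            index[variant].append((word, positions))
    return index


index = build_index(["best", "jest"], k=1)
print(index["est"])
```

Each word of length m contributes one entry per choice of deletion positions, which is where the growth with k in the space bound comes from.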


3.2.2 Retrieval
For a query p and edit distance k, first generate the neighborhood Ud(p, k). Then compare the words in the neighborhood with the index, and find matching candidates. Compare deletion positions for each candidate with the deletion positions in Ud(p, k), using Theorem 4.
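
A hedged sketch of the retrieval step. Theorem 4 is not reproduced in this excerpt, so the verification here swaps in an ordinary Levenshtein computation on each candidate instead of comparing deletion positions. All function names are illustrative:

```python
from itertools import combinations


def variants(word, k):
    """All strings obtained by deleting 0..k characters from word."""
    out = set()
    for n in range(min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), n):
            out.add("".join(ch for i, ch in enumerate(word) if i not in positions))
    return out


def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(dictionary, k):
    """Map each deletion variant to the set of dictionary words that produce it."""
    index = {}
    for word in dictionary:
        for v in variants(word, k):
            index.setdefault(v, set()).add(word)
    return index


def search(query, index, k):
    """Collect words sharing a variant with the query, then verify the distance."""
    found = set()
    for v in variants(query, k):
        found |= index.get(v, set())
    return sorted(w for w in found if edit_distance(query, w) <= k)


index = build_index(["robert", "tobert", "reobert", "best"], k=1)
print(search("obert", index, k=1))
```

The verification matters because sharing a deletion variant is necessary but not sufficient: abc and bca both reduce to bc with one deletion each, yet they are two edits apart.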





--
Robert Muir
[email protected]
