Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Robert Muir Tue, 06 Jan 2009 14:29:51 -0800

On Tue, Jan 6, 2009 at 5:15 PM, robert engels <[email protected]> wrote:


> It is definitely going to increase the index size, but not any more than
> than the external one would (if my understanding is correct).
> The nice thing is that you don't have to try and keep documents numbers in
> sync - it will be automatic.
>
> Maybe I don't understand what your external index is storing. Given that
> the document contains 'robert' but the user enters' obert', what is the
> process to find the matching documents?
>

heres a simple example. neighborhood stored for robert is 'robert obert
rbert roert ...'  this is indexed in a tokenized field.

at query time user typoes robert and enters 'tobert'. again neighborhood is
generated 'tobert obert tbert ...'
the system does a query on tobert OR obert OR tbert ... and robert is
returned because 'obert' is present in both neighborhoods.
in this example, by storing k=1 deletions you guarantee to satisfy all edit
distance matches <= 1 without linear scan.
you get some false positives too with this approach, thats why what comes
back is only a CANDIDATE and true edit distance must be used to verify. this
might be tricky to do with your method, i don't know.




>
> Is the external index essentially a constant list, that given obert, the
> source words COULD BE robert, tobert, reobert etc., and it contains no
> document information so:
>

no. see above, you generate all possible 1-character deletions of the index
term and store them, then at query time you generate all possible
1-character deletions of the query term. basically, LUCENE and LUBENE are 1
character different, but they are the same (LUENE) if you delete 1 character
from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just
store LUENE.

>
> given the source word X, and an edit distance k, you ask the external
> dictionary for possible indexed words, and it returns the list, and then use
> search lucene using each of those words?
>
> If the above is the case, it certainly seems you could generate this list
> in real-time rather efficiently with no IO (unless the external index only
> stores words which HAVE BEEN indexed).
>
> I think the confusion may be because I understand Otis's comments, but they
> don't seem to match what you are stating.
>
> Essentially performing any term match requires efficient searching/matching
> of the term index. If this is efficient enough, I don't think either process
> is needed - just an improved real-time fuzzy possibilities word generator.
>
> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>
> i see, your idea would definitely simplify some things.
>
> What about the index size difference between this approach and using
> separate index? Would this separate field increase index size?
>
> I guess my line of thinking is if you have 10 docs with robert, with
> separate index you just have robert, and its deletion neighborhood one time.
> with this approach you have the same thing, but at least you must have
> document numbers and the other inverted index stuff with each neighborhood
> term. would this be a significant change to size and/or performance? and
> since the documents have multiple terms there is additional positional
> information for slop factor for each neighborhood term...
>
> i think its worth investigating, maybe performance would actually be
> better, just curious. i think i boxed myself in to auxiliary index because
> of some other irrelevant thigns i am doing.
>
> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <[email protected]>wrote:
>
>>  I don't think that is the case. You will have single deletion
>> neighborhood. The number of unique terms in the field is going to be the
>> union of the deletion dictionaries of each source term.
>> For example, given the following documents A which have field 'X' with
>> value best, and document B with value jest (and k == 1).
>>
>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>
>> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>
>> I don't think the storage requirement is any greater doing it this way.
>>
>>
>> 3.2.1 Indexing
>> For all words in a dictionary, and a given number of edit operations k,
>> FastSS
>> generates all variant spellings recursively and save them as tuples of
>> type
>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>> deletion
>> positions.
>>
>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for
>> n
>> dictionary words of length m with k mismatches.
>>
>>
>> 3.2.2 Retrieval
>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>> k).
>> Then compare the words in the neighborhood with the index, and find
>> matching candidates. Compare deletion positions for each candidate with
>> the deletion positions in U(p, k), using Theorem 4.
>>
>>
>>
>
>
> --
> Robert Muir
> [email protected]
>
>
>


-- 
Robert Muir
[email protected]

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Reply via email to