robert theres only one problem i see: i don't see how you can do a single
search since fastssWC returns some false positives (with k=1 it will still
return some things with ED of 2). maybe if you store the deletion position
information as a payload (thus using original fastss where there are no
false positives) it would work though. i looked at storing position
information but it appeared like it might be complex and the api was (is)
still marked experimental so i didn't go that route.

i also agree lucene index might not be the best possible data structure...
just convenient thats all. i used it because i store other things related to
the term besides deletion neighborhoods for my fuzzy matching.

i guess i'll also mention that i do think storage size should be a big
consideration. you really don't need this kind of stuff unless you are
searching pretty big indexes in the first place (for <= few million docs the
default fuzzy is probably just fine for a lot of people).

for me, the whole thing was about turning 30second queries into 1 second
queries by removing a linear algorithm, i didn't really optimize much beyond
that because i was just very happy to have reasonable performance..

On Tue, Jan 6, 2009 at 6:26 PM, robert engels <[email protected]> wrote:

> I understand now.
> The index in my case would definitely be MUCH larger, but I think it would
> perform better, as you only need to do a single search - for obert (if you
> assume it was a misspelling).
>
> In your case you would eventually do an OR search in the lucene index for
> all possible matches (robert, roberta, roberto, ...) which could be much
> larger with some commonly prefixed/postfixed words).
>
> Classic performance vs. size trade-off.  In your case where it is not for
> misspellings, the performance difference might be worthwhile.
>
> Still, in your case, I am not sure using a Lucene index as the external
> index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit
> permutation) with a a key on both would allow easy search and update. If
> implemented as a service, some intelligent caching of common misspellings
> could really improve the performance.
>
> On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:
>
>
>
> On Tue, Jan 6, 2009 at 5:15 PM, robert engels <[email protected]>wrote:
>
>>  It is definitely going to increase the index size, but not any more than
>> than the external one would (if my understanding is correct).
>> The nice thing is that you don't have to try and keep documents numbers in
>> sync - it will be automatic.
>>
>> Maybe I don't understand what your external index is storing. Given that
>> the document contains 'robert' but the user enters' obert', what is the
>> process to find the matching documents?
>>
>
> heres a simple example. neighborhood stored for robert is 'robert obert
> rbert roert ...'  this is indexed in a tokenized field.
>
> at query time user typoes robert and enters 'tobert'. again neighborhood is
> generated 'tobert obert tbert ...'
> the system does a query on tobert OR obert OR tbert ... and robert is
> returned because 'obert' is present in both neighborhoods.
> in this example, by storing k=1 deletions you guarantee to satisfy all edit
> distance matches <= 1 without linear scan.
> you get some false positives too with this approach, thats why what comes
> back is only a CANDIDATE and true edit distance must be used to verify. this
> might be tricky to do with your method, i don't know.
>
>
>
>
>>
>> Is the external index essentially a constant list, that given obert, the
>> source words COULD BE robert, tobert, reobert etc., and it contains no
>> document information so:
>>
>
> no. see above, you generate all possible 1-character deletions of the index
> term and store them, then at query time you generate all possible
> 1-character deletions of the query term. basically, LUCENE and LUBENE are 1
> character different, but they are the same (LUENE) if you delete 1 character
> from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just
> store LUENE.
>
>>
>> given the source word X, and an edit distance k, you ask the external
>> dictionary for possible indexed words, and it returns the list, and then use
>> search lucene using each of those words?
>>
>> If the above is the case, it certainly seems you could generate this list
>> in real-time rather efficiently with no IO (unless the external index only
>> stores words which HAVE BEEN indexed).
>>
>> I think the confusion may be because I understand Otis's comments, but
>> they don't seem to match what you are stating.
>>
>> Essentially performing any term match requires efficient
>> searching/matching of the term index. If this is efficient enough, I don't
>> think either process is needed - just an improved real-time fuzzy
>> possibilities word generator.
>>
>> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>>
>> i see, your idea would definitely simplify some things.
>>
>> What about the index size difference between this approach and using
>> separate index? Would this separate field increase index size?
>>
>> I guess my line of thinking is if you have 10 docs with robert, with
>> separate index you just have robert, and its deletion neighborhood one time.
>> with this approach you have the same thing, but at least you must have
>> document numbers and the other inverted index stuff with each neighborhood
>> term. would this be a significant change to size and/or performance? and
>> since the documents have multiple terms there is additional positional
>> information for slop factor for each neighborhood term...
>>
>> i think its worth investigating, maybe performance would actually be
>> better, just curious. i think i boxed myself in to auxiliary index because
>> of some other irrelevant thigns i am doing.
>>
>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <[email protected]>wrote:
>>
>>>  I don't think that is the case. You will have single deletion
>>> neighborhood. The number of unique terms in the field is going to be the
>>> union of the deletion dictionaries of each source term.
>>> For example, given the following documents A which have field 'X' with
>>> value best, and document B with value jest (and k == 1).
>>>
>>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>>
>>> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>>
>>> I don't think the storage requirement is any greater doing it this way.
>>>
>>>
>>> 3.2.1 Indexing
>>> For all words in a dictionary, and a given number of edit operations k,
>>> FastSS
>>> generates all variant spellings recursively and save them as tuples of
>>> type
>>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>>> deletion
>>> positions.
>>>
>>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for
>>> n
>>> dictionary words of length m with k mismatches.
>>>
>>>
>>> 3.2.2 Retrieval
>>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>>> k).
>>> Then compare the words in the neighborhood with the index, and find
>>> matching candidates. Compare deletion positions for each candidate with
>>> the deletion positions in U(p, k), using Theorem 4.
>>>
>>>
>>>
>>
>>
>> --
>> Robert Muir
>> [email protected]
>>
>>
>>
>
>
> --
> Robert Muir
> [email protected]
>
>
>


-- 
Robert Muir
[email protected]

Reply via email to