I think you would need to store the position in the stream using
position == to the k factor. Pretty straightforward, both for
indexing and for searching.
I think if you want the utmost in performance this is the way to go.
If you don't want to store all of the additional data, I still think
a better fuzzy search can be done without the external index
entirely. As I see it, the external index's sole purpose (in your
case) is to provide "indexed" words at a certain edit distance given
a certain source word. Using a combination of inverting the alg, and
a binary selective search on the term index.
On Jan 6, 2009, at 5:44 PM, Robert Muir wrote:
robert theres only one problem i see: i don't see how you can do a
single search since fastssWC returns some false positives (with k=1
it will still return some things with ED of 2). maybe if you store
the deletion position information as a payload (thus using original
fastss where there are no false positives) it would work though. i
looked at storing position information but it appeared like it
might be complex and the api was (is) still marked experimental so
i didn't go that route.
i also agree lucene index might not be the best possible data
structure... just convenient thats all. i used it because i store
other things related to the term besides deletion neighborhoods for
my fuzzy matching.
i guess i'll also mention that i do think storage size should be a
big consideration. you really don't need this kind of stuff unless
you are searching pretty big indexes in the first place (for <= few
million docs the default fuzzy is probably just fine for a lot of
people).
for me, the whole thing was about turning 30second queries into 1
second queries by removing a linear algorithm, i didn't really
optimize much beyond that because i was just very happy to have
reasonable performance..
On Tue, Jan 6, 2009 at 6:26 PM, robert engels
<reng...@ix.netcom.com> wrote:
I understand now.
The index in my case would definitely be MUCH larger, but I think
it would perform better, as you only need to do a single search -
for obert (if you assume it was a misspelling).
In your case you would eventually do an OR search in the lucene
index for all possible matches (robert, roberta, roberto, ...)
which could be much larger with some commonly prefixed/postfixed
words).
Classic performance vs. size trade-off. In your case where it is
not for misspellings, the performance difference might be worthwhile.
Still, in your case, I am not sure using a Lucene index as the
external index is appropriate. Maybe a simple BTREE (Derby?) index
of (word,edit permutation) with a a key on both would allow easy
search and update. If implemented as a service, some intelligent
caching of common misspellings could really improve the performance.
On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:
On Tue, Jan 6, 2009 at 5:15 PM, robert engels
<reng...@ix.netcom.com> wrote:
It is definitely going to increase the index size, but not any
more than than the external one would (if my understanding is
correct).
The nice thing is that you don't have to try and keep documents
numbers in sync - it will be automatic.
Maybe I don't understand what your external index is storing.
Given that the document contains 'robert' but the user enters'
obert', what is the process to find the matching documents?
heres a simple example. neighborhood stored for robert is 'robert
obert rbert roert ...' this is indexed in a tokenized field.
at query time user typoes robert and enters 'tobert'. again
neighborhood is generated 'tobert obert tbert ...'
the system does a query on tobert OR obert OR tbert ... and robert
is returned because 'obert' is present in both neighborhoods.
in this example, by storing k=1 deletions you guarantee to satisfy
all edit distance matches <= 1 without linear scan.
you get some false positives too with this approach, thats why
what comes back is only a CANDIDATE and true edit distance must be
used to verify. this might be tricky to do with your method, i
don't know.
Is the external index essentially a constant list, that given
obert, the source words COULD BE robert, tobert, reobert etc., and
it contains no document information so:
no. see above, you generate all possible 1-character deletions of
the index term and store them, then at query time you generate all
possible 1-character deletions of the query term. basically,
LUCENE and LUBENE are 1 character different, but they are the same
(LUENE) if you delete 1 character from both of them. so you dont
need to store LUCENE LUBENE LUDENE, you just store LUENE.
given the source word X, and an edit distance k, you ask the
external dictionary for possible indexed words, and it returns the
list, and then use search lucene using each of those words?
If the above is the case, it certainly seems you could generate
this list in real-time rather efficiently with no IO (unless the
external index only stores words which HAVE BEEN indexed).
I think the confusion may be because I understand Otis's comments,
but they don't seem to match what you are stating.
Essentially performing any term match requires efficient searching/
matching of the term index. If this is efficient enough, I don't
think either process is needed - just an improved real-time fuzzy
possibilities word generator.
On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
i see, your idea would definitely simplify some things.
What about the index size difference between this approach and
using separate index? Would this separate field increase index size?
I guess my line of thinking is if you have 10 docs with robert,
with separate index you just have robert, and its deletion
neighborhood one time. with this approach you have the same
thing, but at least you must have document numbers and the other
inverted index stuff with each neighborhood term. would this be a
significant change to size and/or performance? and since the
documents have multiple terms there is additional positional
information for slop factor for each neighborhood term...
i think its worth investigating, maybe performance would actually
be better, just curious. i think i boxed myself in to auxiliary
index because of some other irrelevant thigns i am doing.
On Tue, Jan 6, 2009 at 4:42 PM, robert engels
<reng...@ix.netcom.com> wrote:
I don't think that is the case. You will have single deletion
neighborhood. The number of unique terms in the field is going to
be the union of the deletion dictionaries of each source term.
For example, given the following documents A which have field 'X'
with value best, and document B with value jest (and k == 1).
A will generate est bst, bet, bes, B will generate est, jest,
jst, jes
so field FieldXFuzzy contains
(est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
I don't think the storage requirement is any greater doing it
this way.
3.2.1 Indexing
For all words in a dictionary, and a given number of edit
operations k, FastSS
generates all variant spellings recursively and save them as
tuples of type
vā² ā Ud (v, k) ā (v, x) where v is a dictionary word and x a
list of deletion
positions.
Theorem 5. Index uses O(nmk+1) space, as it stores al l the
variants for n
dictionary words of length m with k mismatches.
3.2.2 Retrieval
For a query p and edit distance k, first generate the
neighborhood Ud (p, k).
Then compare the words in the neighborhood with the index, and find
matching candidates. Compare deletion positions for each
candidate with
the deletion positions in U(p, k), using Theorem 4.
--
Robert Muir
rcm...@gmail.com
--
Robert Muir
rcm...@gmail.com
--
Robert Muir
rcm...@gmail.com