Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

robert engels Tue, 06 Jan 2009 17:04:26 -0800

I think you would need to store the position in the stream usingposition == to the k factor. Pretty straightforward, both forindexing and for searching.


I think if you want the utmost in performance this is the way to go.

If you don't want to store all of the additional data, I still thinka better fuzzy search can be done without the external indexentirely. As I see it, the external index's sole purpose (in yourcase) is to provide "indexed" words at a certain edit distance givena certain source word. Using a combination of inverting the alg, anda binary selective search on the term index.


On Jan 6, 2009, at 5:44 PM, Robert Muir wrote:

robert theres only one problem i see: i don't see how you can do asingle search since fastssWC returns some false positives (with k=1it will still return some things with ED of 2). maybe if you storethe deletion position information as a payload (thus using originalfastss where there are no false positives) it would work though. ilooked at storing position information but it appeared like itmight be complex and the api was (is) still marked experimental soi didn't go that route.
i also agree lucene index might not be the best possible datastructure... just convenient thats all. i used it because i storeother things related to the term besides deletion neighborhoods formy fuzzy matching.
i guess i'll also mention that i do think storage size should be abig consideration. you really don't need this kind of stuff unlessyou are searching pretty big indexes in the first place (for <= fewmillion docs the default fuzzy is probably just fine for a lot ofpeople).
for me, the whole thing was about turning 30second queries into 1second queries by removing a linear algorithm, i didn't reallyoptimize much beyond that because i was just very happy to havereasonable performance..
On Tue, Jan 6, 2009 at 6:26 PM, robert engels<[email protected]> wrote:
I understand now.
The index in my case would definitely be MUCH larger, but I thinkit would perform better, as you only need to do a single search -for obert (if you assume it was a misspelling).
In your case you would eventually do an OR search in the luceneindex for all possible matches (robert, roberta, roberto, ...)which could be much larger with some commonly prefixed/postfixedwords).
Classic performance vs. size trade-off. In your case where it isnot for misspellings, the performance difference might be worthwhile.
Still, in your case, I am not sure using a Lucene index as theexternal index is appropriate. Maybe a simple BTREE (Derby?) indexof (word,edit permutation) with a a key on both would allow easysearch and update. If implemented as a service, some intelligentcaching of common misspellings could really improve the performance.
On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:
On Tue, Jan 6, 2009 at 5:15 PM, robert engels<[email protected]> wrote:It is definitely going to increase the index size, but not anymore than than the external one would (if my understanding iscorrect).
The nice thing is that you don't have to try and keep documentsnumbers in sync - it will be automatic.
Maybe I don't understand what your external index is storing.Given that the document contains 'robert' but the user enters'obert', what is the process to find the matching documents?
heres a simple example. neighborhood stored for robert is 'robertobert rbert roert ...' this is indexed in a tokenized field.
at query time user typoes robert and enters 'tobert'. againneighborhood is generated 'tobert obert tbert ...'the system does a query on tobert OR obert OR tbert ... and robertis returned because 'obert' is present in both neighborhoods.in this example, by storing k=1 deletions you guarantee to satisfyall edit distance matches <= 1 without linear scan.you get some false positives too with this approach, thats whywhat comes back is only a CANDIDATE and true edit distance must beused to verify. this might be tricky to do with your method, idon't know.
Is the external index essentially a constant list, that givenobert, the source words COULD BE robert, tobert, reobert etc., andit contains no document information so:
no. see above, you generate all possible 1-character deletions ofthe index term and store them, then at query time you generate allpossible 1-character deletions of the query term. basically,LUCENE and LUBENE are 1 character different, but they are the same(LUENE) if you delete 1 character from both of them. so you dontneed to store LUCENE LUBENE LUDENE, you just store LUENE.
given the source word X, and an edit distance k, you ask theexternal dictionary for possible indexed words, and it returns thelist, and then use search lucene using each of those words?
If the above is the case, it certainly seems you could generatethis list in real-time rather efficiently with no IO (unless theexternal index only stores words which HAVE BEEN indexed).
I think the confusion may be because I understand Otis's comments,but they don't seem to match what you are stating.
Essentially performing any term match requires efficient searching/matching of the term index. If this is efficient enough, I don'tthink either process is needed - just an improved real-time fuzzypossibilities word generator.
On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
i see, your idea would definitely simplify some things.
What about the index size difference between this approach andusing separate index? Would this separate field increase index size?
I guess my line of thinking is if you have 10 docs with robert,with separate index you just have robert, and its deletionneighborhood one time. with this approach you have the samething, but at least you must have document numbers and the otherinverted index stuff with each neighborhood term. would this be asignificant change to size and/or performance? and since thedocuments have multiple terms there is additional positionalinformation for slop factor for each neighborhood term...
i think its worth investigating, maybe performance would actuallybe better, just curious. i think i boxed myself in to auxiliaryindex because of some other irrelevant thigns i am doing.
On Tue, Jan 6, 2009 at 4:42 PM, robert engels<[email protected]> wrote:I don't think that is the case. You will have single deletionneighborhood. The number of unique terms in the field is going tobe the union of the deletion dictionaries of each source term.
For example, given the following documents A which have field 'X'with value best, and document B with value jest (and k == 1).
A will generate est bst, bet, bes, B will generate est, jest,jst, jes
so field FieldXFuzzy contains(est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
I don't think the storage requirement is any greater doing itthis way.
3.2.1 Indexing
For all words in a dictionary, and a given number of editoperations k, FastSSgenerates all variant spellings recursively and save them astuples of typev′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x alist of deletion
positions.
Theorem 5. Index uses O(nmk+1) space, as it stores al l thevariants for n
dictionary words of length m with k mismatches.


3.2.2 Retrieval
For a query p and edit distance k, first generate theneighborhood Ud (p, k).
Then compare the words in the neighborhood with the index, and find
matching candidates. Compare deletion positions for eachcandidate with
the deletion positions in U(p, k), using Theorem 4.





--
Robert Muir
[email protected]
--
Robert Muir
[email protected]
--
Robert Muir
[email protected]

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Reply via email to