On Sat, Dec 29, 2012 at 9:58 AM, Mikhail Khludnev <[email protected]> wrote:
> Happy New Year, Devs!
>
> Excuse me for the noob's question. I'm not able to get deep into FST
> internals. I run trivial benchmark and not really enjoyed by the results.
>
> I'm looking for the ultra-fast spelling correction. Right now I use 3.x
> SpellChecker which is backed on separate Lucene Ngram index. FWIW, it's
> persistent, not in RAMDirectory. Now the bottleneck is I/O. Reading that
> Lucene Ngram index takes too much time. I guess it might be solved by
> loading Lucene Ngram index into RAMDirectory, but I want to exploit FST
> spell check from 4.0.
>
> What I see, and what makes me wonder. Every
> DirectSpellChecker.suggestSimilar() creates new FuzzyTermsEnum and every
> time it scans the termsEnum by FilteredTermsEnum.next(). And here I hit the
> same slow IO bummer. It might be necessary detail: I read 3.x index by 4.0
> code. I don't think it changes something.

Actually, it does: when 4.x reads a 3.x index it has some non-trivial
code to handle the reordering of terms from UTF16 to Unicode sort
order.

So before concluding anything about the results you should test on a
new 4.0 index ...

> I don't know anything about FST, but I've thought that it's a compact graph
> of syllables, which is visited for finding string similar to the given i.e.
> I expect it won't scan termsEnum for every lookup.

It would be possible to create an FST and do fuzzy lookup directly
from that ... to "approximate" that you could try using
MemoryPostingsFormat (stores all terms + docs in an FST). That should
avoid all IO (assuming your OS never swaps out your process RAM ;) ),
but it will be a (maybe sizable) lower bound on the perf you'd get
with a dedicated fuzzy search on an FST ...

Mike McCandless

http://blog.mikemccandless.com
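
To make the UTF16 vs Unicode sort order point above concrete, here is a small standalone sketch in plain Java. It is not Lucene code: the class name TermOrderDemo and its helper methods are hypothetical and exist only for illustration. It only shows that the two orders disagree once a term contains a supplementary character (a surrogate pair in UTF-16), which is the kind of case where a term order based on UTF-16 code units and one based on code points would not match.

```java
import java.util.Arrays;

// Hypothetical demo class for illustration only; not part of Lucene.
final class TermOrderDemo {

    // Compare two strings by Unicode code point (this matches UTF-8 byte order).
    static int compareByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // Sort by UTF-16 code unit; String.compareTo compares char values.
    static String[] sortedUtf16(String... terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sort by Unicode code point using the comparator above.
    static String[] sortedCodePoint(String... terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy, TermOrderDemo::compareByCodePoint);
        return copy;
    }

    // Render a term as its code points so the output is readable in any console.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String bmpTerm = "\uFF61";           // U+FF61, a single UTF-16 code unit
        String suppTerm = "\uD800\uDC00";    // U+10000, encoded as a surrogate pair

        System.out.println("UTF-16 order:");
        for (String t : sortedUtf16(bmpTerm, suppTerm)) {
            System.out.println("  " + describe(t));
        }
        System.out.println("Code point order:");
        for (String t : sortedCodePoint(bmpTerm, suppTerm)) {
            System.out.println("  " + describe(t));
        }
    }
}
```

In UTF-16 order the surrogate pair sorts first, because its leading code unit 0xD800 is smaller than 0xFF61. In code point order U+FF61 sorts first, because it is smaller than U+10000.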
