On Wed, Sep 7, 2011 at 2:49 PM, Nalini Kartha <[email protected]> wrote:
>
> We're still stuck on Lucene 2.2 though so it looks like the old version that
> requires a separate dictionary index to be built is the only option - is
> that correct?
Yes.

> For the k-gram based spell checker that requires the separate dictionary
> index, is there any supported method for keeping the dictionary index in
> sync with the original index, i.e. what's the best way to propagate
> adds/deletes on the original index to the dictionary index? Do you recommend
> just rebuilding the dictionary index at some regular interval?

Yes, though how often you want to rebuild it depends on your content, how
rapidly your terms change, etc. Note that in more recent Lucene 3.x releases
the k-gram spellchecker becomes more efficient to rebuild and uses less
memory: e.g. you can specify not to optimize(), and it now omits norms,
positions, and frequencies for the internal fields that don't require them.

> I'm also trying to understand how the new spellchecker (FST based) works. My
> understanding so far is that we build an FST from the term we're trying to
> find corrections for and then I'm sort of fuzzy on how the FST is
> intersected with the term dictionary. Is there any detailed documentation
> that explains this?

It builds an FSA, actually. See
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
for the exact algorithm and ugly details. For the Lucene implementation of
how an FSA is intersected with the term dictionary, you can see the code here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
Note, it's a little convoluted because it can also efficiently intersect FSAs
that accept an infinite language (e.g. arbitrary regular expressions and
wildcards), not just Levenshtein automata.
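To make the intersection idea concrete: this is NOT Lucene's actual implementation (Lucene compiles a real Levenshtein automaton per the paper above and drives the terms enum with it in AutomatonTermsEnum). It is a self-contained sketch of the same core trick -- prune whole regions of the term dictionary as soon as no automaton state can still reach a match -- using a plain trie and the classic Levenshtein DP row in place of a precompiled automaton. All class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FuzzyTermsSketch {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isTerm;
    }

    final Node root = new Node();

    void add(String term) {
        Node n = root;
        for (char c : term.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.isTerm = true;
    }

    // Collect all indexed terms within maxEdits of the query.
    List<String> fuzzyMatches(String query, int maxEdits) {
        List<String> results = new ArrayList<>();
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i <= query.length(); i++) firstRow[i] = i;
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), String.valueOf(e.getKey()),
                 firstRow, query, maxEdits, results);
        }
        return results;
    }

    private void walk(Node node, char ch, String prefix, int[] prevRow,
                      String query, int maxEdits, List<String> results) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int i = 1; i < cols; i++) {
            int insert = row[i - 1] + 1;
            int delete = prevRow[i] + 1;
            int replace = prevRow[i - 1] + (query.charAt(i - 1) == ch ? 0 : 1);
            row[i] = Math.min(insert, Math.min(delete, replace));
            rowMin = Math.min(rowMin, row[i]);
        }
        if (node.isTerm && row[cols - 1] <= maxEdits) {
            results.add(prefix);
        }
        // The key pruning step: if every DP cell already exceeds maxEdits,
        // no extension of this prefix can ever match, so the whole subtree
        // (i.e. a whole block of the term dictionary) is skipped.
        if (rowMin <= maxEdits) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), prefix + e.getKey(),
                     row, query, maxEdits, results);
            }
        }
    }

    public static void main(String[] args) {
        FuzzyTermsSketch dict = new FuzzyTermsSketch();
        dict.add("fast"); dict.add("fist"); dict.add("fasts"); dict.add("zoo");
        System.out.println(dict.fuzzyMatches("fast", 1));
    }
}
```

The subtree skip is what distinguishes this from brute-force edit distance over every term; Lucene's version gets the same effect by asking the automaton for the next acceptable term and seeking the terms enum there.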
This is just the default intersection implementation for any term dictionary,
but term dictionary implementations in Lucene 4 can override it by
implementing Terms.intersect(). One that does this to improve performance is
the BlockTree implementation:
https://issues.apache.org/jira/browse/LUCENE-3030

> Also, are there changes to the term dictionary structure (post 2.2) that are
> required to support FST based spell correction? If so, what exactly are
> those changes? A presentation I saw alluded to Fast Numeric Range queries
> introduced in Lucene 2.9 but I didn't quite understand how the two are
> related.

Theoretically, no. However, it's much easier and more efficient to implement
given the API improvements to MultiTermQuery (Uwe's work with
NumericRangeQuery opened this up).

Additionally: another reason why this stuff is a lot easier in 4.0 is that
terms in 4.0 are indexed (or at least "appear to be") in binary order,
whereas previous Lucene versions indexed and presented them in character
(UTF-16) order. A lot of the API changes needed to get this performing well
were fairly big, making them difficult/dangerous to backport to older
versions of Lucene such as 3.x.

--
lucidimagination.com
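If the binary-order vs UTF-16-order distinction seems abstract, here is a small plain-JDK demonstration (no Lucene involved) of where the two orders actually disagree: strings containing supplementary characters (code points above U+FFFF). String.compareTo() compares UTF-16 code units, which is roughly what older term dictionaries sorted by; comparing the UTF-8 encoded bytes gives binary (code point) order, which is how 4.0 terms sort. The class and helper names below are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class TermOrderDemo {

    // Unsigned lexicographic comparison of byte arrays: binary term order.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // U+10000, a supplementary code point, is encoded in UTF-16 as a
        // surrogate pair starting with 0xD800; U+FFFD is an ordinary BMP char.
        String supplementary = new String(Character.toChars(0x10000));
        String bmp = "\uFFFD";

        int utf16Order = supplementary.compareTo(bmp);
        int binaryOrder = compareBytes(
                supplementary.getBytes(StandardCharsets.UTF_8),
                bmp.getBytes(StandardCharsets.UTF_8));

        // UTF-16 order: the high surrogate 0xD800 sorts below 0xFFFD, so
        // the supplementary string comes first.
        System.out.println("UTF-16 order: "
                + (utf16Order < 0 ? "supplementary < bmp" : "supplementary > bmp"));
        // Binary order: UTF-8 bytes F0 90 80 80 sort above EF BF BD, so
        // the supplementary string comes last -- the opposite result.
        System.out.println("binary order: "
                + (binaryOrder < 0 ? "supplementary < bmp" : "supplementary > bmp"));
    }
}
```

Any code (like the FSA intersection, which seeks through terms in sorted order) that assumes one ordering breaks subtly under the other, which is part of why backporting was considered risky.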
