File based spellcheck with doc frequencies supplied
---------------------------------------------------

                 Key: LUCENE-1532
                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/spellchecker
            Reporter: David Bowen


The file-based spellchecker treats all words in the dictionary as equally 
valid, so it can suggest a very obscure word rather than a more common word 
that is equally close to the misspelled input.  It would be very useful to 
have the option of supplying an integer with each word that indicates how 
common it is.  For example, the integer could be the word's document 
frequency in some index or set of indexes.
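
To make the idea concrete, here is a minimal sketch of reading a word list in 
which each line carries an optional integer.  The "word, whitespace, 
frequency" layout, the FrequencyDictionary class, and its parse method are all 
hypothetical and chosen only for illustration; nothing here is part of the 
existing spellchecker API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for dictionary lines of the form "word [frequency]". */
final class FrequencyDictionary {
    private FrequencyDictionary() {}

    /** Assumed default for lines with no number, so plain word lists still load. */
    static final int DEFAULT_FREQUENCY = 1;

    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\\s+");
            int freq = parts.length > 1
                    ? Integer.parseInt(parts[1])
                    : DEFAULT_FREQUENCY;
            freqs.put(parts[0], freq);
        }
        return freqs;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on the dictionary 
file.  A malformed number makes Integer.parseInt throw NumberFormatException, 
which this sketch deliberately does not catch.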

I've implemented a modification to the spellcheck API to support this.  It 
defines a DocFrequencyInfo interface for obtaining the doc frequency of a 
word, and a class that implements the interface by looking up the frequency 
in an index.  Lucene users can then provide alternative implementations of 
DocFrequencyInfo.  I could submit this as a patch if there is interest.  
Alternatively, it might be better to just extend the spellcheck API so that 
the frequencies can be supplied when you create a PlainTextDictionary, but 
that would mean storing the frequencies somewhere when building the 
spellcheck index, and I'm not sure how best to do that.
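
As a rough illustration of the interface approach, the sketch below shows one 
possible shape for it.  Only the name DocFrequencyInfo comes from the 
description above; the docFreq(String) method signature, the 
MapDocFrequencyInfo and FrequencyAwareSuggester classes, and the use of plain 
Levenshtein distance are assumptions made for this example.  An index-backed 
implementation could presumably delegate to IndexReader.docFreq(Term); a 
map-backed one is used here to keep the sketch free of Lucene dependencies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Assumed shape of the interface; the actual signature is not shown in the issue. */
interface DocFrequencyInfo {
    int docFreq(String word);
}

/** Map-backed stand-in for the index-backed implementation. */
final class MapDocFrequencyInfo implements DocFrequencyInfo {
    private final Map<String, Integer> freqs;

    MapDocFrequencyInfo(Map<String, Integer> freqs) {
        this.freqs = new HashMap<>(freqs);
    }

    @Override
    public int docFreq(String word) {
        return freqs.getOrDefault(word, 0); // unknown words count as frequency 0
    }
}

final class FrequencyAwareSuggester {
    private FrequencyAwareSuggester() {}

    /** Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Orders candidates by closeness first, then by descending doc frequency. */
    static List<String> rank(String misspelled, List<String> candidates,
                             DocFrequencyInfo info) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator
                .comparingInt((String w) -> editDistance(misspelled, w))
                .thenComparing(
                        Comparator.comparingInt((String w) -> info.docFreq(w))
                                  .reversed()));
        return sorted;
    }
}
```

With frequencies word=5000 and ward=40, ranking the candidates "ward" and 
"word" for the input "wrd" puts "word" first, since both are one edit away and 
frequency breaks the tie.  Frequency is only a tie-breaker in this sketch, so a 
very common word that is further from the input still sorts after closer ones.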


