On Mon, Jan 11, 2021 at 9:38 AM Peter Gromov <[email protected]> wrote:
>
> Hi,
>
> I'd like to contribute to the support of Hunspell in Lucene, specifically:
> * support the flags necessary for English, German, French, Spanish and
> Russian dictionaries, possibly more languages later
> * provide a public API to check if a word is misspelled
> * mirror Hunspell's suggestion algorithm in Lucene, probably in the
> "src/suggest" module
>
> For context: I work on natural language support for IntelliJ-based IDEs. We'd
> like to use Hunspell dictionaries there, but interfacing with native binaries
> proved to be slow and unreliable. So we'd prefer a JVM-only reimplementation
> of Hunspell spellchecker and suggester. Lucene's Hunspell-related code
> currently seems closest to that goal, so we thought we can enhance it further.
>
> Is there anything non-obvious that I should know before diving into the
> implementation?
Great! Note that the code currently only tries to determine the stems of a word. For that purpose, it should already support the dictionaries for the languages you mentioned (various flag encodings and all that).

There's no decompounding logic to support languages like German (for search purposes you can find some alternatives for this in the source tree). There's no "suggest" logic to generate potential correctly spelled words. I'm not sure how many dictionaries in practice really provide the options to "tweak" the default Hunspell correction algorithm.

Most of the code was written based on the documentation in the hunspell(4) manual page.

It is best to keep the tests small by making a "mini dictionary" and an associated test case when trying to fix something. Of course, it is not always easy to boil problems down into such a test, but it at least ensures things are improving and prevents playing whack-a-mole.

Example:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestZeroAffix.java
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/zeroaffix.aff
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/zeroaffix.dic

> The contribution will likely consist of many commits, dedicated to specific
> subtasks or small refactorings. Should I file separate JIRA issues for each
> of them, or having a single big one (e.g. "Hunspell improvements") is enough?
>
> Peter Gromov

IMO smaller issues are better here. If improvements have a test and don't break the other tests, then it can keep getting better.
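To give a rough idea of what a "mini dictionary" test boils down to, here is a self-contained toy sketch. It does not use Lucene's Hunspell classes: the class name MiniDictionarySketch, the AFF and DIC strings, and the stems() method are all hypothetical, and it assumes only unconditional SFX rules with single-character flags. The real tests in the source tree load small .aff/.dic resource files instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch only: NOT Lucene's Hunspell API. All names here are hypothetical.
class MiniDictionarySketch {
  // Mini affix file: suffix class 'S' with one rule (strip nothing, add "s", any stem).
  static final String AFF = "SFX S Y 1\nSFX S 0 s .\n";
  // Mini dictionary: entry count, then one "word/flags" entry per line.
  static final String DIC = "2\ncat/S\nfish\n";

  // Each rule is {flag, strip, add}; rule conditions are ignored in this sketch.
  private final List<String[]> suffixes = new ArrayList<>();
  // Maps each dictionary word to its (single-character) flags.
  private final Map<String, String> words = new HashMap<>();

  MiniDictionarySketch(String aff, String dic) {
    for (String line : aff.split("\n")) {
      String[] t = line.trim().split("\\s+");
      // Rule lines have at least 5 fields; the 4-field class header is skipped.
      if (t.length >= 5 && t[0].equals("SFX")) {
        suffixes.add(new String[] {t[1], zero(t[2]), zero(t[3])});
      }
    }
    String[] entries = dic.split("\n");
    for (int i = 1; i < entries.length; i++) { // entries[0] is the count line
      String e = entries[i].trim();
      if (e.isEmpty()) continue;
      int slash = e.indexOf('/');
      words.put(slash < 0 ? e : e.substring(0, slash),
                slash < 0 ? "" : e.substring(slash + 1));
    }
  }

  // In affix files, "0" stands for the empty string.
  private static String zero(String s) {
    return s.equals("0") ? "" : s;
  }

  // Returns the dictionary stems that can produce the given word.
  List<String> stems(String word) {
    List<String> result = new ArrayList<>();
    if (words.containsKey(word)) result.add(word);
    for (String[] rule : suffixes) {
      String flag = rule[0], strip = rule[1], add = rule[2];
      if (!word.endsWith(add) || word.length() <= add.length()) continue;
      // Undo the rule: drop the added suffix, restore the stripped characters.
      String stem = word.substring(0, word.length() - add.length()) + strip;
      String flags = words.get(stem);
      if (flags != null && flags.contains(flag)) result.add(stem);
    }
    return result;
  }

  public static void main(String[] args) {
    MiniDictionarySketch d = new MiniDictionarySketch(AFF, DIC);
    System.out.println(d.stems("cats"));  // [cat]
    System.out.println(d.stems("fishs")); // [] because "fish" lacks the S flag
  }
}
```

The point is the shape of the test rather than the stemmer itself: two or three dictionary entries, one affix rule, and a handful of assertions that isolate exactly the behaviour being fixed.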
