Re: simple compound words with hyphen support in speller

Jaume Ortolà i Font Wed, 15 May 2013 01:58:11 -0700

Hi Andriy,

I do something very similar for Catalan at the word tokenizer level. If a
word containing hyphens is not in the dictionary, it is split.
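
A minimal sketch of that tokenizer-level idea, assuming a plain Set<String> stands in for the dictionary. The class and method names are hypothetical and this is not the actual Catalan tokenizer code. Keeping the hyphen as its own token is also an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the real LanguageTool Catalan tokenizer.
final class HyphenAwareTokenizerSketch {

  // Stand-in for the real dictionary lookup (assumption).
  private final Set<String> dictionary;

  HyphenAwareTokenizerSketch(Set<String> dictionary) {
    this.dictionary = dictionary;
  }

  // Keeps a hyphenated word whole if the dictionary knows it;
  // otherwise splits it, emitting each hyphen as a separate token.
  List<String> tokenizeWord(String word) {
    List<String> tokens = new ArrayList<>();
    if (!word.contains("-") || dictionary.contains(word)) {
      tokens.add(word);
      return tokens;
    }
    // Zero-width lookarounds split before and after every hyphen.
    for (String part : word.split("(?<=-)|(?=-)")) {
      if (!part.isEmpty()) {
        tokens.add(part);
      }
    }
    return tokens;
  }
}
```

With this approach the speller itself never sees the hyphenated form unless it is a dictionary entry, so no change to the spelling rule is needed.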


Your changes look OK to me.

Jaume



2013/5/15 Andriy Rysin <ary...@gmail.com>

> Hi all
>
> I had some requests for the Ukrainian module to support hyphenated words
> better in our spellchecker. In general, Ukrainian has a lot of special
> hyphenated words which need to be in the dictionary, but it also has a
> lot of hyphenated compound words where both parts are just independent
> words (possibly inflected).
> So in general I want the spellchecker to first check the whole word and,
> if it's not in the dictionary, split it at the hyphen and check whether
> both parts are there.
> I used a trick with the BREAK option to support this in hunspell.
>
> I could not quickly find out whether this is easy to support in LT, so I
> wrote this change, which seems to work pretty well (although it needs some
> more work to give better suggestions for misspelled compound words).
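
One possible direction for the compound-word suggestions mentioned above, as a rough sketch only: correct each hyphen-separated part independently and re-join the results. The class name, the `isMisspelled` predicate and the `suggestPart` function are hypothetical stand-ins for the real speller calls. None of this is part of the patch or of LanguageTool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical helper; not part of LanguageTool or the patch under review.
final class CompoundSuggestionSketch {

  private static final String COMPOUND_CHAR = "-";

  // Builds suggestions for a hyphenated word by correcting each part
  // on its own and joining the parts back together with hyphens.
  static List<String> suggest(String word,
                              Predicate<String> isMisspelled,
                              Function<String, List<String>> suggestPart,
                              int maxSuggestions) {
    List<String> results = new ArrayList<>();
    results.add("");
    // Limit -1 keeps empty trailing parts, so "foo-" is not treated as "foo".
    String[] parts = word.split(COMPOUND_CHAR, -1);
    for (int i = 0; i < parts.length; i++) {
      String part = parts[i];
      List<String> options = isMisspelled.test(part)
          ? suggestPart.apply(part)
          : Collections.singletonList(part);
      if (options.isEmpty()) {
        // One part cannot be corrected, so no compound suggestion is possible.
        return Collections.emptyList();
      }
      List<String> next = new ArrayList<>();
      for (String prefix : results) {
        for (String option : options) {
          if (next.size() >= maxSuggestions) {
            break;
          }
          next.add(i == 0 ? option : prefix + COMPOUND_CHAR + option);
        }
      }
      results = next;
    }
    return results;
  }
}
```

The cap on `maxSuggestions` is there because the number of combinations grows multiplicatively with the number of misspelled parts.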
>
> As this change touches the core code, I wanted to have it reviewed here
> first to make sure it's the right way to do it and to see if anybody objects.
>
> Thanks
> Andriy
>
> Index: languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java
> ===================================================================
> --- languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java (revision 10080)
> +++ languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java (working copy)
> @@ -119,14 +119,19 @@
>       return toRuleMatchArray(ruleMatches);
>     }
>
> +
> +  protected boolean isMisspelled(MorfologikSpeller speller, String word) {
> +    return speller.isMisspelled(word);
> +  }
> +
>     private List<RuleMatch> getRuleMatch(final String word, final int startPos) {
>       final List<RuleMatch> ruleMatches = new ArrayList<RuleMatch>();
> -    if (speller.isMisspelled(word)) {
> +    if (isMisspelled(speller, word)) {
>         final RuleMatch ruleMatch = new RuleMatch(this, startPos, startPos
>             + word.length(), messages.getString("spelling"),
>             messages.getString("desc_spelling_short"));
>         //If lower case word is not a misspelled word, return it as the only suggestion
> -      if (!speller.isMisspelled(word.toLowerCase(conversionLocale))) {
> +      if (!isMisspelled(speller, word.toLowerCase(conversionLocale))) {
>           List<String> suggestion = Arrays.asList(word.toLowerCase(conversionLocale));
>           ruleMatch.setSuggestedReplacements(suggestion);
>           ruleMatches.add(ruleMatch);
> Index: languagetool-language-modules/uk/src/main/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRule.java
> ===================================================================
> --- languagetool-language-modules/uk/src/main/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRule.java (revision 10080)
> +++ languagetool-language-modules/uk/src/main/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRule.java (working copy)
> @@ -25,10 +25,12 @@
>
>   import org.languagetool.Language;
>   import org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule;
> +import org.languagetool.rules.spelling.morfologik.MorfologikSpeller;
>
>   public final class MorfologikUkrainianSpellerRule extends MorfologikSpellerRule {
>
> -  private static final String RESOURCE_FILENAME = "/uk/hunspell/uk_UA.dict";
> +  private static final String COMPOUND_CHAR = "-";
> +    private static final String RESOURCE_FILENAME = "/uk/hunspell/uk_UA.dict";
>     private static final Pattern UKRAINIAN_LETTERS = Pattern.compile(".*[а-яіїєґА-ЯІЇЄҐ].*");
>
>     public MorfologikUkrainianSpellerRule(ResourceBundle messages,
> @@ -52,5 +54,21 @@
>         return ! UKRAINIAN_LETTERS.matcher(word).matches() || super.ignoreWord(word);
>     }
>
> +  @Override
> +  protected boolean isMisspelled(MorfologikSpeller speller, String word) {
> +    if( ! super.isMisspelled(speller, word) )
> +        return false;
> +
> +    if( word.contains(COMPOUND_CHAR) ) {
> +        String[] words = word.split(COMPOUND_CHAR);
> +        for(String singleWord: words) {
> +            if( speller.isMisspelled(singleWord) )
> +                return true;
> +        }
> +        return false;
> +    }
> +
> +    return true;
> +  }
>
>   }
>
>
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>
