On 05/16/2013 02:02 PM, Daniel Naber wrote:
> On 15.05.2013 22:25, Andriy Rysin wrote:
>
>> override the spellchecker code to do some
>> look-ahead if word is in abbreviated-with-dot list
> This seems the most natural approach to me. Doing something in the
> disambiguator and then relying on that in Java code seems not that
> robust.
OK, I thought about it. There are a lot of possibilities here :) but the
simplest way to achieve what I need today is to adjust the speller so
that it knows the context of the word.
This allows the language module to ignore words based on context.
The patch is below. Basically it adds an
ignoreToken(AnalyzedTokenReadings[] tokens, int idx) method that
receives the context of the token (the whole sentence) instead of just
the word text; by default it simply calls the old ignoreWord() method.
With this I can override that method to check that the word is
abbreviated (e.g. followed by a dot) but also that it is in the correct
context (e.g. some abbreviations in Ukrainian require a
capital/Latin/... letter in the next word, or a digit before, etc.).
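
To illustrate the idea, here is a minimal sketch of such an override. It
is not LanguageTool code: plain strings stand in for
AnalyzedTokenReadings, the class name and the abbreviation list are
hypothetical (transliterated examples), and the "capital letter in the
next word" check is just one of the possible context rules.

```java
import java.util.Set;

public class ContextAwareIgnoreSketch {

  // Hypothetical abbreviated-with-dot list (transliterated examples only).
  static final Set<String> DOT_ABBREVIATIONS = Set.of("vul", "prof");

  // Stand-in for the old ignoreWord(): decides from the word text alone.
  static boolean ignoreWord(String word) {
    return false;
  }

  // Context-aware variant: sees the whole sentence and the token position.
  static boolean ignoreToken(String[] tokens, int idx) {
    String word = tokens[idx];
    if (DOT_ABBREVIATIONS.contains(word)) {
      // The abbreviation must be followed directly by a dot ...
      boolean dotAfter = idx + 1 < tokens.length && ".".equals(tokens[idx + 1]);
      // ... and the next non-whitespace token must start with a capital letter.
      int next = idx + 2;
      while (next < tokens.length && tokens[next].trim().isEmpty()) {
        next++;
      }
      boolean capitalNext = next < tokens.length
          && Character.isUpperCase(tokens[next].codePointAt(0));
      if (dotAfter && capitalNext) {
        return true;
      }
    }
    // Otherwise fall back to the word-only decision.
    return ignoreWord(word);
  }
}
```

In the real rule the override would take AnalyzedTokenReadings[] and
read the text with getToken(), as the default implementation in the
patch does.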

An alternative would be to do the same for the isMisspelled() method,
which is probably technically a bit more correct but would require a
few more changes.

I would appreciate any feedback,
Thanks,
Andriy

P.S. And yes, the "idx" implementation is a bit ugly; we could use an
indexed for loop instead, etc. :)
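
For reference, the indexed-loop variant could look roughly like the
sketch below. Again this uses plain strings instead of
AnalyzedTokenReadings; the class name, the tokensToCheck() helper and the
digit-only ignore rule are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexedLoopSketch {

  // Hypothetical ignore rule: skip tokens that consist only of digits.
  static boolean ignoreToken(String[] tokens, int idx) {
    return tokens[idx].matches("\\d+");
  }

  // Returns the tokens that would go on to the spell check.
  static List<String> tokensToCheck(String[] tokens) {
    List<String> result = new ArrayList<>();
    // The loop variable is the index itself, so no separate counter is needed.
    for (int idx = 0; idx < tokens.length; idx++) {
      if (ignoreToken(tokens, idx)) {
        continue;
      }
      result.add(tokens[idx]);
    }
    return result;
  }
}
```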


Index: languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java
===================================================================
--- languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java (revision 10119)
+++ languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java (working copy)
@@ -83,13 +83,14 @@
          return toRuleMatchArray(ruleMatches);
        }
      }
+    int idx = -1;
      skip:
      for (AnalyzedTokenReadings token : tokens) {
+        idx++;
        if (isUrl(token.getToken())) {
          continue;
        }
-      final String word = token.getToken();
-      if (ignoreWord(word) || token.isImmunized()) {
+      if (ignoreToken(tokens, idx) || token.isImmunized()) {
          continue;
        }
        if (ignoreTaggedWords) {
@@ -98,6 +99,7 @@
              continue skip; // if it HAS a POS tag then it is a known word.
          }
        }
+      final String word = token.getToken();
        if (tokenizingPattern() == null) {
          ruleMatches.addAll(getRuleMatch(word, token.getStartPos()));
        } else {
@@ -119,6 +121,9 @@
      return toRuleMatchArray(ruleMatches);
    }

+  protected boolean ignoreToken(AnalyzedTokenReadings[] tokens, int idx) throws IOException {
+      return ignoreWord(tokens[idx].getToken());
+  }

    protected boolean isMisspelled(MorfologikSpeller speller, String word) {
      return speller.isMisspelled(word);

