On 05/16/2013 02:02 PM, Daniel Naber wrote:
> On 15.05.2013 22:25, Andriy Rysin wrote:
>
>> override the spellchecker code to do some
>> look-ahead if word is in abbreviated-with-dot list
> This seems the most natural approach to me. Doing something in the
> disambiguator and then relying on that in Java code seems not that
> robust.
OK, I thought about it; there are a lot of possibilities here :) but the
simplest way to achieve what I need today is to adjust the speller so that
it knows the context of the word.
This allows the language module to ignore words based on context.
The patch is below. Basically it adds an
ignoreToken(AnalyzedTokenReadings[] tokens, int idx) method that
provides the context of the token (the whole sentence) instead of just the
word text (by default it calls the old ignoreWord() method). With this I can
override that method to make sure the word is abbreviated (e.g. followed by
a dot) but also that it's in the correct context (e.g. some abbreviations in
Ukrainian would require a capital/Latin/... letter in the next word or
some digit before, etc.).
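
To illustrate the kind of override I have in mind, here is a minimal,
self-contained sketch. It is not LanguageTool code: plain String[] stands in
for AnalyzedTokenReadings[], the abbreviation list holds hypothetical
placeholder entries, and the "capital letter in the next word" condition is
just one possible context rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AbbreviationContextSketch {

    // Hypothetical placeholder entries; a real language module would load its own list.
    static final Set<String> DOTTED_ABBREVIATIONS =
            new HashSet<>(Arrays.asList("prof", "str"));

    // Stand-in for the old word-only check; in this sketch it ignores nothing.
    static boolean ignoreWord(String word) {
        return false;
    }

    // Context-aware check: tokens is the whole sentence, idx the current token.
    static boolean ignoreToken(String[] tokens, int idx) {
        String word = tokens[idx];
        if (DOTTED_ABBREVIATIONS.contains(word)) {
            // The abbreviation must be directly followed by a dot...
            boolean dotAfter = idx + 1 < tokens.length && ".".equals(tokens[idx + 1]);
            // ...and the next non-blank token must start with a capital letter.
            int next = nextNonBlank(tokens, idx + 2);
            boolean capitalNext = next >= 0
                    && Character.isUpperCase(tokens[next].codePointAt(0));
            if (dotAfter && capitalNext) {
                return true;
            }
        }
        // Otherwise fall back to the word-only behaviour.
        return ignoreWord(word);
    }

    // Index of the first token at or after 'from' that is not whitespace, or -1.
    static int nextNonBlank(String[] tokens, int from) {
        for (int i = from; i < tokens.length; i++) {
            if (!tokens[i].trim().isEmpty()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] sentence = {"prof", ".", " ", "Smith"};
        System.out.println(ignoreToken(sentence, 0));
    }
}
```

In the real rule the override would work on AnalyzedTokenReadings and call
getToken() on the neighbouring tokens; the fallback to ignoreWord() keeps
the existing behaviour for everything that is not a listed abbreviation.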
An alternative way of doing it is to do the same for the isMisspelled()
method, which is probably technically a bit more correct but would involve
a few more changes.
I would appreciate any feedback,
Thanks,
Andriy
P.S. And yes, the "idx" implementation is a bit ugly; we can make it an
indexed for loop instead, etc. :)
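
For reference, the indexed-loop variant could look roughly like the sketch
below. It is self-contained rather than taken from the rule: String[] stands
in for AnalyzedTokenReadings[], and the ignoreToken() here is a hypothetical
stand-in that skips any token directly preceded by "@".

```java
import java.util.ArrayList;
import java.util.List;

public class IndexedLoopSketch {

    // Hypothetical context rule, only for illustration:
    // ignore a token when the previous token is "@".
    static boolean ignoreToken(String[] tokens, int idx) {
        return idx > 0 && "@".equals(tokens[idx - 1]);
    }

    // Collects the tokens that would still be passed on to the spell check.
    static List<String> tokensToCheck(String[] tokens) {
        List<String> toCheck = new ArrayList<>();
        // The loop variable is the index itself, so no separate counter is needed.
        for (int idx = 0; idx < tokens.length; idx++) {
            if (ignoreToken(tokens, idx)) {
                continue;
            }
            toCheck.add(tokens[idx]);
        }
        return toCheck;
    }

    public static void main(String[] args) {
        System.out.println(tokensToCheck(new String[]{"mail", "@", "examplle", "wrod"}));
    }
}
```

With the index as the loop variable, every "continue" still advances it
correctly, which is the part that is easy to get wrong with a hand-maintained
counter.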
Index: languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java
===================================================================
--- languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java	(revision 10119)
+++ languagetool-core/src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java	(working copy)
@@ -83,13 +83,14 @@
return toRuleMatchArray(ruleMatches);
}
}
+ int idx = -1;
skip:
for (AnalyzedTokenReadings token : tokens) {
+ idx++;
if (isUrl(token.getToken())) {
continue;
}
- final String word = token.getToken();
- if (ignoreWord(word) || token.isImmunized()) {
+ if (ignoreToken(tokens, idx) || token.isImmunized()) {
continue;
}
if (ignoreTaggedWords) {
@@ -98,6 +99,7 @@
continue skip; // if it HAS a POS tag then it is a known word.
}
}
+ final String word = token.getToken();
if (tokenizingPattern() == null) {
ruleMatches.addAll(getRuleMatch(word, token.getStartPos()));
} else {
@@ -119,6 +121,9 @@
return toRuleMatchArray(ruleMatches);
}
+ protected boolean ignoreToken(AnalyzedTokenReadings[] tokens, int idx) throws IOException {
+ return ignoreWord(tokens[idx].getToken());
+ }
protected boolean isMisspelled(MorfologikSpeller speller, String word) {
return speller.isMisspelled(word);