Re: [PATCH] Ignoring characters

Andriy Rysin Sun, 08 Mar 2015 10:21:53 -0700

I've found one problem with ignored characters: the Morfologik speller
does not skip them, because the sentence it receives still contains
those characters. The test patch below demonstrates this.


It seems that JLanguageTool.getRawAnalyzedSentence() removes those
characters, does the tagging, then puts the original tokens back into
the sentence, and that sentence is fed to the speller (as well as to
other rules).
The speller uses AnalyzedTokenReadings.getToken(), which returns the
word with the ignored characters still in it. Other rules may work
correctly if they use
AnalyzedTokenReadings.getAnalyzedToken(0).getToken() (which returns the
token without those chars). I think we may also want to check
PatternRule to see which method it uses and whether it needs to be
adjusted.

I could put a workaround in Ukrainian but it feels like a common
problem, so if everybody agrees we can fix it in common code. It looks
like the easiest solution is to make MorfologikSpellerRule use tokens
without those chars.
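
To illustrate the idea, here is a minimal stand-alone sketch of a lookup
that strips ignored characters before consulting the dictionary. It does
not use any LanguageTool classes: IgnoredCharsSketch, stripIgnored and
isMisspelled are hypothetical names, the Set stands in for the real
Morfologik dictionary, and the ignored set (soft hyphen U+00AD plus the
accent U+0301) is an assumption taken from this thread.

```java
import java.util.Set;

// Hypothetical stand-alone sketch; none of these names are LanguageTool APIs.
class IgnoredCharsSketch {

  // Assumed ignored set: soft hyphen (U+00AD) and combining acute accent (U+0301).
  static final String IGNORED = "\u00AD\u0301";

  // Returns the token with every ignored character removed.
  static String stripIgnored(String token) {
    StringBuilder sb = new StringBuilder(token.length());
    for (int i = 0; i < token.length(); i++) {
      char c = token.charAt(i);
      if (IGNORED.indexOf(c) < 0) {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  // Stand-in for the speller check: looks up the cleaned token, not the raw one.
  static boolean isMisspelled(String token, Set<String> dictionary) {
    return !dictionary.contains(stripIgnored(token));
  }

  public static void main(String[] args) {
    Set<String> dictionary = Set.of("tested");
    String raw = "test\u00ADed"; // "tested" with a soft hyphen inside

    // A lookup on the raw token misses because of the extra character.
    System.out.println(dictionary.contains(raw));       // false
    // A lookup on the cleaned token succeeds, so no error is reported.
    System.out.println(isMisspelled(raw, dictionary));  // false
  }
}
```

In the real rule the cleaning would presumably use the per-language
ignored set rather than a hard-coded string.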

Andriy

diff --git a/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java b/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java
index 3118b4e..cd6f011 100644
--- a/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java
+++ b/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java
@@ -45,6 +45,10 @@

     assertEquals(0, rule.match(langTool.getAnalyzedSentence("До нас приїде The Beatles!")).length);

+    // soft hyphen
+    assertEquals(0,
rule.match(langTool.getAnalyzedSentence("колискової пісні")).length);
+
+
     //incorrect sentences:

     RuleMatch[] matches = rule.match(langTool.getAnalyzedSentence("атакуючий"));

2015-01-21 22:33 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
> Ok, I've pushed a change to allow a per-language set of characters to
> be ignored in tokens (e.g. Ukrainian adds an accent U+0301 to the soft
> hyphen). Adding a reading with a null tag seems to have affected the
> position markup, so I've adjusted my rules to take that into account.
>
> Please try it and let me know how it works for you,
> Thanks
> Andriy
>
> P.S. One thing I could not figure out (yet) is correct markup for
> tokens with ignored characters in xml rules, see
> languagetool-language-modules/uk/src/main/resources/org/languagetool/rules/uk/grammar-spelling.xml:93
>
>
> 2015-01-20 11:55 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
>> Ok, so I have a token agreement rule which checks whether any of the
>> token readings has the required form. If it finds one, good; if it
>> doesn't, it shows an error. But if it finds a reading with a null tag
>> it assumes we don't know enough and skips the check for this token.
>> It seems we use a null tag for untagged words, so this works when the
>> reading with the null POSTAG is the only one. If we're saying we can
>> have additional readings with null which are "information-only", I
>> can probably adjust the logic I have.
>>
>> We could also tag the reading with ignored chars inside the same way
>> the "cleaned" token is but I am afraid the "dirty" token reading will
>> affect suggestions etc in the way we don't want.
>>
>> Andriy
>>
>> 2015-01-20 9:58 GMT-05:00 Daniel Naber <daniel.na...@languagetool.org>:
>>> On 2015-01-20 14:29, Andriy Rysin wrote:
>>>
>>>> So in JLanguageToolTest.testAnalyzedSentence() (line 133) the expected
>>>> reading for the token with a soft hyphen expects tested/null, but I
>>>> don't really understand this logic.
>>>
>>> I think the null is probably not the point. The code in
>>> JLanguageTool.getRawAnalyzedSentence() seems to re-add the token with
>>> the soft hyphen. It probably simply uses null as a POS tag because
>>> I (or whoever added it) thought it shouldn't hurt. So maybe just the
>>> token needs to be set, not another reading (adding the null reading may
>>> be just a side effect).
>>>
>>> Regards
>>>   Daniel
>>>
>>>
>>> _______________________________________________
>>> Languagetool-devel mailing list
>>> Languagetool-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
