Re: The SENT_END challenge
Thank you for your answer, Jaume.

On 9 August 2014 11:30, Jaume Ortolà i Font <jaumeort...@gmail.com> wrote:

> Hi,
>
> A possible and simple solution is to write two rules. One for sentences
> with ending punctuation:
>
>   <pattern>
>     <marker>
>       <token regexp="yes">(you|thei|ou)r</token>
>     </marker>
>     <token regexp="yes">[.?!]</token>
>   </pattern>
>
> And another one for sentences without ending punctuation:
>
>   <pattern>
>     <marker>
>       <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
>     </marker>
>   </pattern>
>
> They are in fact two different patterns, so it is logical to use two
> different rules.

Actually they are the same issue, separated only by the lack of an EOS symbol in the second case. The pattern varies because of the tokenizing, but the fact is that every rule involving SENT_END must be duplicated. Since there are many potential rules based on the end of the sentence, I think it is worth thinking about a way to avoid this duplication. By the way, duplicated code (http://en.wikipedia.org/wiki/Duplicate_code) is generally considered a code smell (http://en.wikipedia.org/wiki/Code_smell) and should therefore be avoided.

Regards,
Juan Martorell

--
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
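As a partial mitigation of the duplication discussed in this thread, the two variants can at least be kept under one `<rulegroup>` so they share an id and name. This is only a sketch following the usual grammar.xml conventions; the id, name, and message placeholders are made up, and the pattern bodies themselves are still duplicated:

```xml
<!-- Sketch only: id, name, and message text are invented for illustration. -->
<rulegroup id="POSSESSIVE_AT_SENT_END" name="Possessive at sentence end">
  <rule>
    <pattern>
      <marker>
        <token regexp="yes">(you|thei|ou)r</token>
      </marker>
      <token regexp="yes">[.?!]</token>
    </pattern>
    <message>...</message>
  </rule>
  <rule>
    <pattern>
      <marker>
        <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
      </marker>
    </pattern>
    <message>...</message>
  </rule>
</rulegroup>
```

This does not remove the duplicated pattern logic, but it keeps the two cases visibly next to each other under a single rule id.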
[RFC][PATCH] Add analyzed token readings to failed bad sentence test output
If a rule test fails because no error has been found in the bad example sentence, one of the reasons can be that the tokenization of the bad example sentence does not match the one expected in the rule itself. To identify these cases more easily, add the token readings to the assertion message.

Signed-off-by: Silvan Jegen s.je...@gmail.com
---
Hi

I had difficulties when creating Japanese rules because the mecab program I used to determine the tokenization of the example phrases produced different tokens than the tokenization library used in LanguageTool. It took me quite a while to find out why the errors in my bad example sentences were not found. Having the tokenized readings of the bad example sentences in the assertion message makes it easier to identify issues with tokenization.

I realize that this change may be less useful for languages with easier tokenization, but I still think it would be nice to discuss whether it would make sense to include this output. Maybe there is another functionality in LanguageTool that I do not know of that would make the suggested changes unnecessary?

If including the analyzed token readings is useful in other assertion messages as well, it may also be better to refactor the token-reading code into its own function and make it less ad hoc.

What do you think?
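For illustration, the refactoring suggested above could look roughly like the following sketch. The class and method names are hypothetical, and plain strings stand in for LanguageTool's AnalyzedTokenReadings so the example is self-contained; this is not actual LanguageTool code:

```java
// Sketch of pulling the readings-formatting loop into a reusable helper.
// ReadingsFormatter and formatReadings are hypothetical names; plain strings
// stand in for LanguageTool's AnalyzedTokenReadings objects here.
public class ReadingsFormatter {

  // Joins the string form of each token reading into one diagnostic line,
  // mirroring the StringBuilder loop in the patch below.
  static String formatReadings(Object[] readings) {
    StringBuilder sb = new StringBuilder("Analyzed token readings:");
    for (Object r : readings) {
      sb.append(" ").append(r.toString());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String[] tokens = {"これ[pron]", "は[particle]", "ペン[noun]"};
    // prints "Analyzed token readings: これ[pron] は[particle] ペン[noun]"
    System.out.println(formatReadings(tokens));
  }
}
```

Any assertion message that wants the readings could then call the helper instead of repeating the loop.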
(If you want to include the patch, I can open a pull request on Github if you prefer.)

Cheers,

Silvan

 .../org/languagetool/rules/patterns/PatternRuleTest.java | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
index 0d5580d..d279b36 100644
--- a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
+++ b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
@@ -22,6 +22,7 @@ import java.io.File;
 import java.io.IOException;
 import java.io.InputStream;
 import java.lang.String;
+import java.lang.StringBuilder;
 import java.util.*;
 
 import junit.framework.TestCase;
@@ -281,9 +282,17 @@ public class PatternRuleTest extends TestCase {
       }
       if (!rule.isWithComplexPhrase()) {
-        assertTrue(lang + ": Did expect one error in: \"" + badSentence
-            + "\" (Rule: " + rule + "), but found " + matches.size()
-            + ". Additional info: " + rule.getMessage() + ", Matches: " + matches, matches.size() == 1);
+        if (matches.size() != 1) {
+          final AnalyzedSentence analyzedSentence = languageTool.getAnalyzedSentence(badSentence);
+          final AnalyzedTokenReadings[] analyzedTR = analyzedSentence.getTokens();
+          final StringBuilder sb = new StringBuilder("Analyzed token readings:");
+          for (AnalyzedTokenReadings atr : analyzedTR) {
+            sb.append(" " + atr.toString());
+          }
+          assertTrue(lang + ": Did expect one error in: \"" + badSentence
+              + "\" (Rule: " + rule + "), but found " + matches.size()
+              + ". Additional info: " + rule.getMessage() + ", " + sb.toString() + ", Matches: " + matches, matches.size() == 1);
+        }
         assertEquals(lang + ": Incorrect match position markup (start) for rule " + rule + ", sentence: " + badSentence, expectedMatchStart, matches.get(0).getFromPos());
-- 
2.0.4
Re: [RFC][PATCH] Add analyzed token readings to failed bad sentence test output
On 2014-08-10 17:37, Silvan Jegen wrote:

> If including the analyzed token readings is useful in other assertion
> messages as well, it may also be better to refactor the token-reading
> code into its own function and make it less ad hoc.
>
> What do you think?

Thanks, I have committed your changes (plus some more newlines, to make the output easier to read).

Regards
 Daniel
chunks in exceptions
Hi all

I was writing a rule where I had to catch a phrase whose last word is a noun, but only if that noun is not part of an adverb chunk (with another word following). The best way to do that seems to be to use the adverb chunk in an exception, but it looks like this is not supported.

So after multiple experiments and reading our wiki pages, I wrote a rule like the one below. It works, but it generates a warning:

  Running pattern rule tests for Ukrainian...
  The Ukrainian rule: PASSIVE_PREDICATE:1 (exception in token [4]), token [4], contains "тією мірою" that contains token separators, so can't possibly be matched.

The current solution looks ugly, and although it works I'd like to do it right. So the question is: is there a reason why chunks are not supported in exceptions? And if there is a reason we should not support chunks in exceptions, what's the best way to write the rule below (without splitting it in two or complicating it even more)?

Thanks
Andriy

  <pattern>
    <token postag_regexp="yes" postag="(noun|pron).*v_zna.*"/>
    <marker>
      <token postag="impers"/>
      <token postag_regexp="yes" postag="adj:.*v_oru.*" min="0" max="1"/>
      <token postag_regexp="yes" postag="noun.*v_oru.*:ist.*|pron.*v_oru.*">
        <exception postag_regexp="yes" postag="noun.*v_oru$"/>
        <exception inflected="yes">тією мірою</exception>
      </token>
    </marker>
  </pattern>
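If exceptions did accept chunks, the workaround could collapse considerably. The following is purely hypothetical syntax, modeled on the chunk attribute that `<token>` already accepts; the chunk name "adv" is invented and exceptions do not currently support this attribute:

```xml
<token postag_regexp="yes" postag="noun.*v_oru.*:ist.*|pron.*v_oru.*">
  <!-- hypothetical: <exception> does not currently accept a chunk attribute,
       and the chunk name "adv" is made up for illustration -->
  <exception chunk="adv"/>
</token>
```

This would replace the inflected multi-word exception (and its token-separator warning) with a single chunk-based exception.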