Re: The SENT_END challenge

2014-08-10 Thread Juan Martorell
Thank you for your answer, Jaume.


On 9 August 2014 11:30, Jaume Ortolà i Font jaumeort...@gmail.com wrote:

 Hi,

 A possible and simple solution is to write two rules. One for sentences
 with ending punctuation:

 <pattern>
  <marker>
   <token regexp="yes">(you|thei|ou)r</token>
  </marker>
  <token regexp="yes">[.?!]</token>
 </pattern>

 And another one for sentences without ending punctuation:

 <pattern>
  <marker>
   <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
  </marker>
 </pattern>


 They are in fact two different patterns, so it is logical to use two
 different rules.


Actually they are the same issue, only separated by the lack of an EOS
symbol in the second case. The pattern varies because of the tokenizing, but
the fact is that every rule regarding SENT_END must be duplicated. Since
there are many potential rules based on the end of the sentence, I think it is
worth thinking about a way to avoid this duplication. BTW, duplicated code
(http://en.wikipedia.org/wiki/Duplicate_code) is generally considered a code
smell (http://en.wikipedia.org/wiki/Code_smell) and thus should be avoided.
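
For illustration, a <rulegroup> would at least keep the two variants together
under one id and name (the id and message below are invented for this sketch),
although the pattern itself still has to be written twice:

<rulegroup id="POSSESSIVE_AT_SENT_END" name="Possessive at sentence end (sketch)">
 <rule>
  <pattern>
   <marker>
    <token regexp="yes">(you|thei|ou)r</token>
   </marker>
   <token regexp="yes">[.?!]</token>
  </pattern>
  <message>Possible error: possessive at the end of the sentence.</message>
 </rule>
 <rule>
  <pattern>
   <marker>
    <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
   </marker>
  </pattern>
  <message>Possible error: possessive at the end of the sentence.</message>
 </rule>
</rulegroup>

So grouping helps with organization, but the duplication of the pattern (and
the message) remains, which is exactly what I would like to avoid.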

Regards,

Juan Martorell


[RFC][PATCH] Add analyzed token readings to failed bad sentence test output

2014-08-10 Thread Silvan Jegen
If a rule test fails because no error has been found in the bad example
sentence, one of the reasons can be that the tokenization of the bad
example sentence does not match the one expected by the rule itself.

To identify these cases more easily, add the token readings to the
assertion message.

Signed-off-by: Silvan Jegen s.je...@gmail.com
---

Hi

I had difficulties when creating Japanese rules because the mecab program
I used to determine the tokenization of the example phrases produced
different tokens than the tokenization library used in LanguageTool.

It took me quite a while to find out why the errors in my bad example
sentences were not found. Having the tokenized readings of the bad
example sentences in the assertion message makes it easier to identify
issues with tokenization.

I realize that this change may be less useful for languages with easier
tokenization, but I still think it would be worth discussing whether
it makes sense to include this output. Maybe there is other
functionality in LanguageTool that I do not know of that would make
the suggested changes unnecessary?

If including the analyzed token readings is useful in other assertion
messages as well, it may also be better to refactor the token reading
code into its own function and make it less ad hoc.
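
For the sake of discussion, that refactoring could look roughly like this
(a sketch only: the method name and placement are made up; it just reuses
the same calls as in the patch below):

  // Hypothetical helper (name invented): collects the analyzed token
  // readings of a sentence into a string for use in assertion messages.
  private static String formatTokenReadings(JLanguageTool languageTool, String sentence)
      throws IOException {
    final AnalyzedSentence analyzedSentence = languageTool.getAnalyzedSentence(sentence);
    final StringBuilder sb = new StringBuilder("Analyzed token readings:");
    for (AnalyzedTokenReadings atr : analyzedSentence.getTokens()) {
      sb.append(' ').append(atr.toString());
    }
    return sb.toString();
  }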

What do you think?

(If you want to include the patch, I can also open a pull request on
GitHub if you prefer.)


Cheers,

Silvan

 .../org/languagetool/rules/patterns/PatternRuleTest.java  | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
index 0d5580d..d279b36 100644
--- a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
+++ b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
@@ -22,6 +22,7 @@ import java.io.File;
 import java.io.IOException;
 import java.io.InputStream;
 import java.lang.String;
+import java.lang.StringBuilder;
 import java.util.*;
 
 import junit.framework.TestCase;
@@ -281,9 +282,17 @@ public class PatternRuleTest extends TestCase {
   }
 
   if (!rule.isWithComplexPhrase()) {
-    assertTrue(lang + ": Did expect one error in: \"" + badSentence
-        + "\" (Rule: " + rule + "), but found " + matches.size()
-        + ". Additional info:" + rule.getMessage() + ", Matches: " + matches, matches.size() == 1);
+    if (matches.size() != 1) {
+      final AnalyzedSentence analyzedSentence = languageTool.getAnalyzedSentence(badSentence);
+      final AnalyzedTokenReadings[] analyzedTR = analyzedSentence.getTokens();
+      final StringBuilder sb = new StringBuilder("Analyzed token readings:");
+      for (AnalyzedTokenReadings atr : analyzedTR) {
+        sb.append(" " + atr.toString());
+      }
+      assertTrue(lang + ": Did expect one error in: \"" + badSentence
+          + "\" (Rule: " + rule + "), but found " + matches.size()
+          + ". Additional info:" + rule.getMessage() + ", " + sb.toString() + ", Matches: " + matches, matches.size() == 1);
+    }
     assertEquals(lang
         + ": Incorrect match position markup (start) for rule " + rule + ", sentence: " + badSentence,
         expectedMatchStart, matches.get(0).getFromPos());
-- 
2.0.4



Re: [RFC][PATCH] Add analyzed token readings to failed bad sentence test output

2014-08-10 Thread Daniel Naber
On 2014-08-10 17:37, Silvan Jegen wrote:

 If including the analyzed token readings is useful in other assertion
 messages as well, it may also be better to refactor the token reading
 code into its own function and make it less ad hoc.
 
 What do you think?

Thanks, I have committed your changes (plus some more newlines, to make 
the output easier to read).

Regards
  Daniel




chunks in exceptions

2014-08-10 Thread Andriy Rysin
Hi all

I was writing a rule where I had to catch a phrase whose last word is a
noun, but only if that noun is not part of an adverb chunk (with another
word following). The best way to do that seems to be using an adverb chunk
in an exception, but it looks like this is not supported. So after multiple
experiments and reading our wiki pages, I wrote a rule like the one below;
it works, but it generates a warning:

Running pattern rule tests for Ukrainian... The Ukrainian rule:
PASSIVE_PREDICATE:1 (exception in token [4]), token [4], contains
"тією мірою" that contains token separators, so can't possibly be
matched.

The current solution looks ugly, and although it works, I'd like to make it
right. So the question is: is there a reason why chunks are not
supported in exceptions? And if there is a reason we should not support
chunks in exceptions, what is the best way to write the rule below
(without splitting it in two or complicating it even more)?

Thanks
Andriy

   <pattern>
     <token postag_regexp="yes" postag="(noun|pron).*v_zna.*"/>
     <marker>
       <token postag="impers"/>
       <token postag_regexp="yes" postag="adj:.*v_oru.*" min="0" max="1"/>
       <token postag_regexp="yes" postag="noun.*v_oru.*:ist.*|pron.*v_oru.*">
         <exception postag_regexp="yes" postag="noun.*v_oru$"/>
         <exception inflected="yes">тією мірою</exception>
       </token>
     </marker>
   </pattern>
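
For comparison, what I would have liked to write instead of the inflected
workaround is something like this (hypothetical: exceptions do not accept
chunk attributes today, and the chunk value is only illustrative):

   <token postag_regexp="yes" postag="noun.*v_oru.*:ist.*|pron.*v_oru.*">
     <exception postag_regexp="yes" postag="noun.*v_oru$"/>
     <!-- hypothetical syntax: chunk is currently not a supported attribute on exception -->
     <exception chunk="I-ADVP"/>
   </token>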
