[ https://issues.apache.org/jira/browse/OPENNLP-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794624#comment-17794624 ]

ASF GitHub Bot commented on OPENNLP-1479:
-----------------------------------------

mawiesne commented on code in PR #559:
URL: https://github.com/apache/opennlp/pull/559#discussion_r1420232094


##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/TokenizerFactoryTest.java:
##########
@@ -163,6 +163,106 @@ void testCustomPatternAndAlphaOpt() throws IOException {
     Assertions.assertTrue(factory.isUseAlphaNumericOptimization());
   }
 
+  void checkCustomPatternForTokenizerME(String lang, String pattern, String sentence,
+      int expectedNumTokens) throws IOException {
+
+    TokenizerModel model = train(new TokenizerFactory(lang, null, true,
+        Pattern.compile(pattern)));
+
+    TokenizerME tokenizer = new TokenizerME(model);
+    String[] tokens = tokenizer.tokenize(sentence);
+
+    Assertions.assertEquals(expectedNumTokens, tokens.length);
+    String[] sentSplit = sentence.replaceAll("\\.", " .").split(" ");
+    for (int i = 0; i < sentSplit.length; i++) {
+      Assertions.assertEquals(sentSplit[i], tokens[i]);
+    }
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMEDeu() throws IOException {
+    String lang = "deu";
+    String pattern = "^[A-Za-z0-9äéöüÄÉÖÜß]+$";
+    String sentence = "Ich wähle den auf S. 183 ff. mitgeteilten Traum von der botanischen Monographie.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 16);

Review Comment:
   Side note: 
   -
   Given the German example with two abbreviated words ("S." and "ff."), one would expect only 14 tokens. However, TokenizerME does not handle abbreviations at all at the moment. For this reason, I've opened a new Jira issue and propose an enhancement similar to `SentenceDetectorME`, in which abbreviations are handled correctly; see OPENNLP-570.
   
   @l-ma Keep it as is for now. We can take it. The new issue [OPENNLP-1525](https://issues.apache.org/jira/browse/OPENNLP-1525) references your improvements with these tests, and the aim is to change "16" to "14" in a future PR.
   
   Hooray, you uncovered a missing feature! 🙃
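   To illustrate why the count would drop from 16 to 14: abbreviation handling could work as a post-processing pass that re-attaches a split-off "." to tokens found in an abbreviation dictionary, so "S." and "ff." re-merge while the sentence-final "." stays a separate token. A minimal self-contained sketch (the class, method name, and inline dictionary are hypothetical for illustration, not OpenNLP API):

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   public class AbbreviationMerge {

     // Hypothetical post-processing: re-attach a trailing "." to tokens
     // listed in an abbreviation dictionary, analogous to how
     // SentenceDetectorME consults abbreviations at sentence boundaries.
     static String[] mergeAbbreviations(String[] tokens, Set<String> abbrevs) {
       List<String> out = new ArrayList<>();
       for (int i = 0; i < tokens.length; i++) {
         if (i + 1 < tokens.length && ".".equals(tokens[i + 1])
             && abbrevs.contains(tokens[i] + ".")) {
           out.add(tokens[i] + ".");
           i++; // skip the separate "." token
         } else {
           out.add(tokens[i]);
         }
       }
       return out.toArray(new String[0]);
     }

     public static void main(String[] args) {
       // The 16 tokens as TokenizerME currently produces them
       String[] tokens = ("Ich wähle den auf S . 183 ff . mitgeteilten "
           + "Traum von der botanischen Monographie .").split(" ");
       String[] merged = mergeAbbreviations(tokens,
           new HashSet<>(Arrays.asList("S.", "ff.")));
       System.out.println(tokens.length + " -> " + merged.length);
     }
   }
   ```

   With "S." and "ff." in the dictionary the 16 tokens collapse to 14, matching the expectation stated above; the final "." remains separate because "Monographie." is not an abbreviation.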





> Write better tests for pattern verification (tokenizers)
> --------------------------------------------------------
>
>                 Key: OPENNLP-1479
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1479
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: 2.1.1
>            Reporter: Bruno P. Kinoshita
>            Assignee: Lara Marinov
>            Priority: Major
>             Fix For: 2.3.2
>
>
> From [https://github.com/apache/opennlp/pull/516#issuecomment-1455015772]
> At the moment our tests verify that the tokenizer objects are created 
> correctly (i.e. tests getters and setters, constructor, etc.), without 
> verifying the actual behavior when used in conjunction with other classes 
> (factory, tokenizer, trainers, etc).
> It would be best to test the patterns used in the factories for different 
> languages with some interesting sample data (maybe something from Project 
> Gutenberg, open-source news sites, etc.).
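As a self-contained illustration of the kind of behavioral check the issue asks for, the custom pattern from the German test in PR #559 can be exercised directly against sample tokens with plain `java.util.regex` (no OpenNLP classes involved; the class name here is just for the sketch):

```java
import java.util.regex.Pattern;

public class GermanPatternCheck {
  public static void main(String[] args) {
    // Custom alpha-numeric pattern from the German test in PR #559
    Pattern p = Pattern.compile("^[A-Za-z0-9äéöüÄÉÖÜß]+$");
    // Word tokens (including umlauts) and numbers should match;
    // punctuation should not.
    System.out.println(p.matcher("wähle").matches());
    System.out.println(p.matcher("183").matches());
    System.out.println(p.matcher(".").matches());
  }
}
```

Checks like this pin down what the pattern actually accepts before it is handed to a `TokenizerFactory`, which is exactly the behavioral coverage the issue description calls for.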



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
