Re: [PR] OPENNLP-1479: Write better tests for pattern verification (tokenizers) (opennlp)

via GitHub Fri, 08 Dec 2023 05:35:49 -0800


kinow commented on code in PR #559:
URL: https://github.com/apache/opennlp/pull/559#discussion_r1420446484



##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/TokenizerFactoryTest.java:
##########
@@ -163,6 +163,106 @@ void testCustomPatternAndAlphaOpt() throws IOException {
     Assertions.assertTrue(factory.isUseAlphaNumericOptimization());
   }
 
+  void checkCustomPatternForTokenizerME(String lang, String pattern, String sentence,
+      int expectedNumTokens) throws IOException {
+
+    TokenizerModel model = train(new TokenizerFactory(lang, null, true,
+        Pattern.compile(pattern)));
+
+    TokenizerME tokenizer = new TokenizerME(model);
+    String[] tokens = tokenizer.tokenize(sentence);
+
+    Assertions.assertEquals(expectedNumTokens, tokens.length);
+    String[] sentSplit = sentence.replaceAll("\\.", " .").split(" ");
+    for (int i = 0; i < sentSplit.length; i++) {
+      Assertions.assertEquals(sentSplit[i], tokens[i]);
+    }
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMEDeu() throws IOException {
+    String lang = "deu";
+    String pattern = "^[A-Za-z0-9äéöüÄÉÖÜß]+$";
+    String sentence = "Ich wähle den auf S. 183 ff. mitgeteilten Traum von der botanischen Monographie.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 16);
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMEPor() throws IOException {
+    String lang = "por";
+    String pattern = "^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$";
+    String sentence = "Na floresta mágica a raposa dança com unicórnios felizes.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 10);
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMESpa() throws IOException {
+    String lang = "spa";
+    String pattern = "^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$";
+    String sentence = "En el verano los niños juegan en el parque y sus risas crean alegría.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 15);
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMECat() throws IOException {
+    String lang = "cat";
+    String pattern = "^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$";
+    String sentence = "Als xiuxiuejants avets el ós blau neda amb cignes i se ho passen bé.";

Review Comment:
   I only have basic (very basic at the moment) Catalan, but I think this would be written as
   
   “Als xiuxiuejants avets l'ós blau neda amb cignes i s'ho passen bé.”
   
   As far as I know, the article `el` and the pronoun `se` are elided with an apostrophe before a word that starts with a vowel, like in French: https://en.wikipedia.org/wiki/Catalan_orthography#Apostrophe
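
To make the effect of the apostrophized spelling on this test concrete, here is a minimal sketch using only the JDK. It mirrors the expected-token computation from `checkCustomPatternForTokenizerME` and checks each expected token against the Catalan pattern from the test. The class name `CatalanApostropheSketch` and the helper `expectedTokens` are hypothetical, and the sketch does not train or run `TokenizerME`, so it says nothing about how the model itself would split `l'ós`.

```java
import java.util.regex.Pattern;

public class CatalanApostropheSketch {
  // Same alphanumeric pattern as in testCustomPatternForTokenizerMECat.
  static final Pattern CAT =
      Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");

  // Mirrors how the test helper derives its expected tokens:
  // detach every '.' from the preceding word, then split on spaces.
  static String[] expectedTokens(String sentence) {
    return sentence.replaceAll("\\.", " .").split(" ");
  }

  public static void main(String[] args) {
    String corrected =
        "Als xiuxiuejants avets l'ós blau neda amb cignes i s'ho passen bé.";
    String[] expected = expectedTokens(corrected);
    System.out.println("expected token count: " + expected.length);
    for (String token : expected) {
      // Tokens containing an apostrophe fall outside the character class.
      System.out.println(token + " -> " + CAT.matcher(token).matches());
    }
  }
}
```

With the corrected sentence the helper would expect 13 tokens instead of 15, and `l'ós` and `s'ho` do not match the pattern because the apostrophe is not in its character class. So the expected count, and possibly the pattern, would need to be revisited if the sentence is changed.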



##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/TokenizerFactoryTest.java:
##########
@@ -163,6 +163,106 @@ void testCustomPatternAndAlphaOpt() throws IOException {
     Assertions.assertTrue(factory.isUseAlphaNumericOptimization());
   }
 
+  void checkCustomPatternForTokenizerME(String lang, String pattern, String sentence,
+      int expectedNumTokens) throws IOException {
+
+    TokenizerModel model = train(new TokenizerFactory(lang, null, true,
+        Pattern.compile(pattern)));
+
+    TokenizerME tokenizer = new TokenizerME(model);
+    String[] tokens = tokenizer.tokenize(sentence);
+
+    Assertions.assertEquals(expectedNumTokens, tokens.length);
+    String[] sentSplit = sentence.replaceAll("\\.", " .").split(" ");
+    for (int i = 0; i < sentSplit.length; i++) {
+      Assertions.assertEquals(sentSplit[i], tokens[i]);
+    }
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMEDeu() throws IOException {
+    String lang = "deu";
+    String pattern = "^[A-Za-z0-9äéöüÄÉÖÜß]+$";
+    String sentence = "Ich wähle den auf S. 183 ff. mitgeteilten Traum von der botanischen Monographie.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 16);

Review Comment:
   Thanks for creating the issue, and +1 to this improvement. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
