mawiesne commented on code in PR #559:
URL: https://github.com/apache/opennlp/pull/559#discussion_r1420232094
##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/TokenizerFactoryTest.java:
##########
@@ -163,6 +163,106 @@ void testCustomPatternAndAlphaOpt() throws IOException {
Assertions.assertTrue(factory.isUseAlphaNumericOptimization());
}
+ void checkCustomPatternForTokenizerME(String lang, String pattern, String sentence,
+     int expectedNumTokens) throws IOException {
+
+ TokenizerModel model = train(new TokenizerFactory(lang, null, true,
+ Pattern.compile(pattern)));
+
+ TokenizerME tokenizer = new TokenizerME(model);
+ String[] tokens = tokenizer.tokenize(sentence);
+
+ Assertions.assertEquals(expectedNumTokens, tokens.length);
+ String[] sentSplit = sentence.replaceAll("\\.", " .").split(" ");
+ for (int i = 0; i < sentSplit.length; i++) {
+ Assertions.assertEquals(sentSplit[i], tokens[i]);
+ }
+ }
+
+ @Test
+ void testCustomPatternForTokenizerMEDeu() throws IOException {
+ String lang = "deu";
+ String pattern = "^[A-Za-z0-9äéöüÄÉÖÜß]+$";
+ String sentence = "Ich wähle den auf S. 183 ff. mitgeteilten Traum von der botanischen Monographie.";
+ checkCustomPatternForTokenizerME(lang, pattern, sentence, 16);
Review Comment:
Side note:

Given the German example with its two abbreviated words ("S." and "ff."), one would expect only 14 tokens. However, `TokenizerME` does not handle abbreviations at all at the moment: each abbreviation is split into the word plus a separate "." token, which is why this test expects 16. For this reason, I've opened a new Jira issue proposing an enhancement similar to the one in `SentenceDetectorME`, where abbreviations are handled correctly, see OPENNLP-570.
@l-ma Keep it as is for now; we can take it. The new issue [OPENNLP-1525](https://issues.apache.org/jira/browse/OPENNLP-1525) references your improvements together with these tests, and the aim is to change "16" to "14" in a future PR.
Hooray, you uncovered a missing feature! 🙃
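For illustration only, a minimal sketch (not the final API) of how an abbreviation dictionary could be supplied once OPENNLP-1525 lands. The `TokenizerFactory` constructor used in the test above already accepts a `Dictionary` as its second argument (the test passes `null`), so the enhancement would mainly need to make `TokenizerME` actually consult it. The class name `AbbreviationSketch` and the assumption that the dictionary gets honored are hypothetical, not current behavior:

```java
import java.util.regex.Pattern;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.util.StringList;

// Hypothetical sketch: wiring an abbreviation dictionary into the factory,
// analogous to what SentenceDetectorFactory does for sentence splitting.
// TokenizerME currently ignores the dictionary; making it honor the entries
// is the subject of OPENNLP-1525.
public class AbbreviationSketch {

  static TokenizerFactory factoryWithAbbreviations() {
    // Abbreviations that should survive tokenization intact, e.g. "S." and "ff."
    Dictionary abbreviations = new Dictionary();
    abbreviations.put(new StringList("S."));
    abbreviations.put(new StringList("ff."));

    // Same custom pattern as in the test, but with the dictionary instead of null.
    Pattern alphaNumeric = Pattern.compile("^[A-Za-z0-9äéöüÄÉÖÜß]+$");
    return new TokenizerFactory("deu", abbreviations, true, alphaNumeric);
  }
}
```

With abbreviations honored, the German sentence would tokenize to 14 tokens, since "S." and "ff." would each remain a single token instead of splitting into the word plus ".".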