[ https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839131#comment-17839131 ]
ASF GitHub Bot commented on OPENNLP-1554:
-----------------------------------------

kinow commented on code in PR #597:
URL: https://github.com/apache/opennlp/pull/597#discussion_r1573009751


##########
opennlp-tools/src/test/java/opennlp/tools/sentdetect/AbstractSentenceDetectorTest.java:
##########
@@ -30,13 +30,16 @@ public abstract class AbstractSentenceDetectorTest {
-  protected static final Locale LOCALE_SPANISH = new Locale("es");
+  protected static final Locale LOCALE_DUTCH = new Locale("nl");
   protected static final Locale LOCALE_POLISH = new Locale("pl");
   protected static final Locale LOCALE_PORTUGUESE = new Locale("pt");
+  protected static final Locale LOCALE_SPANISH = new Locale("es");

Review Comment:
   The small details that normally pass unnoticed! Thanks for sorting alphabetically :clap:


##########
opennlp-tools/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEDutchTest.java:
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect;
+
+import java.io.IOException;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.dictionary.Dictionary;
+
+/**
+ * Tests for the {@link SentenceDetectorME} class.
+ * <p>
+ * Demonstrates OPENNLP-1554.
+ * <p>
+ * In this context, well-known known Dutch (nl_NL) abbreviations must be respected,
+ * so that words abbreviated with one or more '.' characters do not
+ * result in incorrect sentence boundaries.
+ * <p>
+ * See:
+ * <a href="https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-1554">OPENNLP-1554</a>
+ */
+public class SentenceDetectorMEDutchTest extends AbstractSentenceDetectorTest {
+
+  private static final char[] EOS_CHARS = {'.', '?', '!'};
+
+  private static SentenceModel sentdetectModel;
+
+  @BeforeAll
+  public static void prepareResources() throws IOException {
+    Dictionary abbreviationDict = loadAbbDictionary(LOCALE_DUTCH);
+    SentenceDetectorFactory factory = new SentenceDetectorFactory(
+        "dut", true, abbreviationDict, EOS_CHARS);
+    sentdetectModel = train(factory, LOCALE_DUTCH);
+    Assertions.assertNotNull(sentdetectModel);
+    Assertions.assertEquals("dut", sentdetectModel.getLanguage());
+  }
+
+  // Example taken from 'Sentences_NL.txt'
+  @Test
+  void testSentDetectWithInlineAbbreviationsEx1() {
+    final String sent1 = "Een droom, tot de vorming waarvan een bijzonder sterke compressie " +
+        "heeft bijgedragen, zal het meest gunstige materiaal zijn voor dit onderzoek.";
+    // Here we have one abbreviations "p." => pagina (page)
+    final String sent2 = "Ik kies voor de droom van de botanische monografie die " +
+        "op p. 183 en volgende wordt beschreven.";
+
+    SentenceDetectorME sentDetect = new SentenceDetectorME(sentdetectModel);
+    String sampleSentences = sent1 + " " + sent2;
+    String[] sents = sentDetect.sentDetect(sampleSentences);
+    Assertions.assertEquals(2, sents.length);
+    Assertions.assertEquals(sent1, sents[0]);
+    Assertions.assertEquals(sent2, sents[1]);
+    double[] probs = sentDetect.getSentenceProbabilities();
+    Assertions.assertEquals(2, probs.length);
+  }
+
+  // Reduced example taken from 'Sentences_NL.txt'
+  @Test
+  void testSentDetectWithInlineAbbreviationsEx2() {
+    // Here we have one abbreviations: "d.w.z." = dat wil zeggen (eng.: that is to say)
+    final String sent1 = "Met het oog op de overvloed aan ideeën die de analyse op elk " +
+        "afzonderlijk element van de droominhoud brengt, zullen sommige lezers twijfels " +
+        "hebben over het principe of alles wat later tijdens de analyse in je opkomt, " +
+        "tot de droomgedachten gerekend mag worden, d.w.z. of aangenomen mag worden " +
+        "dat al deze gedachten al tijdens de slaaptoestand actief waren en bijdroegen " +
+        "aan de vorming van de droom?";

Review Comment:
   I was reviewing a pull request in Jena today, and noticed the `"""` for multiline, which I normally use for Python, but never for Java. Maybe we can use that in the future to simplify instead of `"strings strings strings " +` (I think `"""` is Java 15 > ?).


> Add Dutch abbreviation dictionary
> ---------------------------------
>
> Key: OPENNLP-1554
> URL: https://issues.apache.org/jira/browse/OPENNLP-1554
> Project: OpenNLP
> Issue Type: Improvement
> Components: Sentence Detector, Tokenizer
> Affects Versions: 2.3.2
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Minor
> Fix For: 2.3.3
>
>
> Similar to the addition in OPENNLP-1526, an abbreviation dictionary for Dutch
> sentence detection and tokenisation might be beneficial.
> Aims:
> Create and add a new file {{abb_NL.xml}} in {{opennlp-tools/lang/nl}}
> Add basic set of test cases

--
This message was sent by Atlassian Jira (v8.20.10#820010)
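
On the `"""` suggestion in the second review comment: text blocks became a standard Java feature in Java 15 (JEP 378). The following is only a minimal sketch of how one of the concatenated test strings might look as a text block; the class name `TextBlockSketch` and its constants are hypothetical and not part of the PR, and whether the project's minimum Java version permits text blocks is not stated in this thread. A backslash at the end of a line suppresses the line break, so the result stays a single-line sentence as the sentence detector test expects.

```java
// Hypothetical illustration only; not code from PR #597.
class TextBlockSketch {

  // Concatenation style, as used in the quoted test.
  static final String CONCATENATED = "Ik kies voor de droom van de botanische monografie die " +
      "op p. 183 en volgende wordt beschreven.";

  // Text block style (Java 15+). The trailing '\' joins the two source lines
  // without inserting a newline; the space before it is kept.
  // Closing the delimiter on the last content line avoids a trailing newline.
  static final String TEXT_BLOCK = """
      Ik kies voor de droom van de botanische monografie die \
      op p. 183 en volgende wordt beschreven.""";

  public static void main(String[] args) {
    // Both forms should denote the same string.
    System.out.println(CONCATENATED.equals(TEXT_BLOCK));
  }
}
```

Without the trailing backslash, the text block would contain a line break after "die", which would change the input handed to `sentDetect` and could affect the expected sentence boundaries.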