[ https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798111#comment-17798111 ]
Martin Wiesner commented on OPENNLP-1163:
-----------------------------------------

Thanks [~StarWalker777] for reporting the issue and providing valuable input by attaching those files.
Will work on it soon. Projected to be included in the 2.3.2 release.

> Sentence detector doesn't spot abbreviations next to punctuation
> ----------------------------------------------------------------
>
> Key: OPENNLP-1163
> URL: https://issues.apache.org/jira/browse/OPENNLP-1163
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.8.3
> Environment: Reproduced on Windows 10
> Reporter: Gabriele Vaccari
> Assignee: Martin Wiesner
> Priority: Critical
> Labels: abbreviation, sentence-detector
> Fix For: 2.3.2
>
> Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>
> The Sentence Detector trained with an abbreviations list (see attachment)
> fails to spot them within a text if they are preceded by a punctuation mark.
> In Italian, words starting with a vowel may be preceded by an article plus
> apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term
> ARTICOLO, especially in legal text, is frequently abbreviated to ART.
>
> Repro steps:
> 1) add the "art." abbreviation in the abbreviations XML file (enclosed,
> ctrl+F "art.", case insensitive)
> 2) train a model for the Italian language (training set enclosed) with the
> following command:
> opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
> 3) run the model against a test text with the following command:
> opennlp SentenceDetector it-sen.bin < test.txt
>
> Even though the abbreviation "art." was included in the XML file, the
> sentence detector breaks the sentence on instances of this abbreviation
> preceded by article and apostrophe (e.g. nell'art., dall'art., dell'art.).
> See also the enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
> The issue isn't observed if the apostrophe (single quote) is replaced by a
> space character.
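
A small diagnostic sketch for the symptom described in the report: it scans sentence detector output, assumed to contain one detected sentence per line, and flags lines that end in a known abbreviation, optionally preceded by an elided article and apostrophe (e.g. dall'art.). The helper name find_suspect_breaks, the one-entry abbreviation set, and the sample sentences are illustrative assumptions, not taken from the attached files, and this is not part of OpenNLP.

```python
import re

# Illustrative subset only; the real list would come from the abbreviation dictionary.
ABBREVIATIONS = {"art."}


def find_suspect_breaks(lines, abbreviations=ABBREVIATIONS):
    """Return (line_number, last_token) for output lines ending in a known abbreviation."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens:
            continue  # skip empty output lines
        last = tokens[-1]
        # Drop an elided article such as "dall'" or "nell'" before the lookup.
        core = re.split(r"['\u2019]", last)[-1].lower()
        if core in abbreviations:
            suspects.append((number, last))
    return suspects


# Made-up sample output, one detected "sentence" per line.
sample = [
    "Come previsto dall'art.",
    "5 del decreto, il termine resta sospeso.",
    "Si veda l art. 7 per i dettagli.",
]
print(find_suspect_breaks(sample))
```

A line flagged this way is only a candidate for a wrong split; a sentence can legitimately end in an abbreviation, so each hit still needs to be checked against the input text.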