[ https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799135#comment-17799135 ]
ASF GitHub Bot commented on OPENNLP-1163:
-----------------------------------------

mawiesne opened a new pull request, #570:
URL: https://github.com/apache/opennlp/pull/570

Change
-
- verifies `OPENNLP-1163` is no longer a concern, thanks to [OPENNLP-793](https://issues.apache.org/jira/browse/OPENNLP-793) being resolved with OpenNLP version 2.3.1
- adds related test case to SentenceDetectorMEItalianTest to verify abbreviated "articolo" (art.) is handled correctly
- enhances Italian corpus (see `Sentences_IT.txt`) introduced in [OPENNLP-1530](https://issues.apache.org/jira/browse/OPENNLP-1530) with further examples for use of "nell'art."
- resolves `OPENNLP-1163`

Tasks
-
Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically main)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
- [x] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

> Sentence detector doesn't spot abbreviations next to punctuation
> ----------------------------------------------------------------
>
> Key: OPENNLP-1163
> URL: https://issues.apache.org/jira/browse/OPENNLP-1163
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.8.3
> Environment: Reproduced on Windows 10
> Reporter: Gabriele Vaccari
> Assignee: Martin Wiesner
> Priority: Critical
> Labels: abbreviation, sentence-detector
> Fix For: 2.3.2
>
> Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>
> The Sentence Detector trained with an abbreviations list (see attachment)
> fails to spot them within a text if they are preceded by a punctuation mark.
> In Italian, words starting with a vowel may be preceded by an article plus
> apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term
> ARTICOLO, especially in legal text, is frequently abbreviated to ART.
> Repro steps:
> 1) add the "art." abbreviation in the abbreviations XML file (enclosed,
> ctrl+F "art.", case insensitive)
> 2) train a model for the Italian language (training set enclosed) with the
> following command:
> opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
> 3) run the model against a test text with the following command:
> opennlp SentenceDetector it-sen.bin < test.txt
> Even though the abbreviation "art." was included in the XML file, the
> sentence detector breaks the sentence on instances of this abbreviation
> preceded by article and apostrophe (e.g. nell'art., dall'art., dell'art.).
> See also the enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
> The issue isn't observed if the apostrophe (single quote) is replaced by a
> space character.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
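
The failure mode reported in the issue can be illustrated with a small, self-contained Java sketch. This is not OpenNLP's implementation, and it does not show how OPENNLP-793 was fixed. It only shows one plausible way an exact whole-token abbreviation lookup can miss "art." when it is attached to an elided article, which is consistent with the observation that the problem goes away once the apostrophe is replaced by a space. The class name, method names, and the one-entry abbreviation set are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT OpenNLP code.
final class ElisionAbbreviationSketch {

  // Assumed abbreviation list; the report's it-abbr.txt is said to contain "art.".
  static final Set<String> ABBREVIATIONS = Set.of("art.");

  // Naive lookup: the whole whitespace-delimited token must equal an abbreviation.
  // "nell'art." is a single token, so it does not match "art.".
  static boolean isAbbreviationNaive(String token) {
    return ABBREVIATIONS.contains(token.toLowerCase(Locale.ROOT));
  }

  // Elision-aware lookup: additionally test the part after the last apostrophe
  // (ASCII ' or typographic U+2019), so "nell'art." is checked as "art.".
  static boolean isAbbreviationElisionAware(String token) {
    String lower = token.toLowerCase(Locale.ROOT);
    int apostrophe = Math.max(lower.lastIndexOf('\''), lower.lastIndexOf('\u2019'));
    String tail = apostrophe >= 0 ? lower.substring(apostrophe + 1) : lower;
    return ABBREVIATIONS.contains(lower) || ABBREVIATIONS.contains(tail);
  }
}
```

With these assumptions, the naive lookup accepts "art." but rejects "nell'art.", while the elision-aware lookup accepts both and still rejects an ordinary sentence-final word such as "fine.".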