[ 
https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799135#comment-17799135
 ] 

ASF GitHub Bot commented on OPENNLP-1163:
-----------------------------------------

mawiesne opened a new pull request, #570:
URL: https://github.com/apache/opennlp/pull/570

   Change
   -
   - verifies that `OPENNLP-1163` is no longer a concern, thanks to [OPENNLP-793](https://issues.apache.org/jira/browse/OPENNLP-793) being resolved with OpenNLP version 2.3.1
   - adds a related test case to `SentenceDetectorMEItalianTest` to verify that the abbreviated "articolo" (art.) is handled correctly
   - enhances the Italian corpus (see `Sentences_IT.txt`) introduced in [OPENNLP-1530](https://issues.apache.org/jira/browse/OPENNLP-1530) with further examples of the use of "nell'art."
   - resolves `OPENNLP-1163`
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
   
   - [x] Has your PR been rebased against the latest commit within the target branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via `mvn clean install` at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that the format looks appropriate for the output in which it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Sentence detector doesn't spot abbreviations next to punctuation
> ----------------------------------------------------------------
>
>                 Key: OPENNLP-1163
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1163
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector
>    Affects Versions: 1.8.3
>         Environment: Reproduced on Windows 10
>            Reporter: Gabriele Vaccari
>            Assignee: Martin Wiesner
>            Priority: Critical
>              Labels: abbreviation, sentence-detector
>             Fix For: 2.3.2
>
>         Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>
> The Sentence Detector trained with an abbreviations list (see attachment) 
> fails to spot them within a text if they are preceded by a punctuation mark. 
> In Italian, words starting with a vowel may be preceded by an article plus 
> apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term 
> ARTICOLO, especially in legal text, is frequently abbreviated to ART.
> Repro steps:
> 1) add the "art." abbreviation to the abbreviations XML file (enclosed; 
> search for "art.", case insensitive)
> 2) train a model for the Italian language (training set enclosed) with the 
> following command:
> opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
> 3) run the model against a test text with the following command:
> opennlp SentenceDetector it-sen.bin < test.txt
> Even though the abbreviation "art." was included in the XML file, the 
> sentence detector breaks the sentence on instances of this abbreviation 
> preceded by article and apostrophe (e.g. nell'art., dall'art., dell'art.). 
> See also the enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
> The issue isn't observed if the apostrophe (single quote) is replaced by a 
> space character.
>  
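
The behaviour described in the report can be illustrated with a small, self-contained Java sketch. This is not OpenNLP code: the class `AbbreviationLookupSketch`, both method names, and the three-entry abbreviation set are hypothetical, and the idea that the lookup compares the whole whitespace-delimited token is only an assumption about why an apostrophe prefix could hide an abbreviation. It shows that "nell'art." does not match a list containing "art." unless the elided article is stripped first.

```java
import java.util.Locale;
import java.util.Set;

public class AbbreviationLookupSketch {

    // Hypothetical abbreviation list; the real entries are in the attached it-abbr.txt.
    public static final Set<String> ABBREVIATIONS = Set.of("art.", "pag.", "sig.");

    /** Naive lookup: compares the whole whitespace-delimited token, case-insensitively. */
    public static boolean naiveIsAbbreviation(String token) {
        return ABBREVIATIONS.contains(token.toLowerCase(Locale.ROOT));
    }

    /** Lookup that first drops an elided prefix such as "nell'" or "dall'". */
    public static boolean apostropheAwareIsAbbreviation(String token) {
        // Accept both the ASCII apostrophe and the typographic one.
        int apostrophe = Math.max(token.lastIndexOf('\''), token.lastIndexOf('\u2019'));
        String candidate = apostrophe >= 0 ? token.substring(apostrophe + 1) : token;
        return naiveIsAbbreviation(candidate);
    }

    public static void main(String[] args) {
        // Print both lookup results for a few sample tokens.
        for (String token : new String[] {"art.", "nell'art.", "dall'art.", "fine."}) {
            System.out.printf("%-10s naive=%b apostropheAware=%b%n", token,
                naiveIsAbbreviation(token), apostropheAwareIsAbbreviation(token));
        }
    }
}
```

Under these assumptions, replacing the apostrophe with a space makes the naive lookup succeed as well, which would be consistent with the observation at the end of the report.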



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
