[jira] [Updated] (OPENNLP-1163) Sentence detector doesn't spot abbreviations next to punctuation

2023-12-21 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1163:

Priority: Major  (was: Critical)

> Sentence detector doesn't spot abbreviations next to punctuation
> 
>
> Key: OPENNLP-1163
> URL: https://issues.apache.org/jira/browse/OPENNLP-1163
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.8.3
> Environment: Reproduced on Windows 10
> Reporter: Gabriele Vaccari
> Assignee: Martin Wiesner
> Priority: Major
> Labels: abbreviation, sentence-detector
> Fix For: 2.3.2
>
> Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
> Time Spent: 1.75h
> Remaining Estimate: 0h
>
> The Sentence Detector, trained with an abbreviations list (see attachment), 
> fails to recognize those abbreviations in a text when they are immediately 
> preceded by a punctuation mark. 
> In Italian, words starting with a vowel may be preceded by an article plus 
> an apostrophe (single quote). Example: L'ARTICOLO (the article). The term 
> ARTICOLO, especially in legal texts, is frequently abbreviated to ART.
> Repro steps:
> 1) add the "art." abbreviation to the abbreviations XML file (enclosed; 
> search for "art.", case-insensitive)
> 2) train a model for the Italian language (training set enclosed) with the 
> following command:
> opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
> 3) run the model against a test text with the following command:
> opennlp SentenceDetector it-sen.bin < test.txt
> Even though the abbreviation "art." is included in the XML file, the 
> sentence detector breaks the sentence at instances of this abbreviation 
> preceded by an article and an apostrophe (e.g. nell'art., dall'art., dell'art.). 
> See also the enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
> The issue isn't observed if the apostrophe (single quote) is replaced by a 
> space character.
>  
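
To make the reported symptom concrete, here is a minimal Python sketch. It is
not OpenNLP's implementation; it only assumes, for illustration, that an
abbreviation lookup which uses whitespace as the sole left token boundary
would miss "art." inside "dall'art.". The abbreviation entries, function names
and sample text are all hypothetical and are not taken from the attachments.

```python
import re

# Hypothetical abbreviation entries; the real list is the attached it-abbr.txt.
ABBREVIATIONS = {"art.", "artt.", "pag."}


def is_abbreviation_naive(text, period_index):
    """Look up the whitespace-delimited token that ends at the period."""
    start = text.rfind(" ", 0, period_index) + 1
    token = text[start:period_index + 1].lower()
    return token in ABBREVIATIONS


def is_abbreviation_punct_aware(text, period_index):
    """Treat any non-letter (e.g. an apostrophe) as the left token boundary."""
    start = period_index
    while start > 0 and text[start - 1].isalpha():
        start -= 1
    token = text[start:period_index + 1].lower()
    return token in ABBREVIATIONS


def split_sentences(text, is_abbreviation):
    """Split after each period at end of text or before whitespace,
    unless the period terminates a known abbreviation."""
    sentences, begin = [], 0
    for match in re.finditer(r"\.", text):
        i = match.start()
        at_end = i == len(text) - 1
        followed_by_space = not at_end and text[i + 1].isspace()
        if (at_end or followed_by_space) and not is_abbreviation(text, i):
            sentences.append(text[begin:i + 1].strip())
            begin = i + 1
    tail = text[begin:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Hypothetical sample in the style of Italian legal text.
SAMPLE = "Si veda quanto previsto dall'art. 5 del decreto. Resta fermo l'art. 7."

if __name__ == "__main__":
    for sentence in split_sentences(SAMPLE, is_abbreviation_naive):
        print("naive:", sentence)
    for sentence in split_sentences(SAMPLE, is_abbreviation_punct_aware):
        print("aware:", sentence)
```

With the whitespace-only lookup the sample is cut after "dall'art." and
"l'art.", while the punctuation-aware lookup keeps both sentences intact. This
matches the observation in the report that replacing the apostrophe with a
space avoids the problem, since the lookup then sees "art." as its own token.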



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (OPENNLP-1163) Sentence detector doesn't spot abbreviations next to punctuation

2023-12-18 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1163:

Fix Version/s: 2.3.2

> Sentence detector doesn't spot abbreviations next to punctuation
> 
>
> Key: OPENNLP-1163
> URL: https://issues.apache.org/jira/browse/OPENNLP-1163
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.8.3
> Environment: Reproduced on Windows 10
> Reporter: Gabriele Vaccari
> Priority: Critical
> Labels: abbreviation, sentence-detector
> Fix For: 2.3.2
>
> Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>





[jira] [Updated] (OPENNLP-1163) Sentence detector doesn't spot abbreviations next to punctuation

2023-02-27 Thread Richard Zowalla (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Zowalla updated OPENNLP-1163:
Fix Version/s: (was: 2.1.2)

> Sentence detector doesn't spot abbreviations next to punctuation
> 
>
> Key: OPENNLP-1163
> URL: https://issues.apache.org/jira/browse/OPENNLP-1163
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.8.3
> Environment: Reproduced on Windows 10
> Reporter: Gabriele Vaccari
> Priority: Critical
> Labels: abbreviation, sentence-detector
> Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>





[jira] [Updated] (OPENNLP-1163) Sentence detector doesn't spot abbreviations next to punctuation

2017-11-30 Thread Gabriele Vaccari (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriele Vaccari updated OPENNLP-1163:
Description: 
The Sentence Detector trained with an abbreviations list (see attachment) fails 
to spot them within a text if they are preceded by a punctuation mark. 

In Italian, words starting with a vowel may be preceded by an article plus 
apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term 
ARTICOLO, especially in legal text, is frequently abbreviated to ART.

Repro steps:
1) add the "art." abbreviation in the abbreviations XML file (enclosed, ctrl+F 
"art.", case insensitive)
2) train a model for the Italian language (training set enclosed) with the 
following command:
opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
3) run the model against a test text with the following command:
opennlp SentenceDetector it-sen.bin < test.txt

Even though the abbreviation "art." was included in the XML file, the sentence 
detector breaks the sentence on instances of this abbreviation preceded by 
article and apostrophe (e.g. nell'art., dall'art., dell'art.). See also the 
enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
The issue isn't observed if the apostrophe (single quote) is replaced by a 
space character.



 


  was:
The Sentence Detector trained with an abbreviations list (see attachment) fails 
to spot them within a text if they are preceded by a punctuation mark. 

In Italian, words starting with a vowel may be preceded by an article plus 
apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term 
ARTICOLO, especially in legal text, is frequently abbreviated to ART.

Repro steps:
1) add the "art." abbreviation in the abbreviations XML file (enclosed, ctrl+F 
"art.", case insensitive)
2) train a model for the Italian language (training set enclosed) with the 
following command:
opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
3) run the model against a test text with the following command:
opennlp SentenceDetector it-sen.bin < test.txt

Even though the abbreviation "art." was included in the XML file, the sentence 
detector breaks the sentence on instances of this abbreviation preceded by 
article and apostrophe (e.g. nell'art., dall'art., dell'art.). See also the 
enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.



 



> Sentence detector doesn't spot abbreviations next to punctuation
> 
>
> Key: OPENNLP-1163
> URL: https://issues.apache.org/jira/browse/OPENNLP-1163
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.8.3
> Environment: Reproduced on Windows 10
> Reporter: Gabriele Vaccari
> Priority: Critical
> Labels: abbreviation, sentence-detector
> Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (OPENNLP-1163) Sentence detector doesn't spot abbreviations next to punctuation

2017-11-30 Thread Gabriele Vaccari (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriele Vaccari updated OPENNLP-1163:
Description: 
The Sentence Detector trained with an abbreviations list (see attachment) fails 
to spot them within a text if they are preceded by a punctuation mark. 

In Italian, words starting with a vowel may be preceded by an article plus 
apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term 
ARTICOLO, especially in legal text, is frequently abbreviated to ART.

Repro steps:
1) add the "art." abbreviation in the abbreviations XML file (enclosed, ctrl+F 
"art.", case insensitive)
2) train a model for the Italian language (training set enclosed) with the 
following command:
opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
3) run the model against a test text with the following command:
opennlp SentenceDetector it-sen.bin < test.txt

Even though the abbreviation "art." was included in the XML file, the sentence 
detector breaks the sentence on instances of this abbreviation preceded by 
article and apostrophe (e.g. nell'art., dall'art., dell'art.). See also the 
enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.



 


  was:
The Sentence Detector trained with an abbreviations list (see attachment) fails 
to spot them within a text if they are preceded by a punctuation mark. 

In Italian, words starting with a vowel may be preceded by an article plus 
apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term 
ARTICOLO, especially in legal text, is frequently abbreviated to ART.

Repro steps:
1) add the ART. abbreviation in the abbreviations XML file (enclosed, ctrl+F 
"art.")
2) train a model for the Italian language (training set enclosed) with the 
following command:
opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
3) run the model against a test text with the following command:
opennlp SentenceDetector it-sen.bin < test.txt

Even though the abbreviation "art." was included in the XML file, the sentence 
detector breaks the sentence on instances of this abbreviation preceded by 
article and apostrophe (e.g. nell'art., dall'art., dell'art.). See also the 
enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.



 



> Sentence detector doesn't spot abbreviations next to punctuation
> 
>
> Key: OPENNLP-1163
> URL: https://issues.apache.org/jira/browse/OPENNLP-1163
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.8.3
> Environment: Reproduced on Windows 10
> Reporter: Gabriele Vaccari
> Priority: Critical
> Labels: abbreviation, sentence-detector
> Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>


