Re: [PR] OPENNLP-1163: Sentence detector doesn't spot abbreviations next to punctuation (opennlp)

2023-12-25 Thread via GitHub


mawiesne merged PR #570:
URL: https://github.com/apache/opennlp/pull/570


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] OPENNLP-1163: Sentence detector doesn't spot abbreviations next to punctuation (opennlp)

2023-12-20 Thread via GitHub


mawiesne opened a new pull request, #570:
URL: https://github.com/apache/opennlp/pull/570

   Change
   -
   - verifies `OPENNLP-1163` is no longer a concern, thanks to 
[OPENNLP-793](https://issues.apache.org/jira/browse/OPENNLP-793) being resolved 
with OpenNLP version 2.3.1
   - adds related test case to SentenceDetectorMEItalianTest to verify 
abbreviated "articolo" (art.) is handled correctly
   - enhances Italian corpus (see `Sentences_IT.txt`) introduced in 
[OPENNLP-1530](https://issues.apache.org/jira/browse/OPENNLP-1530) with further 
examples for use of "nell'art."
   - resolves `OPENNLP-1163` 
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   





Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)

2023-11-29 Thread via GitHub


rzo1 commented on code in PR #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r1408871552


##
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java:
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.io.IOException;
+import java.io.Reader;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import opennlp.tools.util.StringUtil;
+
+
+public class SentenceTokenizerME implements SentenceTokenizer {

Review Comment:
   +1 to @jzonthemtn comment






Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)

2023-11-29 Thread via GitHub
  private String sentence;
+
+  private int start;
+
+  private int end;
+
+  private CharSequence text;
+
+  private Reader reader;
+
+  private int bufferLength;
+
+  private LanguageTool languageTool;
+
+  private Matcher beforeMatcher;
+
+  private Matcher afterMatcher;
+
+  boolean found;
+
+  private Set breakSections;
+
+  private List noBreakSections;
+
+  public SentenceTokenizerME(LanguageTool languageTool, CharSequence text) {

Review Comment:
   I wonder if we can rebuild this to avoid creating a Tokenizer for every 
piece of text? Wouldn't it be of more value to provide the text as a method 
parameter and compute the stuff on the fly? It would also allow us to make it 
threadsafe in the future.
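
A minimal sketch of the suggestion above, assuming a hypothetical `StatelessSentenceSplitter` class and a deliberately simple break pattern (neither is taken from the PR). The text is passed to the method, all per-call state is local, and the compiled `Pattern` is the only shared field, so a single instance can be reused for many inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the PR's code: the text is a method parameter,
// so one instance can be reused for any number of inputs.
final class StatelessSentenceSplitter {

  // Compiled once; Pattern instances are immutable and safe to share.
  // Assumption: whitespace after '.', '!' or '?' marks a sentence break.
  private static final Pattern BREAK = Pattern.compile("(?<=[.!?])\\s+");

  // All per-call state (matcher, offsets, result list) lives in local variables.
  List<String> split(CharSequence text) {
    List<String> sentences = new ArrayList<>();
    Matcher matcher = BREAK.matcher(text);
    int start = 0;
    while (matcher.find()) {
      sentences.add(text.subSequence(start, matcher.start()).toString());
      start = matcher.end();
    }
    if (start < text.length()) {
      sentences.add(text.subSequence(start, text.length()).toString());
    }
    return sentences;
  }
}
```

Because nothing is stored in instance fields between calls, concurrent calls on the same instance would not interfere with each other in this sketch.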



##
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/Section.java:
##
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+public class Section {

Review Comment:
   record?



##
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.util.List;
+
+/**
+ * The interface for rule based sentence detector
+ */
+public interface SentenceTokenizer {

Review Comment:
   +1 (and the method would require a proper description). It is totally 
unclear what the provided method does/shall do from an implementor perspective.



##
opennlp-tools/src/test/java/opennlp/tools/sentdetect/segment/GoldenRulesTest.java:
##
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.junit.jupiter.api.Disabled;
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.util.featuregen.GeneratorFactory;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.fail;
+
+/**
+ * Thanks for the GoldenRules of
+ * <a href="https://github.com/diasks2/pragmatic_segmenter">pragmatic_segmenter</a>
+ */
+public class GoldenRulesTest {
+
+  public Cleaner cleaner = new Cleaner();
+
+  public List segment(String text) {
+    if (cleaner != null) {
+      text = cleaner.clean(text);
+    }
+
+    InputStream inputStream = getClass().getResourceAsStream(

Review Comment:
   we should close the stream + read it once and consume the cached result for 
every test run.
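
One way to do this, sketched with a hypothetical `CachedResource` helper (not part of the PR): try-with-resources closes the stream, and the decoded content is kept so that later calls do not reopen it. The UTF-8 encoding is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical helper: reads a resource once, closes the stream, caches the text.
final class CachedResource {

  private final Supplier<InputStream> source;
  private String cached; // stays null until the first successful read

  CachedResource(Supplier<InputStream> source) {
    this.source = source;
  }

  synchronized String get() {
    if (cached == null) {
      // try-with-resources closes the stream even if reading fails.
      try (InputStream in = source.get()) {
        if (in == null) {
          throw new IllegalStateException("resource not found");
        }
        cached = new String(in.readAllBytes(), StandardCharsets.UTF_8);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return cached;
  }
}
```

In the test class such a helper could sit in a `static final` field, created with a supplier that calls `getResourceAsStream` on the rules resource used in the PR, so every test method consumes the cached content.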



##
opennlp-tools/src/m

Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)

2023-11-28 Thread via GitHub


rzo1 commented on PR #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-1831371994

   > I would like to better understand the origins of the rules used. Does 
there need to be license attribution?
   
   @jzonthemtn It looks like this "golden-rules.txt" comes from 
https://github.com/diasks2/pragmatic_segmenter#the-golden-rules (at least, if 
we trust the textual description). It is also available for some other languages: 
https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt - the library 
itself, including this content, is MIT-licensed, so there is no compliance 
issue, but we would need to attribute it accordingly.





[GitHub] [opennlp] jzonthemtn merged pull request #447: OPENNLP-1407: Adding sentence detector to NameFinderDL.

2022-12-10 Thread GitBox


jzonthemtn merged PR #447:
URL: https://github.com/apache/opennlp/pull/447





[GitHub] [opennlp] jzonthemtn commented on pull request #447: OPENNLP-1407: Adding sentence detector to NameFinderDL.

2022-12-05 Thread GitBox


jzonthemtn commented on PR #447:
URL: https://github.com/apache/opennlp/pull/447#issuecomment-1338051998

   This adds a `sentenceDetector` to the `NameFinderDL` to give an option to do 
inference over individual sentences instead of the whole text at once.





[GitHub] [opennlp] jzonthemtn opened a new pull request, #447: OPENNLP-1407: Adding sentence detector to NameFinderDL.

2022-12-05 Thread GitBox


jzonthemtn opened a new pull request, #447:
URL: https://github.com/apache/opennlp/pull/447

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [X] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [X] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [X] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [X] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [X] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [X] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   





[GitHub] [opennlp] jzonthemtn commented on pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on pull request #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-1080742326


   I would like to better understand the origins of the rules used. Does there 
need to be license attribution?






[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on a change in pull request #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r836508689



##
File path: 
opennlp-tools/src/main/resources/opennlp/tools/sentdetect/segment/rules.xml
##
@@ -0,0 +1,131 @@
+

Review comment:
   What is the origin of these rules?








[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on a change in pull request #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r836500546



##
File path: 
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.io.IOException;
+import java.io.Reader;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import opennlp.tools.util.StringUtil;
+
+
+public class SentenceTokenizerME implements SentenceTokenizer {

Review comment:
   The `ME` name in the file names denotes "maximum entropy" as the method 
for the implementation. Since this implementation doesn't use a trained model, 
could it be named something like `RulesBasedSentenceDetector`? (Open to other 
names, too.)








[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on a change in pull request #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r836497822



##
File path: 
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.util.List;
+
+/**
+ * The interface for rule based sentence detector
+ */
+public interface SentenceTokenizer {

Review comment:
   There is an `opennlp.tools.sentdetect.SentenceDetector` interface that 
resembles this interface. Since the purpose of the two interfaces seem the same 
(to break text into sentences), is it possible to reuse the other interface?








[GitHub] [opennlp] jzonthemtn commented on pull request #390: OPENNLP-912: Rule based sentence detector

2021-02-07 Thread GitBox


jzonthemtn commented on pull request #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-774679660


   Thanks a lot for this contribution! This is something OpenNLP has needed. I 
will take a closer look.
   
   Built and tested successfully.
   
   ```
   Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
   Maven home: /opt/apache-maven
   Java version: 11.0.9.1, vendor: Ubuntu, runtime: 
/usr/lib/jvm/java-11-openjdk-amd64
   Default locale: en_US, platform encoding: UTF-8
   OS name: "linux", version: "5.8.0-40-generic", arch: "amd64", family: "unix"
   ```







Re:Re: Rule based sentence detector

2021-02-03 Thread Alan Wang



Hi all, I have created a PR in the GitHub repo: 
https://github.com/apache/opennlp/pull/390. If anyone has comments, please feel 
free to share them.




Thanks
Alan








At 2021-01-22 02:59:40, "William Colen"  wrote:
>Hi Alan,
>
>Do you have a PR for the implementation?
>
>Thank you,
>William
>
>On Tue, Jan 19, 2021 at 23:52, Alan Wang wrote:
>
>> Hi all,
>>
>> I created a rule based sentence detector for OpenNLP
>> <https://issues.apache.org/jira/browse/OPENNLP-912>.
>> There are two kinds of rules:
>>
>> 1. break rules: specifying the sentence break
>> 2. no-break rules: disallowing the sentence break
>>
>> All rules have two parts:
>>
>> Before the break
>> After the break
>>
>> The algorithm idea:
>>
>> Retrieves the break rules.
>> If none of the no-break rules is matched at the break location, the text
>> is marked as split and a new segment is created
>>
>> Features:
>>
>> Text Cleanup and Preprocessing
>> Easy to extend other languages
>>
>> Reference:
>>
>> This library use "Golden Rule" test of pragmatic_segmenter
>> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>
>>
>> Currently, the pass rate of test cases is 92.31%. The following test cases
>> fail: 39, 50, 53, 52
>> For details, see the attachment.
>>


[GitHub] [opennlp] Alanscut commented on pull request #390: OPENNLP-912: Rule based sentence detector

2021-02-03 Thread GitBox


Alanscut commented on pull request #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-772945561


   Somehow the Travis CI always timed out here, but succeeded in my fork repo: 
https://travis-ci.org/github/Alanscut/opennlp/builds/757336242 







[GitHub] [opennlp] Alanscut opened a new pull request #390: OPENNLP-912: Rule based sentence detector

2021-01-28 Thread GitBox


Alanscut opened a new pull request #390:
URL: https://github.com/apache/opennlp/pull/390


   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [ ] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [ ] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [ ] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.
   







Re: Rule based sentence detector

2021-01-21 Thread William Colen
Hi Alan,

Do you have a PR for the implementation?

Thank you,
William

On Tue, Jan 19, 2021 at 23:52, Alan Wang wrote:

> Hi all,
>
> I created a rule based sentence detector for OpenNLP
> <https://issues.apache.org/jira/browse/OPENNLP-912>.
> There are two kinds of rules:
>
> 1. break rules: specifying the sentence break
> 2. no-break rules: disallowing the sentence break
>
> All rules have two parts:
>
> Before the break
> After the break
>
> The algorithm idea:
>
> Retrieves the break rules.
> If none of the no-break rules is matched at the break location, the text
> is marked as split and a new segment is created
>
> Features:
>
> Text Cleanup and Preprocessing
> Easy to extend other languages
>
> Reference:
>
> This library use "Golden Rule" test of pragmatic_segmenter
> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>
>
> Currently, the pass rate of test cases is 92.31%. The following test cases
> fail: 39, 50, 53, 52
> For details, see the attachment.
>


Rule based sentence detector

2021-01-19 Thread Alan Wang
Hi all,


I created a rule based sentence detector for OpenNLP.
There are two kinds of rules:

1. break rules: specifying the sentence break
2. no-break rules: disallowing the sentence break

All rules have two parts:

Before the break
After the break

The algorithm idea:

Apply the break rules to find candidate break locations.
If none of the no-break rules matches at a break location, the text is 
marked as split there and a new segment is created.

Features:

Text cleanup and preprocessing
Easy to extend to other languages

Reference:

This library uses the "Golden Rules" tests of pragmatic_segmenter.

Currently, the pass rate of the test cases is 92.31%. The following test cases 
fail: 39, 50, 53, 52.
For details, see the attachment.
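
The algorithm idea above can be sketched as follows. This is a simplified illustration, not the code from the PR: the `Rule` class, the example patterns, and the position-by-position scan are assumptions chosen for readability rather than speed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; names and patterns are assumptions, not the PR's code.
final class RuleBasedSplitterSketch {

  // Every rule has a "before the break" and an "after the break" part.
  static final class Rule {
    private final Pattern before;
    private final Pattern after;

    Rule(String before, String after) {
      // \z pins the "before" part to the end of the text left of the position.
      this.before = Pattern.compile("(?:" + before + ")\\z");
      this.after = Pattern.compile(after);
    }

    boolean matchesAt(String text, int pos) {
      return before.matcher(text.substring(0, pos)).find()
          && after.matcher(text.substring(pos)).lookingAt();
    }
  }

  private final List<Rule> breakRules;
  private final List<Rule> noBreakRules;

  RuleBasedSplitterSketch(List<Rule> breakRules, List<Rule> noBreakRules) {
    this.breakRules = breakRules;
    this.noBreakRules = noBreakRules;
  }

  // Example rule set covering only a few simple English cases.
  static RuleBasedSplitterSketch demo() {
    return new RuleBasedSplitterSketch(
        List.of(new Rule("[.!?]", "\\s+\\p{Lu}")),
        List.of(
            new Rule("\\b\\p{Lu}\\.", "\\s"),          // single-letter initials such as "E."
            new Rule("\\b(?:Mr|Mt|St)\\.", "\\s")));   // a few prepositive abbreviations
  }

  List<String> split(String text) {
    List<String> sentences = new ArrayList<>();
    int start = 0;
    for (int pos = 1; pos < text.length(); pos++) {
      // Split only where a break rule matches and no no-break rule vetoes it.
      if (anyMatch(breakRules, text, pos) && !anyMatch(noBreakRules, text, pos)) {
        addIfNotBlank(sentences, text.substring(start, pos));
        start = pos;
      }
    }
    addIfNotBlank(sentences, text.substring(start));
    return sentences;
  }

  private static boolean anyMatch(List<Rule> rules, String text, int pos) {
    for (Rule rule : rules) {
      if (rule.matchesAt(text, pos)) {
        return true;
      }
    }
    return false;
  }

  private static void addIfNotBlank(List<String> out, String segment) {
    String trimmed = segment.trim();
    if (!trimmed.isEmpty()) {
      out.add(trimmed);
    }
  }
}
```

With the example rules, `RuleBasedSplitterSketch.demo().split("Hello World. My name is Jonas.")` yields two sentences, while "My name is Jonas E. Smith." stays whole because the initials rule vetoes the break after "E.".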





1.) Simple period to end sentence
Hello World. My name is Jonas.
=> ["Hello World.", "My name is Jonas."]

2.) Question mark to end sentence
What is your name? My name is Jonas.
=> ["What is your name?", "My name is Jonas."]

3.) Exclamation point to end sentence
There it is! I found it.
=> ["There it is!", "I found it."]

4.) One letter upper case abbreviations
My name is Jonas E. Smith.
=> ["My name is Jonas E. Smith."]

5.) One letter lower case abbreviations
Please turn to p. 55.
=> ["Please turn to p. 55."]

6.) Two letter lower case abbreviations in the middle of a sentence
Were Jane and co. at the party?
=> ["Were Jane and co. at the party?"]

7.) Two letter upper case abbreviations in the middle of a sentence
They closed the deal with Pitt, Briggs & Co. at noon.
=> ["They closed the deal with Pitt, Briggs & Co. at noon."]

8.) Two letter lower case abbreviations at the end of a sentence
Let's ask Jane and co. They should know.
=> ["Let's ask Jane and co.", "They should know."]

9.) Two letter upper case abbreviations at the end of a sentence
They closed the deal with Pitt, Briggs & Co. It closed yesterday.
=> ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]

10.) Two letter (prepositive) abbreviations
I can see Mt. Fuji from here.
=> ["I can see Mt. Fuji from here."]

11.) Two letter (prepositive & postpositive) abbreviations
St. Michael's Church is on 5th st. near the light.
=> ["St. Michael's Church is on 5th st. near the light."]

12.) Possessive two letter abbreviations
That is JFK Jr.'s book.
=> ["That is JFK Jr.'s book."]

13.) Multi-period abbreviations in the middle of a sentence
I visited the U.S.A. last year.
=> ["I visited the U.S.A. last year."]

14.) Multi-period abbreviations at the end of a sentence
I live in the E.U. How about you?
=> ["I live in the E.U.", "How about you?"]

15.) U.S. as sentence boundary
I live in the U.S. How about you?
=> ["I live in the U.S.", "How about you?"]

16.) U.S. as non sentence boundary with next word capitalized
I work for the U.S. Government in Virginia.
=> ["I work for the U.S. Government in Virginia."]

17.) U.S. as non sentence boundary
I have lived in the U.S. for 20 years.
=> ["I have lived in the U.S. for 20 years."]

18.) A.M. / P.M. as non sentence boundary and sentence boundary
At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then 
went to the store.
=> ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. 
Smith then went to the store."]

19.) Number as non sentence boundary
She has $100.00 in her bag.
=> ["She has $100.00 in her bag."]

20.) Number as sentence boundary
She has $100.00. It is in her bag.
=> ["She has $100.00.", "It is in her bag."]

21.) Parenthetical inside sentence
He teaches science (He previously worked for 5 years as an engineer.) at the 
local University.
=> ["He teaches science (He previously worked for 5 years as an engineer.) at 
the local University."]

22.) Email addresses
Her email is jane@example.com. I sent her an email.
=> ["Her email is jane@example.com.", "I sent her an email."]

23.) Web addresses
The site is: https://www.example.50.com/new-site/awesome_content.html. Please 
check it out.
=> ["The site is: https://www.example.50.com/new-site/awesome_content.html.", 
"Please check it out."]

24.) Single quotations inside sentence
She turned to him, 'This is great.' she said.
=> ["She turned to him, 'This is great.' she said."]

25.) Double quotations inside sentence
She turned to him, "This is great." she said.
=> ["She turned to him, \"This is great.\" she said."]

26.) Double quotations at the end of a sentence
She turned to him, \"This is great.\" She held the book out to show him.
=> ["She turned to him, \"This is great.\"", &q

Re: OPENNLP-912 : Add a rule based sentence detector

2018-04-06 Thread Joern Kottmann
Hello,

could you elaborate a bit on the approach?

Jörn

On Tue, Apr 3, 2018 at 5:24 PM, Isuranga Perera
 wrote:
> Hi All,
>
> I would like to contribute $subject feature. Appreciate if anyone can guide
> me through the process.
>
> Best Regards
> Isuranga Perera


OPENNLP-912 : Add a rule based sentence detector

2018-04-03 Thread Isuranga Perera
Hi All,

I would like to contribute $subject feature. Appreciate if anyone can guide
me through the process.

Best Regards
Isuranga Perera


Fwd: openNLP best practices - sentence detector

2017-12-07 Thread Gabriele Vaccari
Hi all,
I sent this message to the users mailing list but got no response so far. 
Reposting to the dev mailing list.

Also: I'm trying to make some modifications to the code relating to issue 
1163<https://issues.apache.org/jira/browse/OPENNLP-1163> mentioned below, but I 
am having trouble with the style checker. I keep getting a lot of 
NewlineAtEndOfFile errors, even though the files do have a newline at the end 
of the file. I've also made sure to replace \r\n's with \n's, to no avail. I'm 
using Maven 3.3.9 and Eclipse Neon.2.

Thank you

Gabriele


From: Gabriele Vaccari
Sent: Friday, December 1, 2017 13:02
To: 'us...@opennlp.apache.org' <us...@opennlp.apache.org>
Subject: openNLP best practices - sentence detector

Hi all,

I'm trying to use openNLP to train some models for Italian, basically to get 
some familiarity with the API. To provide some background, I'm familiar with 
machine learning concepts and understand what an NLP pipeline looks like; 
however, this is the first time I actually have to go ahead and put together an 
application with all this.

So I started with the sentence detector. I was able to train an Italian SD with 
a corpus of sentences from http://www.corpusitaliano.it/en/. However, the 
performance of the detector is somewhat below my expectations. It makes pretty 
obvious mistakes, like failing to recognize an end-of-sentence full stop 
(example below*), or failing to spot an abbreviation preceded by punctuation 
(I've filed issue 1163 on 
Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).

Even though the documentation is very good, I feel it lacks some best practices 
and suggestions. For instance:

  *   Is my sentence detection training set supposed to consist of coherent 
documents, or will a bunch of random sentences with a blank line every 20-30 
sentences work?
  *   Do my training examples in openNLP native format need to be formatted in 
a special way? Will the algorithm ignore stuff like extra white spaces or tabs 
between words? Do examples with a lot of punctuation, like quotes or 
parentheses, somehow affect the outcome?
  *   How many training examples (or events) are recommended?
  *   Is it better to provide a case-sensitive abbreviation dictionary or a 
case-insensitive one?
  *   Is issue 1163 a known problem? I think other languages, such as French, 
might have the same thing happening.
  *   Are there examples of complete production-grade data sets in Italian or 
other languages that have been successfully used to train openNLP tools?

I believe I could find answers to most of these questions by just looking at 
the code, but someone who has already gone through it could perhaps point me in 
the right direction.
Basically, I'm asking for best practices and pro tips.

Thank you

* failure to recognize EOS full stop:
SENT_1: Molteplici furono i passi che portarono alla nascita di questa 
disciplina.
SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è 
l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel 
1623, grazie a Willhelm 
Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart=edit=1>,
 si arrivò a creare macchine in grado di effettuare calcoli matematici con 
numeri fino a sei cifre, anche se non in maniera autonoma.


Gabriele Vaccari
Dedalus SpA



Re: Sentence Detector

2017-08-27 Thread Manoj B. Narayanan
Thanks William.

On Fri, Aug 25, 2017 at 7:38 PM, William Colen <william.co...@gmail.com>
wrote:

> The writer made a mistake by not adding a space after the dot. The sentence
> detector model will not know how to deal with it because dots that split
> sentences without a following space do not occur very often.
>
> This is very common on social networks. I apply some regex to check that it
> is not a URL, email or number, then add the missing space.
>
>
>
> 2017-08-25 9:31 GMT-03:00 Manoj B. Narayanan <manojb.narayanan2011@gmail.com>:
>
> > Hi,
> >
> > The OpenNLP sentence detector detects sentences when the period at the end
> > of a sentence and the next word are separated by a space. If there is no
> > space in between, it doesn't split them. Is there a way that could help me
> > solve this?
> >
> > *Example.1*
> >
> > It is with great pleasure that I write to invite you to the launch of the
> > University of Reading’s Centre for Food Security on Thursday 25 November
> > 2010. The Centre offers a new focus for research on the challenges
> > of meeting global demands for food in a sustainable way.
> >
> > *Output1*
> > It is with great pleasure that I write to invite you to the launch of the
> > University of Reading’s Centre for Food Security on Thursday 25 November
> > 2010.
> > The Centre offers a new focus for research on the challenges of meeting
> > global demands for food in a sustainable way.
> >
> >
> > *Example.2*
> >
> > It is with great pleasure that I write to invite you to the launch of the
> > University of Reading’s Centre for Food Security on Thursday 25 November
> > 2010.The Centre offers a new focus for research on the challenges of
> > meeting global demands for food in a sustainable way.
> >
> > *Output2*
> > It is with great pleasure that I write to invite you to the launch of the
> > University of Reading’s Centre for Food Security on Thursday 25 November
> > 2010.The Centre offers a new focus for research on the challenges of
> > meeting global demands for food in a sustainable way.
> >
> > Thanks,
> > Manoj.
> >
>


Re: Sentence Detector

2017-08-25 Thread William Colen
The writer made a mistake by not adding a space after the dot. The sentence
detector model will not know how to deal with it because dots that split
sentences without a following space do not occur very often.

This is very common on social networks. I apply some regex to check that it
is not a URL, email or number, then add the missing space.
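
A rough sketch of that kind of preprocessing is shown here. It is not the code used in the reply above: the class MissingSpaceFixer, its patterns, and the rule of requiring two letters or digits before the dot (so that abbreviations like "U.S.A" are left alone) are all assumptions made for illustration. Each whitespace-delimited token that looks like a URL, an email address or a number is kept as-is; in any other token, a dot glued to a capitalized word gets a space added after it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MissingSpaceFixer {

    // Whitespace-delimited tokens; the whitespace between them is preserved.
    private static final Pattern TOKEN = Pattern.compile("\\S+");
    // Tokens that must never be touched (heuristics, assumed for this sketch).
    private static final Pattern URL = Pattern.compile("(?i)(https?://|ftp://|www\\.)");
    private static final Pattern EMAIL = Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");
    private static final Pattern NUMBER = Pattern.compile("[+-]?\\d+([.,]\\d+)*");
    // A dot preceded by two letters/digits and followed by "Xx" (capitalized word).
    private static final Pattern GLUED_BOUNDARY =
        Pattern.compile("(?<=[\\p{L}\\p{N}]{2})\\.(?=\\p{Lu}\\p{Ll})");

    static boolean isProtected(String token) {
        return URL.matcher(token).lookingAt()
            || EMAIL.matcher(token).matches()
            || NUMBER.matcher(token).matches();
    }

    /** Inserts the missing space after sentence-final dots glued to the next word. */
    static String fix(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            String fixed = isProtected(token)
                ? token
                : GLUED_BOUNDARY.matcher(token).replaceAll(". ");
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "on Thursday 25 November 2010. The Centre offers a new focus"
        System.out.println(fix("on Thursday 25 November 2010.The Centre offers a new focus"));
    }
}
```

The repaired text can then be passed to the sentence detector as usual. The heuristics will misfire on some inputs (for example, capitalized file names such as "Report.Final"), so they would need tuning for a given corpus.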



2017-08-25 9:31 GMT-03:00 Manoj B. Narayanan <manojb.narayanan2...@gmail.com>:

> Hi,
>
> The OpenNLP sentence detector detects sentences when the period at the end
> of a sentence and the next word are separated by a space. If there is no
> space in between, it doesn't split them. Is there a way that could help me
> solve this?
>
> *Example.1*
>
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010. The Centre offers a new focus for research on the challenges
> of meeting global demands for food in a sustainable way.
>
> *Output1*
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010.
> The Centre offers a new focus for research on the challenges of meeting
> global demands for food in a sustainable way.
>
>
> *Example.2*
>
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010.The Centre offers a new focus for research on the challenges of
> meeting global demands for food in a sustainable way.
>
> *Output2*
> It is with great pleasure that I write to invite you to the launch of the
> University of Reading’s Centre for Food Security on Thursday 25 November
> 2010.The Centre offers a new focus for research on the challenges of
> meeting global demands for food in a sustainable way.
>
> Thanks,
> Manoj.
>


Sentence Detector

2017-08-25 Thread Manoj B. Narayanan
Hi,

The OpenNLP sentence detector detects sentences when the period at the end
of a sentence and the next word are separated by a space. If there is no
space in between, it doesn't split them. Is there a way that could help me
solve this?

*Example.1*

It is with great pleasure that I write to invite you to the launch of the
University of Reading’s Centre for Food Security on Thursday 25 November
2010. The Centre offers a new focus for research on the challenges
of meeting global demands for food in a sustainable way.

*Output1*
It is with great pleasure that I write to invite you to the launch of the
University of Reading’s Centre for Food Security on Thursday 25 November
2010.
The Centre offers a new focus for research on the challenges of meeting
global demands for food in a sustainable way.


*Example.2*

It is with great pleasure that I write to invite you to the launch of the
University of Reading’s Centre for Food Security on Thursday 25 November
2010.The Centre offers a new focus for research on the challenges of
meeting global demands for food in a sustainable way.

*Output2*
It is with great pleasure that I write to invite you to the launch of the
University of Reading’s Centre for Food Security on Thursday 25 November
2010.The Centre offers a new focus for research on the challenges of
meeting global demands for food in a sustainable way.

Thanks,
Manoj.


Re: Unicode danda in sentence detector

2012-06-03 Thread সৌভিক
On Thu, May 31, 2012 at 1:36 PM, Jörn Kottmann kottm...@gmail.com wrote:

 The Wikipedia reference says it's commonly used for
 Indian languages, so maybe we should just include them,
 e.g. like we did for Portuguese.

 On the other hand, we might also need custom feature
 generation to get good results.
 How are words delimited in Indian languages? With spaces?

words are delimited by spaces in bengali, hindi and most other Indian
languages.


 I suggest first testing by passing in the danda char,
 measuring how it performs, and then deciding if we might also
 need an adaptation of the feature generation for Indian languages.

I started with a very small docset (about 1500 sentences from
news/blogs downloaded from the internet) and no abbreviations, no
custom features. I used the -eosChars '।?!' and got the following
result:

Precision: 0.8967468175388967
Recall: 0.8386243386243386
F-Measure: 0.8667122351332877
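
As a sanity check on these figures, the F-Measure should be the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below recomputes it from the two values reported above; the class name FMeasureCheck is made up for this example and is not an OpenNLP class.

```java
import java.util.Locale;

class FMeasureCheck {

    /** Harmonic mean of precision and recall; 0 if both are 0. */
    static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double precision = 0.8967468175388967; // value reported above
        double recall = 0.8386243386243386;    // value reported above
        // Should agree with the reported F-Measure (about 0.8667).
        System.out.printf(Locale.ROOT, "F-Measure: %.4f%n", fMeasure(precision, recall));
    }
}
```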

As you've mentioned, the danda is a sentence break in multiple Indian
languages. So does it make sense to add it to the Factory?


 Do you have training data you can train it on? If there is a publicly
 available data set, we would appreciate having format support for it
 directly in OpenNLP.


I'll refine the model using a larger dataset and possibly an
abbreviations dictionary. I believe it should be possible to do it with
openly available data.

Cheers!
Soubhik.

 What do you think?

 Jörn


 On 05/31/2012 03:35 AM, William Colen wrote:

 As far as I know you don't need a CLA for a patch. Simply open a Jira and
 attach your patch to it.

 Besides what James pointed out, you may also want to change the EOS characters.
 There are two related new features that are already implemented in the
 trunk:

 https://issues.apache.org/jira/browse/OPENNLP-428
 This one added an optional command line argument where you set the
 end-of-sentence characters. This setting will be persisted to the model.
 If you are using the API, you can create a SentenceDetectorFactory and use it
 to set the EOS chars.

 https://issues.apache.org/jira/browse/OPENNLP-434
 This is a new feature that allows customizing the SentenceDetector. You
 can extend the SentenceDetectorFactory and override methods as needed. You
 can pass in the customized factory using either the command line or the API.


 On Wed, May 30, 2012 at 7:19 PM, James Kosin james.ko...@gmail.com wrote:

 Hi Soubhik,

 Should already be supported.
 You have to pass the -encoding utf8 to the command line interface.

 James

 On 5/30/2012 1:52 PM, Soubhik (সৌভিক) wrote:

 Hi,

 I'm trying to use OpenNLP to train a sentence detector for the Bengali
 language (bn). I would like to add support for the Unicode danda character
 in the opennlp.tools.sentdetect.lang.Factory class. This character is a
 sentence break in Bengali, Hindi and several other Indian languages. The
 code change should be small (  10 lines).

 Is it correct to think that a change of this size will not require a
 CLA?

 Ref: en.wikipedia.org/wiki/Danda

 Regards,
 Soubhik.
 --






--
Soubhik Bhattacharya


Re: Unicode danda in sentence detector

2012-05-31 Thread Jörn Kottmann

The Wikipedia reference says it's commonly used for
Indian languages, so maybe we should just include them,
e.g. like we did for Portuguese.

On the other hand, we might also need custom feature
generation to get good results.
How are words delimited in Indian languages? With spaces?

I suggest first testing by passing in the danda char,
measuring how it performs, and then deciding if we might also
need an adaptation of the feature generation for Indian languages.

Do you have training data you can train it on? If there is a publicly
available data set, we would appreciate having format support for it
directly in OpenNLP.

What do you think?

Jörn

On 05/31/2012 03:35 AM, William Colen wrote:

As far as I know you don't need a CLA for a patch. Simply open a Jira and
attach your patch to it.

Besides what James pointed out, you may also want to change the EOS characters.
There are two related new features that are already implemented in the
trunk:

https://issues.apache.org/jira/browse/OPENNLP-428
This one added an optional command line argument where you set the
end-of-sentence characters. This setting will be persisted to the model. If
you are using the API, you can create a SentenceDetectorFactory and use it
to set the EOS chars.

https://issues.apache.org/jira/browse/OPENNLP-434
This is a new feature that allows customizing the SentenceDetector. You can
extend the SentenceDetectorFactory and override methods as needed. You can
pass in the customized factory using either the command line or the API.


On Wed, May 30, 2012 at 7:19 PM, James Kosin james.ko...@gmail.com wrote:


Hi Soubhik,

Should already be supported.
You have to pass the -encoding utf8 to the command line interface.

James

On 5/30/2012 1:52 PM, Soubhik (সৌভিক) wrote:

Hi,

I'm trying to use OpenNLP to train a sentence detector for the Bengali
language (bn). I would like to add support for the Unicode danda character
in the opennlp.tools.sentdetect.lang.Factory class. This character is a
sentence break in Bengali, Hindi and several other Indian languages. The
code change should be small (  10 lines).

Is it correct to think that a change of this size will not require a CLA?

Ref: en.wikipedia.org/wiki/Danda

Regards,
Soubhik.
--







Unicode danda in sentence detector

2012-05-30 Thread সৌভিক
Hi,

I'm trying to use OpenNLP to train a sentence detector for the Bengali
language (bn). I would like to add support for the Unicode danda character
in the opennlp.tools.sentdetect.lang.Factory class. This character is a
sentence break in Bengali, Hindi and several other Indian languages. The
code change should be small ( 10 lines).

Is it correct to think that a change of this size will not require a CLA?

Ref: en.wikipedia.org/wiki/Danda

Regards,
Soubhik.
--