Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)
rzo1 commented on code in PR #390: URL: https://github.com/apache/opennlp/pull/390#discussion_r1408871552 ## opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java: ## @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.sentdetect.segment; + +import java.io.IOException; +import java.io.Reader; +import java.util.ArrayList; +import java.util.List; +import java.util.Set; +import java.util.TreeSet; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +import opennlp.tools.util.StringUtil; + + +public class SentenceTokenizerME implements SentenceTokenizer { Review Comment: +1 to @jzonthemtn comment -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@opennlp.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)
private String sentence; + + private int start; + + private int end; + + private CharSequence text; + + private Reader reader; + + private int bufferLength; + + private LanguageTool languageTool; + + private Matcher beforeMatcher; + + private Matcher afterMatcher; + + boolean found; + + private Set breakSections; + + private List noBreakSections; + + public SentenceTokenizerME(LanguageTool languageTool, CharSequence text) { Review Comment: I wonder if we can rebuild this to avoid creating a Tokenizer for every piece of text? Wouldn't it be of more value to provide the text as a method parameter and compute the stuff on the fly? It would also allow us to make it threadsafe in the future. ## opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/Section.java: ## @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.sentdetect.segment; + +public class Section { Review Comment: record? ## opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java: ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.sentdetect.segment; + +import java.util.List; + +/** + * The interface for rule based sentence detector + */ +public interface SentenceTokenizer { Review Comment: +1 (and the method would require a proper description). It is totally unclear what the provided method does/shall do from an implementor perspective. ## opennlp-tools/src/test/java/opennlp/tools/sentdetect/segment/GoldenRulesTest.java: ## @@ -0,0 +1,527 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package opennlp.tools.sentdetect.segment; + +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.junit.jupiter.api.Disabled; +import org.junit.jupiter.api.Test; + +import opennlp.tools.util.featuregen.GeneratorFactory; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.fail; + +/** + * Thanks for the GoldenRules of + * <a href="https://github.com/diasks2/pragmatic_segmenter">pragmatic_segmenter</a> + */ +public class GoldenRulesTest { + + public Cleaner cleaner = new Cleaner(); + + public List segment(String text) { +if (cleaner != null) { + text = cleaner.clean(text); +} + +InputStream inputStream = getClass().getResourceAsStream( Review Comment: we should close the stream + read it once and consume the cached result for every test run. ## opennlp-tools/src/m
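The "close the stream and read it once" suggestion in the review comment above could be sketched roughly as follows. This is only an illustration under assumptions: `CachedResource` is a hypothetical helper that does not appear in the PR, and the stream source is passed in as a `Supplier` so the example runs without any classpath resource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical helper (not from the PR): reads a resource once, closes it, caches the text. */
final class CachedResource {

  private final Supplier<InputStream> source;
  private String cached; // guarded by "this"

  CachedResource(Supplier<InputStream> source) {
    this.source = source;
  }

  /** Returns the resource content, opening the stream only on the first call. */
  synchronized String get() {
    if (cached == null) {
      // try-with-resources closes the stream even if reading fails
      try (InputStream in = source.get()) {
        if (in == null) {
          throw new IOException("resource not found");
        }
        cached = new String(in.readAllBytes(), StandardCharsets.UTF_8);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return cached;
  }

  public static void main(String[] args) {
    // Stand-in source; the "<rules/>" payload is made up for the demo.
    CachedResource rules = new CachedResource(
        () -> new ByteArrayInputStream("<rules/>".getBytes(StandardCharsets.UTF_8)));
    System.out.println(rules.get());
    System.out.println(rules.get()); // second call is served from the cache
  }
}
```

In a test class, the supplier would presumably wrap a `getResourceAsStream` call and the helper would be held in a static field, so every test method reuses the same parsed input.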
Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)
rzo1 commented on PR #390: URL: https://github.com/apache/opennlp/pull/390#issuecomment-1831371994 > I would like to better understand the origins of the rules used. Does there need to be license attribution? @jzonthemtn It looks like this "golden-rules.txt" is from https://github.com/diasks2/pragmatic_segmenter#the-golden-rules (at least, if we trust the textual description). The rules are also available for some other languages: https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt. The library itself, including this content, is MIT-licensed, so there is no compliance issue, but we would need to attribute it accordingly.
[GitHub] [opennlp] jzonthemtn commented on pull request #390: OPENNLP-912: Rule based sentence detector
jzonthemtn commented on pull request #390: URL: https://github.com/apache/opennlp/pull/390#issuecomment-1080742326 I would like to better understand the origins of the rules used. Does there need to be license attribution?
[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector
jzonthemtn commented on a change in pull request #390: URL: https://github.com/apache/opennlp/pull/390#discussion_r836508689 ## File path: opennlp-tools/src/main/resources/opennlp/tools/sentdetect/segment/rules.xml ## @@ -0,0 +1,131 @@ + Review comment: What is the origin of these rules?
[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector
jzonthemtn commented on a change in pull request #390: URL: https://github.com/apache/opennlp/pull/390#discussion_r836500546 ## File path: opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java ## @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.sentdetect.segment; + +import java.io.IOException; +import java.io.Reader; +import java.util.ArrayList; +import java.util.List; +import java.util.Set; +import java.util.TreeSet; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +import opennlp.tools.util.StringUtil; + + +public class SentenceTokenizerME implements SentenceTokenizer { Review comment: The `ME` name in the file names denotes "maximum entropy" as the method for the implementation. Since this implementation doesn't use a trained model, could it be named something like `RulesBasedSentenceDetector`? (Open to other names, too.)
[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector
jzonthemtn commented on a change in pull request #390: URL: https://github.com/apache/opennlp/pull/390#discussion_r836497822 ## File path: opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.sentdetect.segment; + +import java.util.List; + +/** + * The interface for rule based sentence detector + */ +public interface SentenceTokenizer { Review comment: There is an `opennlp.tools.sentdetect.SentenceDetector` interface that resembles this interface. Since the purpose of the two interfaces seems to be the same (to break text into sentences), is it possible to reuse the other interface?
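To illustrate the reuse asked about above, here is a minimal sketch of a detector that implements a sentence-detector-style interface and takes the text as a method argument, which also fits the review suggestion elsewhere in this thread about not creating a tokenizer per text. Everything named here is an assumption: `SimpleSentenceDetector` is a local stand-in declared only so the example compiles on its own (the signatures of the real `opennlp.tools.sentdetect.SentenceDetector` may differ), and `RuleBasedDetectorSketch` with its single regex is hypothetical, not code from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Local stand-in for illustration; the real OpenNLP interface may differ. */
interface SimpleSentenceDetector {
  String[] sentDetect(CharSequence text);
}

/** Hypothetical stateless detector: all per-text state lives in local variables. */
final class RuleBasedDetectorSketch implements SimpleSentenceDetector {

  // Compiled once and shared; Pattern instances are immutable and thread-safe.
  // Toy rule: whitespace that directly follows '.', '?' or '!' is a boundary.
  private static final Pattern BOUNDARY = Pattern.compile("(?<=[.?!])\\s+");

  @Override
  public String[] sentDetect(CharSequence text) {
    List<String> sentences = new ArrayList<>();
    Matcher m = BOUNDARY.matcher(text); // created per call, never stored in a field
    int start = 0;
    while (m.find()) {
      sentences.add(text.subSequence(start, m.start()).toString());
      start = m.end();
    }
    if (start < text.length()) {
      sentences.add(text.subSequence(start, text.length()).toString());
    }
    return sentences.toArray(new String[0]);
  }

  public static void main(String[] args) {
    SimpleSentenceDetector detector = new RuleBasedDetectorSketch();
    for (String sentence : detector.sentDetect("There it is! I found it.")) {
      System.out.println(sentence);
    }
  }
}
```

Because the instance holds no mutable fields, one detector could be shared across threads, whereas a design that keeps the text and matchers as fields needs a new object per input.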
[GitHub] [opennlp] jzonthemtn commented on pull request #390: OPENNLP-912: Rule based sentence detector
jzonthemtn commented on pull request #390: URL: https://github.com/apache/opennlp/pull/390#issuecomment-774679660 Thanks a lot for this contribution! This is something OpenNLP has needed. I will take a closer look. Built and tested successfully. ``` Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f) Maven home: /opt/apache-maven Java version: 11.0.9.1, vendor: Ubuntu, runtime: /usr/lib/jvm/java-11-openjdk-amd64 Default locale: en_US, platform encoding: UTF-8 OS name: "linux", version: "5.8.0-40-generic", arch: "amd64", family: "unix" ```
Re:Re: Rule based sentence detector
Hi all, I have created a PR in the GitHub repo: https://github.com/apache/opennlp/pull/390. If anyone has comments, please feel free to share them. Thanks Alan At 2021-01-22 02:59:40, "William Colen" wrote: >Hi Alan, > >Do you have a PR for the implementation? > >Thank you, >William > >On Tue, Jan 19, 2021 at 23:52, Alan Wang wrote: > >> Hi all, >> >> I created a rule based sentence detector for OpenNLP >> <https://issues.apache.org/jira/browse/OPENNLP-912>. >> There are two kinds of rules: >> >> 1. break rules: specifying the sentence break >> 2. no-break rules: disallowing the sentence break >> >> All rules have two parts: >> >> Before the break >> After the break >> >> The algorithm idea: >> >> Retrieves the break rules. >> If none of the no-break rules is matched at the break location, the text >> is marked as split and a new segment is created >> >> Features: >> >> Text Cleanup and Preprocessing >> Easy to extend other languages >> >> Reference: >> >> This library use "Golden Rule" test of pragmatic_segmenter >> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules> >> >> Currently, the pass rate of test cases is 92.31%. The following test cases >> fail: 39, 50, 53, 52 >> For details, see the attachment.
[GitHub] [opennlp] Alanscut commented on pull request #390: OPENNLP-912: Rule based sentence detector
Alanscut commented on pull request #390: URL: https://github.com/apache/opennlp/pull/390#issuecomment-772945561 Somehow the Travis CI always timed out here, but succeeded in my fork repo: https://travis-ci.org/github/Alanscut/opennlp/builds/757336242
[GitHub] [opennlp] Alanscut opened a new pull request #390: OPENNLP-912: Rule based sentence detector
Alanscut opened a new pull request #390: URL: https://github.com/apache/opennlp/pull/390 Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [ ] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [ ] Has your PR been rebased against the latest commit within the target branch (typically master)? - [ ] Is your initial contribution a single, squashed commit? ### For code changes: - [ ] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [ ] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
Re: Rule based sentence detector
Hi Alan, Do you have a PR for the implementation? Thank you, William On Tue, Jan 19, 2021 at 23:52, Alan Wang wrote: > Hi all, > > I created a rule based sentence detector for OpenNLP > <https://issues.apache.org/jira/browse/OPENNLP-912>. > There are two kinds of rules: > > 1. break rules: specifying the sentence break > 2. no-break rules: disallowing the sentence break > > All rules have two parts: > > Before the break > After the break > > The algorithm idea: > > Retrieves the break rules. > If none of the no-break rules is matched at the break location, the text > is marked as split and a new segment is created > > Features: > > Text Cleanup and Preprocessing > Easy to extend other languages > > Reference: > > This library use "Golden Rule" test of pragmatic_segmenter > <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules> > > Currently, the pass rate of test cases is 92.31%. The following test cases > fail: 39, 50, 53, 52 > For details, see the attachment.
Rule based sentence detector
Hi all, I created a rule-based sentence detector for OpenNLP. There are two kinds of rules: 1. break rules: specifying the sentence break 2. no-break rules: disallowing the sentence break All rules have two parts: before the break and after the break. The algorithm idea: apply the break rules to find candidate break locations. If none of the no-break rules matches at a break location, the text is split there and a new segment is created. Features: text cleanup and preprocessing; easy to extend to other languages. Reference: This library uses the "Golden Rules" tests of pragmatic_segmenter. Currently, the pass rate of the test cases is 92.31%. The following test cases fail: 39, 50, 53, 52. For details, see the attachment. 1.) Simple period to end sentence Hello World. My name is Jonas. => ["Hello World.", "My name is Jonas."] 2.) Question mark to end sentence What is your name? My name is Jonas. => ["What is your name?", "My name is Jonas."] 3.) Exclamation point to end sentence There it is! I found it. => ["There it is!", "I found it."] 4.) One letter upper case abbreviations My name is Jonas E. Smith. => ["My name is Jonas E. Smith."] 5.) One letter lower case abbreviations Please turn to p. 55. => ["Please turn to p. 55."] 6.) Two letter lower case abbreviations in the middle of a sentence Were Jane and co. at the party? => ["Were Jane and co. at the party?"] 7.) Two letter upper case abbreviations in the middle of a sentence They closed the deal with Pitt, Briggs & Co. at noon. => ["They closed the deal with Pitt, Briggs & Co. at noon."] 8.) Two letter lower case abbreviations at the end of a sentence Let's ask Jane and co. They should know. => ["Let's ask Jane and co.", "They should know."] 9.) Two letter upper case abbreviations at the end of a sentence They closed the deal with Pitt, Briggs & Co. It closed yesterday. => ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."] 10.) Two letter (prepositive) abbreviations I can see Mt. Fuji from here. => ["I can see Mt. 
Fuji from here."] 11.) Two letter (prepositive & postpositive) abbreviations St. Michael's Church is on 5th st. near the light. => ["St. Michael's Church is on 5th st. near the light."] 12.) Possessive two letter abbreviations That is JFK Jr.'s book. => ["That is JFK Jr.'s book."] 13.) Multi-period abbreviations in the middle of a sentence I visited the U.S.A. last year. => ["I visited the U.S.A. last year."] 14.) Multi-period abbreviations at the end of a sentence I live in the E.U. How about you? => ["I live in the E.U.", "How about you?"] 15.) U.S. as sentence boundary I live in the U.S. How about you? => ["I live in the U.S.", "How about you?"] 16.) U.S. as non sentence boundary with next word capitalized I work for the U.S. Government in Virginia. => ["I work for the U.S. Government in Virginia."] 17.) U.S. as non sentence boundary I have lived in the U.S. for 20 years. => ["I have lived in the U.S. for 20 years."] 18.) A.M. / P.M. as non sentence boundary and sentence boundary At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store. => ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."] 19.) Number as non sentence boundary She has $100.00 in her bag. => ["She has $100.00 in her bag."] 20.) Number as sentence boundary She has $100.00. It is in her bag. => ["She has $100.00.", "It is in her bag."] 21.) Parenthetical inside sentence He teaches science (He previously worked for 5 years as an engineer.) at the local University. => ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."] 22.) Email addresses Her email is jane@example.com. I sent her an email. => ["Her email is jane@example.com.", "I sent her an email."] 23.) Web addresses The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out. 
=> ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."] 24.) Single quotations inside sentence She turned to him, 'This is great.' she said. => ["She turned to him, 'This is great.' she said."] 25.) Double quotations inside sentence She turned to him, "This is great." she said. => ["She turned to him, \"This is great.\" she said."] 26.) Double quotations at the end of a sentence She turned to him, \"This is great.\" She held the book out to show him. => ["She turned to him, \"This is great.\"", &q
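The break / no-break idea described in this message can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation from the PR: the class `RuleSketch`, the nested `Rule` type, and the two toy regex rules are all hypothetical, chosen only so that golden rules 1, 3 and 10 from the list above come out as expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of break / no-break rule matching; not the code from the PR. */
final class RuleSketch {

  /** A rule has two parts: what must precede the position and what must follow it. */
  static final class Rule {
    private final Pattern before;
    private final Pattern after;

    Rule(String beforeRegex, String afterRegex) {
      // "$" ties the before-part to the end of the text preceding the position.
      this.before = Pattern.compile("(?:" + beforeRegex + ")$");
      this.after = Pattern.compile(afterRegex);
    }

    /** True if both parts of the rule match around the given position. */
    boolean matchesAt(CharSequence text, int pos) {
      return before.matcher(text).region(0, pos).find()
          && after.matcher(text).region(pos, text.length()).lookingAt();
    }
  }

  // Toy rules (assumptions): break after sentence punctuation followed by whitespace,
  // unless the punctuation belongs to one of a few abbreviations.
  static final List<Rule> BREAK_RULES = List.of(new Rule("[.?!]", "\\s"));
  static final List<Rule> NO_BREAK_RULES = List.of(new Rule("\\b(?:Mt|Mr|St)\\.", "\\s"));

  /** Splits wherever a break rule matches and no no-break rule matches. */
  static List<String> segment(CharSequence text) {
    List<String> sentences = new ArrayList<>();
    int start = 0;
    for (int pos = 1; pos < text.length(); pos++) {
      if (anyMatch(BREAK_RULES, text, pos) && !anyMatch(NO_BREAK_RULES, text, pos)) {
        addIfNotBlank(sentences, text.subSequence(start, pos));
        start = pos;
      }
    }
    addIfNotBlank(sentences, text.subSequence(start, text.length()));
    return sentences;
  }

  private static boolean anyMatch(List<Rule> rules, CharSequence text, int pos) {
    for (Rule rule : rules) {
      if (rule.matchesAt(text, pos)) {
        return true;
      }
    }
    return false;
  }

  private static void addIfNotBlank(List<String> out, CharSequence segment) {
    String trimmed = segment.toString().trim();
    if (!trimmed.isEmpty()) {
      out.add(trimmed);
    }
  }

  public static void main(String[] args) {
    System.out.println(segment("Hello World. My name is Jonas."));
    System.out.println(segment("I can see Mt. Fuji from here."));
  }
}
```

The sketch rescans the preceding text at every position, which is quadratic in the worst case; it is meant to show the rule semantics, and a real implementation would likely iterate over break-rule matches directly.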
Re: OPENNLP-912 : Add a rule based sentence detector
Hello, could you elaborate a bit on the approach? Jörn On Tue, Apr 3, 2018 at 5:24 PM, Isuranga Perera wrote: > Hi All, > > I would like to contribute $subject feature. Appreciate if anyone can guide > me through the process. > > Best Regards > Isuranga Perera
OPENNLP-912 : Add a rule based sentence detector
Hi All, I would like to contribute $subject feature. Appreciate if anyone can guide me through the process. Best Regards Isuranga Perera