Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)

2023-11-29 Thread via GitHub


rzo1 commented on code in PR #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r1408871552


##
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java:
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.io.IOException;
+import java.io.Reader;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import opennlp.tools.util.StringUtil;
+
+
+public class SentenceTokenizerME implements SentenceTokenizer {

Review Comment:
   +1 to @jzonthemtn comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)

2023-11-29 Thread via GitHub
  private String sentence;
+
+  private int start;
+
+  private int end;
+
+  private CharSequence text;
+
+  private Reader reader;
+
+  private int bufferLength;
+
+  private LanguageTool languageTool;
+
+  private Matcher beforeMatcher;
+
+  private Matcher afterMatcher;
+
+  boolean found;
+
+  private Set breakSections;
+
+  private List noBreakSections;
+
+  public SentenceTokenizerME(LanguageTool languageTool, CharSequence text) {

Review Comment:
   I wonder if we can rebuild this to avoid creating a Tokenizer for every 
piece of text? Wouldn't it be of more value to provide the text as a method 
parameter and compute the stuff on the fly? It would also allow us to make it 
threadsafe in the future.
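
One way to read this suggestion, as a minimal sketch only: keep immutable configuration in fields and pass the text to the method, so nothing per-text is stored on the instance. The class name `StatelessSentenceSplitter`, the `split` method, and the single break pattern are hypothetical stand-ins, not the PR's `LanguageTool`-driven rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the PR's API.
final class StatelessSentenceSplitter {

  // Immutable and safe to share: compiled once, reused for every call.
  private final Pattern breakPattern;

  StatelessSentenceSplitter(Pattern breakPattern) {
    this.breakPattern = breakPattern;
  }

  // All per-text state (the Matcher, offsets, results) lives in locals,
  // so one instance can serve many texts and, in principle, many threads.
  List<String> split(CharSequence text) {
    List<String> sentences = new ArrayList<>();
    Matcher m = breakPattern.matcher(text);
    int start = 0;
    while (m.find()) {
      String s = text.subSequence(start, m.end()).toString().trim();
      if (!s.isEmpty()) {
        sentences.add(s);
      }
      start = m.end();
    }
    String rest = text.subSequence(start, text.length()).toString().trim();
    if (!rest.isEmpty()) {
      sentences.add(rest);
    }
    return sentences;
  }
}
```

A caller would then build the splitter once, for example with `Pattern.compile("[.!?]\\s+")`, and call `split(text)` for each input.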



##
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/Section.java:
##
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+public class Section {

Review Comment:
   record?
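
If the project's Java baseline supports records (Java 16 or later), a plain value holder could shrink to something like the sketch below. The components `start` and `end` are assumed for illustration; the actual fields of `Section` are not shown in this thread.

```java
// Hypothetical shape: the real Section fields are not visible in this review.
record Section(int start, int end) {

  // Compact constructor: reject negative or inverted ranges up front.
  Section {
    if (start < 0 || end < start) {
      throw new IllegalArgumentException("invalid range: " + start + ".." + end);
    }
  }

  // Derived value; equals, hashCode and toString are generated by the record.
  int length() {
    return end - start;
  }
}
```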



##
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java:
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.util.List;
+
+/**
+ * The interface for rule based sentence detector
+ */
+public interface SentenceTokenizer {

Review Comment:
   +1 (and the method would require a proper description). It is totally 
unclear what the provided method does/shall do from an implementor perspective.



##
opennlp-tools/src/test/java/opennlp/tools/sentdetect/segment/GoldenRulesTest.java:
##
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.junit.jupiter.api.Disabled;
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.util.featuregen.GeneratorFactory;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.fail;
+
+/**
+ * Thanks for the GoldenRules of
+ * <a href="https://github.com/diasks2/pragmatic_segmenter">pragmatic_segmenter</a>
+ */
+public class GoldenRulesTest {
+
+  public Cleaner cleaner = new Cleaner();
+
+  public List segment(String text) {
+if (cleaner != null) {
+  text = cleaner.clean(text);
+}
+
+InputStream inputStream = getClass().getResourceAsStream(

Review Comment:
   we should close the stream + read it once and consume the cached result for 
every test run.
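
A minimal sketch of what "close the stream and read it once" could look like. The helper `CachedResource` and its `StreamSource` interface are hypothetical names introduced here; the PR may well solve this differently, for instance by parsing the rules once in a static initializer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: reads a resource once and reuses the bytes afterwards.
final class CachedResource {

  // Supplies the underlying stream, e.g. a getResourceAsStream(...) call.
  interface StreamSource {
    InputStream open() throws IOException;
  }

  private final StreamSource source;
  private byte[] cached; // filled on first access

  CachedResource(StreamSource source) {
    this.source = source;
  }

  synchronized byte[] bytes() {
    if (cached == null) {
      // try-with-resources closes the stream even if reading fails.
      try (InputStream in = source.open()) {
        if (in == null) {
          throw new IOException("resource not found");
        }
        cached = in.readAllBytes();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return cached.clone();
  }

  // Each caller gets a fresh in-memory stream over the cached bytes.
  InputStream newStream() {
    return new ByteArrayInputStream(bytes());
  }
}
```

In the test class this could be held in a `static final` field whose source wraps the existing `getResourceAsStream` call, so every test method reuses the same bytes.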



##
opennlp-tools/src/m

Re: [PR] OPENNLP-912: Rule based sentence detector (opennlp)

2023-11-28 Thread via GitHub


rzo1 commented on PR #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-1831371994

   > I would like to better understand the origins of the rules used. Does 
there need to be license attribution?
   
   @jzonthemtn It looks like this "golden-rules.txt" comes from 
https://github.com/diasks2/pragmatic_segmenter#the-golden-rules (at least, if 
we trust the textual description). It is also available for some other 
languages: https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt - 
the library itself, including this content, is MIT-licensed, so there is no 
compliance issue, but we would need to attribute it accordingly.





[GitHub] [opennlp] jzonthemtn commented on pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on pull request #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-1080742326


   I would like to better understand the origins of the rules used. Does there 
need to be license attribution?






[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on a change in pull request #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r836508689



##
File path: 
opennlp-tools/src/main/resources/opennlp/tools/sentdetect/segment/rules.xml
##
@@ -0,0 +1,131 @@
+

Review comment:
   What is the origin of these rules?








[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on a change in pull request #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r836500546



##
File path: 
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.io.IOException;
+import java.io.Reader;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import opennlp.tools.util.StringUtil;
+
+
+public class SentenceTokenizerME implements SentenceTokenizer {

Review comment:
   The `ME` name in the file names denotes "maximum entropy" as the method 
for the implementation. Since this implementation doesn't use a trained model, 
could it be named something like `RulesBasedSentenceDetector`? (Open to other 
names, too.)








[GitHub] [opennlp] jzonthemtn commented on a change in pull request #390: OPENNLP-912: Rule based sentence detector

2022-03-28 Thread GitBox


jzonthemtn commented on a change in pull request #390:
URL: https://github.com/apache/opennlp/pull/390#discussion_r836497822



##
File path: 
opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect.segment;
+
+import java.util.List;
+
+/**
+ * The interface for rule based sentence detector
+ */
+public interface SentenceTokenizer {

Review comment:
   There is an `opennlp.tools.sentdetect.SentenceDetector` interface that 
resembles this interface. Since the purpose of the two interfaces seems to be 
the same (to break text into sentences), is it possible to reuse the other 
interface?








[GitHub] [opennlp] jzonthemtn commented on pull request #390: OPENNLP-912: Rule based sentence detector

2021-02-07 Thread GitBox


jzonthemtn commented on pull request #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-774679660


   Thanks a lot for this contribution! This is something OpenNLP has needed. I 
will take a closer look.
   
   Built and tested successfully.
   
   ```
   Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
   Maven home: /opt/apache-maven
   Java version: 11.0.9.1, vendor: Ubuntu, runtime: 
/usr/lib/jvm/java-11-openjdk-amd64
   Default locale: en_US, platform encoding: UTF-8
   OS name: "linux", version: "5.8.0-40-generic", arch: "amd64", family: "unix"
   ```







Re:Re: Rule based sentence detector

2021-02-03 Thread Alan Wang



Hi all, I have created a PR in the GitHub repo: 
https://github.com/apache/opennlp/pull/390. If anyone has comments, please feel 
free to share them.




Thanks
Alan








At 2021-01-22 02:59:40, "William Colen"  wrote:
>Hi Alan,
>
>Do you have a PR for the implementation?
>
>Thank you,
>William
>
>On Tue, Jan 19, 2021 at 23:52, Alan Wang wrote:
>
>> Hi all,
>>
>> I created a rule based sentence detector for OpenNLP
>> <https://issues.apache.org/jira/browse/OPENNLP-912>.
>> There are two kinds of rules:
>>
>> 1. break rules: specifying the sentence break
>> 2. no-break rules: disallowing the sentence break
>>
>> All rules have two parts:
>>
>> Before the break
>> After the break
>>
>> The algorithm idea:
>>
>> Retrieves the break rules.
>> If none of the no-break rules is matched at the break location, the text
>> is marked as split and a new segment is created
>>
>> Features:
>>
>> Text Cleanup and Preprocessing
>> Easy to extend other languages
>>
>> Reference:
>>
>> This library use "Golden Rule" test of pragmatic_segmenter
>> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>
>>
>> Currently, the pass rate of test cases is 92.31%. The following test cases
>> fail: 39, 50, 53, 52
>> For details, see the attachment.
>>
>> --
>>
>>
>>
>>
>>
>>


[GitHub] [opennlp] Alanscut commented on pull request #390: OPENNLP-912: Rule based sentence detector

2021-02-03 Thread GitBox


Alanscut commented on pull request #390:
URL: https://github.com/apache/opennlp/pull/390#issuecomment-772945561


   Somehow the Travis CI always timed out here, but succeeded in my fork repo: 
https://travis-ci.org/github/Alanscut/opennlp/builds/757336242 







[GitHub] [opennlp] Alanscut opened a new pull request #390: OPENNLP-912: Rule based sentence detector

2021-01-28 Thread GitBox


Alanscut opened a new pull request #390:
URL: https://github.com/apache/opennlp/pull/390


   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [ ] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [ ] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [ ] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.
   







Re: Rule based sentence detector

2021-01-21 Thread William Colen
Hi Alan,

Do you have a PR for the implementation?

Thank you,
William

On Tue, Jan 19, 2021 at 23:52, Alan Wang wrote:

> Hi all,
>
> I created a rule based sentence detector for OpenNLP
> <https://issues.apache.org/jira/browse/OPENNLP-912>.
> There are two kinds of rules:
>
> 1. break rules: specifying the sentence break
> 2. no-break rules: disallowing the sentence break
>
> All rules have two parts:
>
> Before the break
> After the break
>
> The algorithm idea:
>
> Retrieves the break rules.
> If none of the no-break rules is matched at the break location, the text
> is marked as split and a new segment is created
>
> Features:
>
> Text Cleanup and Preprocessing
> Easy to extend other languages
>
> Reference:
>
> This library use "Golden Rule" test of pragmatic_segmenter
> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>
>
> Currently, the pass rate of test cases is 92.31%. The following test cases
> fail: 39, 50, 53, 52
> For details, see the attachment.
>
> --
>
>
>
>
>
>


Rule based sentence detector

2021-01-19 Thread Alan Wang
Hi all,


I created a rule-based sentence detector for OpenNLP
<https://issues.apache.org/jira/browse/OPENNLP-912>.

There are two kinds of rules:

1. break rules: specifying the sentence break
2. no-break rules: disallowing the sentence break

All rules have two parts:

1. Before the break
2. After the break

The algorithm idea:

1. Retrieve the break rules.
2. If none of the no-break rules matches at the break location, the text is
   marked as split and a new segment is created.

Features:

1. Text cleanup and preprocessing
2. Easy to extend to other languages

Reference:

This library uses the "Golden Rules" tests of pragmatic_segmenter
<https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>.

Currently, the pass rate of the test cases is 92.31%. The following test cases 
fail: 39, 50, 53, 52.
For details, see the attachment.
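
The break / no-break idea described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: the names `BreakRuleSketch`, `Rule`, and `segment` are made up here, rules are plain regular expressions, and every position is tested, which is quadratic and only acceptable for a sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the break / no-break idea; not the PR's classes.
final class BreakRuleSketch {

  // A rule has two parts: what must precede and what must follow a position.
  static final class Rule {
    private final Pattern before;
    private final Pattern after;

    Rule(String before, String after) {
      // The leading ".*" lets matches() test "text up to pos ends with <before>".
      this.before = Pattern.compile("(?s:.*)(?:" + before + ")");
      this.after = Pattern.compile(after);
    }

    boolean matchesAt(CharSequence text, int pos) {
      return before.matcher(text).region(0, pos).matches()
          && after.matcher(text).region(pos, text.length()).lookingAt();
    }
  }

  static boolean anyMatch(List<Rule> rules, CharSequence text, int pos) {
    for (Rule rule : rules) {
      if (rule.matchesAt(text, pos)) {
        return true;
      }
    }
    return false;
  }

  // Split where a break rule matches and no no-break rule vetoes the position.
  static List<String> segment(String text, List<Rule> breaks, List<Rule> noBreaks) {
    List<String> out = new ArrayList<>();
    int start = 0;
    for (int pos = 1; pos < text.length(); pos++) {
      if (anyMatch(breaks, text, pos) && !anyMatch(noBreaks, text, pos)) {
        out.add(text.substring(start, pos).trim());
        start = pos;
      }
    }
    String rest = text.substring(start).trim();
    if (!rest.isEmpty()) {
      out.add(rest);
    }
    return out;
  }
}
```

With a break rule such as `new Rule("[.!?]", "\\s+[A-Z]")` and a no-break rule such as `new Rule("\\b(?:Mr|Mt|St)\\.", "\\s")`, the period after "Mt." is vetoed while an ordinary sentence-final period still splits. Both example rules are illustrative and not taken from the PR's `rules.xml`.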





1.) Simple period to end sentence
Hello World. My name is Jonas.
=> ["Hello World.", "My name is Jonas."]

2.) Question mark to end sentence
What is your name? My name is Jonas.
=> ["What is your name?", "My name is Jonas."]

3.) Exclamation point to end sentence
There it is! I found it.
=> ["There it is!", "I found it."]

4.) One letter upper case abbreviations
My name is Jonas E. Smith.
=> ["My name is Jonas E. Smith."]

5.) One letter lower case abbreviations
Please turn to p. 55.
=> ["Please turn to p. 55."]

6.) Two letter lower case abbreviations in the middle of a sentence
Were Jane and co. at the party?
=> ["Were Jane and co. at the party?"]

7.) Two letter upper case abbreviations in the middle of a sentence
They closed the deal with Pitt, Briggs & Co. at noon.
=> ["They closed the deal with Pitt, Briggs & Co. at noon."]

8.) Two letter lower case abbreviations at the end of a sentence
Let's ask Jane and co. They should know.
=> ["Let's ask Jane and co.", "They should know."]

9.) Two letter upper case abbreviations at the end of a sentence
They closed the deal with Pitt, Briggs & Co. It closed yesterday.
=> ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]

10.) Two letter (prepositive) abbreviations
I can see Mt. Fuji from here.
=> ["I can see Mt. Fuji from here."]

11.) Two letter (prepositive & postpositive) abbreviations
St. Michael's Church is on 5th st. near the light.
=> ["St. Michael's Church is on 5th st. near the light."]

12.) Possessive two letter abbreviations
That is JFK Jr.'s book.
=> ["That is JFK Jr.'s book."]

13.) Multi-period abbreviations in the middle of a sentence
I visited the U.S.A. last year.
=> ["I visited the U.S.A. last year."]

14.) Multi-period abbreviations at the end of a sentence
I live in the E.U. How about you?
=> ["I live in the E.U.", "How about you?"]

15.) U.S. as sentence boundary
I live in the U.S. How about you?
=> ["I live in the U.S.", "How about you?"]

16.) U.S. as non sentence boundary with next word capitalized
I work for the U.S. Government in Virginia.
=> ["I work for the U.S. Government in Virginia."]

17.) U.S. as non sentence boundary
I have lived in the U.S. for 20 years.
=> ["I have lived in the U.S. for 20 years."]

18.) A.M. / P.M. as non sentence boundary and sentence boundary
At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then 
went to the store.
=> ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. 
Smith then went to the store."]

19.) Number as non sentence boundary
She has $100.00 in her bag.
=> ["She has $100.00 in her bag."]

20.) Number as sentence boundary
She has $100.00. It is in her bag.
=> ["She has $100.00.", "It is in her bag."]

21.) Parenthetical inside sentence
He teaches science (He previously worked for 5 years as an engineer.) at the 
local University.
=> ["He teaches science (He previously worked for 5 years as an engineer.) at 
the local University."]

22.) Email addresses
Her email is jane@example.com. I sent her an email.
=> ["Her email is jane@example.com.", "I sent her an email."]

23.) Web addresses
The site is: https://www.example.50.com/new-site/awesome_content.html. Please 
check it out.
=> ["The site is: https://www.example.50.com/new-site/awesome_content.html.", 
"Please check it out."]

24.) Single quotations inside sentence
She turned to him, 'This is great.' she said.
=> ["She turned to him, 'This is great.' she said."]

25.) Double quotations inside sentence
She turned to him, "This is great." she said.
=> ["She turned to him, \"This is great.\" she said."]

26.) Double quotations at the end of a sentence
She turned to him, \"This is great.\" She held the book out to show him.
=> ["She turned to him, \"This is great.\"", &q

Re: OPENNLP-912 : Add a rule based sentence detector

2018-04-06 Thread Joern Kottmann
Hello,

could you elaborate a bit on the approach?

Jörn

On Tue, Apr 3, 2018 at 5:24 PM, Isuranga Perera
 wrote:
> Hi All,
>
> I would like to contribute $subject feature. Appreciate if anyone can guide
> me through the process.
>
> Best Regards
> Isuranga Perera


OPENNLP-912 : Add a rule based sentence detector

2018-04-03 Thread Isuranga Perera
Hi All,

I would like to contribute $subject feature. Appreciate if anyone can guide
me through the process.

Best Regards
Isuranga Perera