[jira] [Commented] (OPENNLP-1590) Clear open TODO in GenericFactoryTest

2024-07-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863516#comment-17863516
 ] 

ASF GitHub Bot commented on OPENNLP-1590:
-

rzo1 commented on code in PR #633:
URL: https://github.com/apache/opennlp/pull/633#discussion_r1667397156


##
opennlp-tools/src/main/java/opennlp/tools/util/featuregen/DictionaryFeatureGeneratorFactory.java:
##
@@ -31,30 +33,39 @@
 public class DictionaryFeatureGeneratorFactory
 extends GeneratorFactory.AbstractXmlFeatureGeneratorFactory {
 
+  private static final String DICT = "dict";
+
   public DictionaryFeatureGeneratorFactory() {
 super();
   }
 
   @Override
   public AdaptiveFeatureGenerator create() throws InvalidFormatException {
-// if resourceManager is null, we don't instantiate
-if (resourceManager == null) {
-  return null;
-}
-
-String dictResourceKey = getStr("dict");
-Object dictResource = resourceManager.getResource(dictResourceKey);
-if (!(dictResource instanceof Dictionary)) {
-  throw new InvalidFormatException("No dictionary resource for key: " + 
dictResourceKey);
+Dictionary dict;
+if (resourceManager == null) { // load the dictionary directly
+  String dictResourcePath = getStr(DICT);
+  ClassLoader cl = Thread.currentThread().getContextClassLoader();
+  try (InputStream is = cl.getResourceAsStream(dictResourcePath)) {

Review Comment:
   `is` can be `null` after this call (if `dictResourcePath` isn't found). I didn't 
look into how create(...) handles that.
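
   A minimal sketch of the concern, not the PR's actual code: `ClassLoader#getResourceAsStream` returns `null` when nothing is found, and try-with-resources accepts a `null` resource without complaint, so the `null` only surfaces later inside whatever consumes the stream. The class name, the helper `openResource`, and the resource path below are hypothetical and only illustrate an explicit guard.

```java
import java.io.IOException;
import java.io.InputStream;

class ResourceNullCheckSketch {

  // Opens a classpath resource and fails with a descriptive IOException
  // instead of handing a null stream to downstream code.
  static InputStream openResource(ClassLoader cl, String path) throws IOException {
    InputStream is = cl.getResourceAsStream(path); // null if the path is not found
    if (is == null) {
      throw new IOException("No resource found on the classpath at: " + path);
    }
    return is;
  }

  public static void main(String[] args) {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    // "does/not/exist.xml" is a made-up path that should not be on the classpath.
    try (InputStream is = openResource(cl, "does/not/exist.xml")) {
      System.out.println("found, first byte: " + is.read());
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

   With a guard like this, the existing `catch (IOException e)` branch in the factory could translate a missing resource into the `InvalidFormatException` it already throws.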



##
opennlp-tools/src/main/java/opennlp/tools/util/featuregen/DictionaryFeatureGeneratorFactory.java:
##
@@ -31,30 +33,39 @@
 public class DictionaryFeatureGeneratorFactory
 extends GeneratorFactory.AbstractXmlFeatureGeneratorFactory {
 
+  private static final String DICT = "dict";
+
   public DictionaryFeatureGeneratorFactory() {
 super();
   }
 
   @Override
   public AdaptiveFeatureGenerator create() throws InvalidFormatException {
-// if resourceManager is null, we don't instantiate
-if (resourceManager == null) {
-  return null;
-}
-
-String dictResourceKey = getStr("dict");
-Object dictResource = resourceManager.getResource(dictResourceKey);
-if (!(dictResource instanceof Dictionary)) {
-  throw new InvalidFormatException("No dictionary resource for key: " + 
dictResourceKey);
+Dictionary dict;
+if (resourceManager == null) { // load the dictionary directly
+  String dictResourcePath = getStr(DICT);
+  ClassLoader cl = Thread.currentThread().getContextClassLoader();
+  try (InputStream is = cl.getResourceAsStream(dictResourcePath)) {
+dict = ((DictionarySerializer) 
getArtifactSerializerMapping().get(dictResourcePath)).create(is);
+  } catch (IOException e) {
+throw new InvalidFormatException("No dictionary resource at: " + 
dictResourcePath, e);
+  }
+} else { // get the dictionary via a resourceManager lookup
+  String dictResourceKey = getStr(DICT);
+  Object dictResource = resourceManager.getResource(dictResourceKey);
+  if (dictResource instanceof Dictionary) {

Review Comment:
   Since we are on Java 17, we can omit the cast in the subsequent line by using 
pattern matching in the instanceof check.
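
   For illustration, a small sketch of the two styles. It uses `java.util.List` as a stand-in for `Dictionary` so that it runs without OpenNLP on the classpath; the class and method names are hypothetical.

```java
import java.util.List;

class PatternMatchingSketch {

  // Older style: type check followed by an explicit cast on the next line.
  static int sizeWithCast(Object resource) {
    if (resource instanceof List) {
      List<?> list = (List<?>) resource;
      return list.size();
    }
    return -1;
  }

  // Pattern matching for instanceof (final since Java 16): the binding
  // variable `list` is only in scope where the check succeeded.
  static int sizeWithPattern(Object resource) {
    if (resource instanceof List<?> list) {
      return list.size();
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(sizeWithCast(List.of("a", "b")));
    System.out.println(sizeWithPattern(List.of("a", "b")));
    System.out.println(sizeWithPattern("not a list"));
  }
}
```

   Applied to the diff above, the check would presumably read `dictResource instanceof Dictionary dict`, though that clashes with the `dict` local declared earlier in the method, so the binding would need a different name.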





> Clear open TODO in GenericFactoryTest
> -
>
> Key: OPENNLP-1590
> URL: https://issues.apache.org/jira/browse/OPENNLP-1590
> Project: OpenNLP
>  Issue Type: Test
>Affects Versions: 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.4.0
>
>
> The test case _testDictionaryArtifactToSerializerMappingExtraction()_ in 
> _GeneratorFactoryTest_ has a TODO note that can be cleared by adjusting the 
> test case slightly, as well as the related inconsistent implementation 
> class(es).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1590) Clear open TODO in GenericFactoryTest

2024-07-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863515#comment-17863515
 ] 

ASF GitHub Bot commented on OPENNLP-1590:
-

mawiesne opened a new pull request, #633:
URL: https://github.com/apache/opennlp/pull/633

   Change
   -
   - adjusts `DictionaryFeatureGeneratorFactory` to handle situations without a 
ResourceManager instance at runtime
   - fixes a missing Exception type in the catch block of 
`GeneratorFactory#buildGenerator`, which wasn't handled correctly
   - clears TODO in GeneratorFactoryTest
   - adds another test case to demonstrate that a descriptively declared 
dictionary is loaded for the creation of a `DictionaryFeatureGeneratorFactory`
   - adds a related test resource
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP- where  is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Clear open TODO in GenericFactoryTest
> -
>
> Key: OPENNLP-1590
> URL: https://issues.apache.org/jira/browse/OPENNLP-1590
> Project: OpenNLP
>  Issue Type: Test
>Affects Versions: 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.4.0
>
>
> The test case _testDictionaryArtifactToSerializerMappingExtraction()_ in 
> _GeneratorFactoryTest_ has a TODO note that can be cleared by adjusting the 
> test case slightly, as well as the related inconsistent implementation 
> class(es).





[jira] [Commented] (OPENNLP-1588) Clean out deprecated code marked for removal

2024-07-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863456#comment-17863456
 ] 

ASF GitHub Bot commented on OPENNLP-1588:
-

mawiesne merged PR #632:
URL: https://github.com/apache/opennlp/pull/632




> Clean out deprecated code marked for removal 
> -
>
> Key: OPENNLP-1588
> URL: https://issues.apache.org/jira/browse/OPENNLP-1588
> Project: OpenNLP
>  Issue Type: Task
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.4.0
>
>
> Some record classes (still) carry convenience getters around, which have been 
> marked for removal for more than a year now. Some other deprecated code fragments 
> can also be removed now, since they have been deprecated much longer and are marked 
> for removal.





[jira] [Commented] (OPENNLP-1588) Clean out deprecated code marked for removal

2024-07-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862884#comment-17862884
 ] 

ASF GitHub Bot commented on OPENNLP-1588:
-

mawiesne opened a new pull request, #632:
URL: https://github.com/apache/opennlp/pull/632

   Change
   -
   - removes deprecated getters in several records
   - removes a long-deprecated constructor of TokenizerME
   - marks a constructor of ADNameSampleStream as "for removal" to indicate 
actual removal is more likely in upcoming releases
   - marks deprecated 'NameFinderEventStream#generateOutcomes(..)' as "for 
removal" to indicate actual removal is more likely in upcoming releases
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP- where  is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Clean out deprecated code marked for removal 
> -
>
> Key: OPENNLP-1588
> URL: https://issues.apache.org/jira/browse/OPENNLP-1588
> Project: OpenNLP
>  Issue Type: Task
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.4.0
>
>
> Some record classes (still) carry convenience getters around, which have been 
> marked for removal for more than a year now. Some other deprecated code fragments 
> can also be removed now, since they have been deprecated much longer and are marked 
> for removal.





[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-07-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862524#comment-17862524
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

rzo1 commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1661109712


##
opennlp-brat-annotator/src/main/bin/brat-annotation-service:
##
@@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.
 
+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+PRG="$link"
+  else
+PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
   Yep, specific to Linux.





> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
> Fix For: 2.3.4
>
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.
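
To make the failure mode concrete, here is a small Java sketch of how the same relative path resolves to different locations depending on the base directory. The install directory `/opt/opennlp` and the project directory `/home/user/project` are hypothetical; only the relative path `../conf/log4j2.xml` comes from the report.

```java
import java.nio.file.Path;

class RelativeConfigPathSketch {

  // Resolves a relative path against a base directory and collapses "..".
  static Path resolveAgainst(Path baseDir, String relative) {
    return baseDir.resolve(relative).normalize();
  }

  public static void main(String[] args) {
    String relative = "../conf/log4j2.xml";

    // Invoked from inside the (hypothetical) bin directory: the path points
    // at the conf directory of the installation.
    System.out.println(resolveAgainst(Path.of("/opt/opennlp/bin"), relative));

    // Invoked from some other working directory: the path points at a
    // location unrelated to the installation.
    System.out.println(resolveAgainst(Path.of("/home/user/project"), relative));
  }
}
```

Anchoring the path to the installation directory rather than the working directory avoids the second case.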





[jira] [Commented] (OPENNLP-1583) Reduce compiler warnings in opennlp-tools

2024-07-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862518#comment-17862518
 ] 

ASF GitHub Bot commented on OPENNLP-1583:
-

rzo1 merged PR #626:
URL: https://github.com/apache/opennlp/pull/626




> Reduce compiler warnings in opennlp-tools
> -
>
> Key: OPENNLP-1583
> URL: https://issues.apache.org/jira/browse/OPENNLP-1583
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.3.4
>
>
> We still have a bunch of compiler warnings. Those should be reduced to and 
> kept at a minimum. 
> Aims: 
> - Get rid of most of them, as best as possible.





[jira] [Commented] (OPENNLP-1584) FeatureGeneratorUtil shall detect German umlauts with dot as 'cp'

2024-07-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862521#comment-17862521
 ] 

ASF GitHub Bot commented on OPENNLP-1584:
-

rzo1 merged PR #628:
URL: https://github.com/apache/opennlp/pull/628




> FeatureGeneratorUtil shall detect German umlauts with dot as 'cp'
> -
>
> Key: OPENNLP-1584
> URL: https://issues.apache.org/jira/browse/OPENNLP-1584
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Name Finder
>Affects Versions: 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.3.4
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> German names, such as Änne, Özlem, or Ümit, should be recognized in their 
> abbreviated short form (Ä., Ü., Ö.) by the FeatureGeneratorUtil class. 
> At the moment, recognition fails, as the Pattern "capPeriod" only takes regular, 
> capitalized letters into account. This can be fixed easily.
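
A rough sketch of the kind of fix the description hints at. It assumes the existing pattern has roughly the shape `^[A-Z]\.$`, which this thread does not confirm; the class, constant, and helper names are hypothetical. The umlauts are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class CapPeriodSketch {

  // Assumed shape of the current pattern: one ASCII capital plus a period.
  static final Pattern ASCII_CAP_PERIOD = Pattern.compile("^[A-Z]\\.$");

  // Hypothetical alternative: any Unicode uppercase letter plus a period.
  static final Pattern UNICODE_CAP_PERIOD = Pattern.compile("^\\p{Lu}\\.$");

  static boolean matches(Pattern pattern, String token) {
    return pattern.matcher(token).find();
  }

  public static void main(String[] args) {
    String[] tokens = {"A.", "\u00C4.", "\u00D6.", "\u00DC.", "a."};
    for (String token : tokens) {
      System.out.println(token
          + " ascii=" + matches(ASCII_CAP_PERIOD, token)
          + " unicode=" + matches(UNICODE_CAP_PERIOD, token));
    }
  }
}
```

Note that `\p{Lu}` matches precomposed characters such as U+00C4; an umlaut written as a base letter plus a combining diaeresis would need normalization first.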





[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-07-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862520#comment-17862520
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

rzo1 merged PR #607:
URL: https://github.com/apache/opennlp/pull/607




> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
> Fix For: 2.3.4
>
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.





[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-07-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862519#comment-17862519
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

rzo1 commented on PR #607:
URL: https://github.com/apache/opennlp/pull/607#issuecomment-2200245765

   Thanks @veita 
   
   @kinow I merged it now because it is a valid improvement. We can discuss / 
adjust edge cases with realpath etc., if needed, in another iteration.




> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
> Fix For: 2.3.4
>
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.





[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855902#comment-17855902
 ] 

ASF GitHub Bot commented on OPENNLP-1577:
-

rzo1 merged PR #616:
URL: https://github.com/apache/opennlp/pull/616




> SLF4J 2.0.13
> 
>
> Key: OPENNLP-1577
> URL: https://issues.apache.org/jira/browse/OPENNLP-1577
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855903#comment-17855903
 ] 

ASF GitHub Bot commented on OPENNLP-1578:
-

rzo1 merged PR #621:
URL: https://github.com/apache/opennlp/pull/621




> Jersey 3.1.7 in Brat-Annotator
> --
>
> Key: OPENNLP-1578
> URL: https://issues.apache.org/jira/browse/OPENNLP-1578
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855899#comment-17855899
 ] 

ASF GitHub Bot commented on OPENNLP-1578:
-

dependabot[bot] commented on PR #621:
URL: https://github.com/apache/opennlp/pull/621#issuecomment-2175893444

   Looks like this PR has been edited by someone other than Dependabot. That 
means Dependabot can't rebase it - sorry!
   
   If you're happy for Dependabot to recreate it from scratch, overwriting any 
edits, you can request `@dependabot recreate`.
   




> Jersey 3.1.7 in Brat-Annotator
> --
>
> Key: OPENNLP-1578
> URL: https://issues.apache.org/jira/browse/OPENNLP-1578
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855898#comment-17855898
 ] 

ASF GitHub Bot commented on OPENNLP-1578:
-

rzo1 commented on PR #621:
URL: https://github.com/apache/opennlp/pull/621#issuecomment-2175893335

   @dependabot rebase




> Jersey 3.1.7 in Brat-Annotator
> --
>
> Key: OPENNLP-1578
> URL: https://issues.apache.org/jira/browse/OPENNLP-1578
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855897#comment-17855897
 ] 

ASF GitHub Bot commented on OPENNLP-1578:
-

rzo1 commented on PR #621:
URL: https://github.com/apache/opennlp/pull/621#issuecomment-2175893253

   The upgrade to the Jakarta namespace should be safe here since it is only used 
for providing a RESTful service within this tool.




> Jersey 3.1.7 in Brat-Annotator
> --
>
> Key: OPENNLP-1578
> URL: https://issues.apache.org/jira/browse/OPENNLP-1578
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855896#comment-17855896
 ] 

ASF GitHub Bot commented on OPENNLP-1577:
-

rzo1 commented on PR #616:
URL: https://github.com/apache/opennlp/pull/616#issuecomment-2175891974

   Note: SLF4J2 is now far more common than it was a few months ago. It should be 
safe to upgrade here.




> SLF4J 2.0.13
> 
>
> Key: OPENNLP-1577
> URL: https://issues.apache.org/jira/browse/OPENNLP-1577
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855894#comment-17855894
 ] 

ASF GitHub Bot commented on OPENNLP-1577:
-

rzo1 commented on PR #616:
URL: https://github.com/apache/opennlp/pull/616#issuecomment-2175891354

   @dependabot rebase




> SLF4J 2.0.13
> 
>
> Key: OPENNLP-1577
> URL: https://issues.apache.org/jira/browse/OPENNLP-1577
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855895#comment-17855895
 ] 

ASF GitHub Bot commented on OPENNLP-1577:
-

dependabot[bot] commented on PR #616:
URL: https://github.com/apache/opennlp/pull/616#issuecomment-2175891449

   Looks like this PR has been edited by someone other than Dependabot. That 
means Dependabot can't rebase it - sorry!
   
   If you're happy for Dependabot to recreate it from scratch, overwriting any 
edits, you can request `@dependabot recreate`.
   




> SLF4J 2.0.13
> 
>
> Key: OPENNLP-1577
> URL: https://issues.apache.org/jira/browse/OPENNLP-1577
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1573) Forbidden APIS 3.7

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855892#comment-17855892
 ] 

ASF GitHub Bot commented on OPENNLP-1573:
-

rzo1 merged PR #615:
URL: https://github.com/apache/opennlp/pull/615




> Forbidden APIS 3.7
> --
>
> Key: OPENNLP-1573
> URL: https://issues.apache.org/jira/browse/OPENNLP-1573
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1570) ONNX Runtime 1.18.0

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855893#comment-17855893
 ] 

ASF GitHub Bot commented on OPENNLP-1570:
-

rzo1 merged PR #614:
URL: https://github.com/apache/opennlp/pull/614




> ONNX Runtime 1.18.0
> ---
>
> Key: OPENNLP-1570
> URL: https://issues.apache.org/jira/browse/OPENNLP-1570
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1576) Update build related dependencies (Checkstyle, Docbkx, Build Helper)

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855890#comment-17855890
 ] 

ASF GitHub Bot commented on OPENNLP-1576:
-

rzo1 merged PR #620:
URL: https://github.com/apache/opennlp/pull/620




> Update build related dependencies (Checkstyle, Docbkx, Build Helper)
> 
>
> Key: OPENNLP-1576
> URL: https://issues.apache.org/jira/browse/OPENNLP-1576
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1576) Update build related dependencies (Checkstyle, Docbkx, Build Helper)

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855889#comment-17855889
 ] 

ASF GitHub Bot commented on OPENNLP-1576:
-

rzo1 merged PR #619:
URL: https://github.com/apache/opennlp/pull/619




> Update build related dependencies (Checkstyle, Docbkx, Build Helper)
> 
>
> Key: OPENNLP-1576
> URL: https://issues.apache.org/jira/browse/OPENNLP-1576
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1574) Jacoco 0.8.12

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855891#comment-17855891
 ] 

ASF GitHub Bot commented on OPENNLP-1574:
-

rzo1 merged PR #618:
URL: https://github.com/apache/opennlp/pull/618




> Jacoco 0.8.12
> -
>
> Key: OPENNLP-1574
> URL: https://issues.apache.org/jira/browse/OPENNLP-1574
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1572) JUnit 5.10.2

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855886#comment-17855886
 ] 

ASF GitHub Bot commented on OPENNLP-1572:
-

rzo1 merged PR #612:
URL: https://github.com/apache/opennlp/pull/612




> JUnit 5.10.2
> 
>
> Key: OPENNLP-1572
> URL: https://issues.apache.org/jira/browse/OPENNLP-1572
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1576) Update build related dependencies (Checkstyle, Docbkx, Build Helper)

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855888#comment-17855888
 ] 

ASF GitHub Bot commented on OPENNLP-1576:
-

rzo1 merged PR #617:
URL: https://github.com/apache/opennlp/pull/617




> Update build related dependencies (Checkstyle, Docbkx, Build Helper)
> 
>
> Key: OPENNLP-1576
> URL: https://issues.apache.org/jira/browse/OPENNLP-1576
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1571) Jackson 2.17.1

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855887#comment-17855887
 ] 

ASF GitHub Bot commented on OPENNLP-1571:
-

rzo1 merged PR #613:
URL: https://github.com/apache/opennlp/pull/613




> Jackson 2.17.1
> --
>
> Key: OPENNLP-1571
> URL: https://issues.apache.org/jira/browse/OPENNLP-1571
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>






[jira] [Commented] (OPENNLP-1575) Update GH action versions

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855885#comment-17855885
 ] 

ASF GitHub Bot commented on OPENNLP-1575:
-

rzo1 merged PR #610:
URL: https://github.com/apache/opennlp/pull/610




> Update GH action versions
> -
>
> Key: OPENNLP-1575
> URL: https://issues.apache.org/jira/browse/OPENNLP-1575
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> [https://github.com/apache/opennlp/pull/609]
> [https://github.com/apache/opennlp/pull/610]
> https://github.com/apache/opennlp/pull/611
>  





[jira] [Commented] (OPENNLP-1575) Update GH action versions

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855884#comment-17855884
 ] 

ASF GitHub Bot commented on OPENNLP-1575:
-

rzo1 merged PR #611:
URL: https://github.com/apache/opennlp/pull/611




> Update GH action versions
> -
>
> Key: OPENNLP-1575
> URL: https://issues.apache.org/jira/browse/OPENNLP-1575
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> [https://github.com/apache/opennlp/pull/609]
> [https://github.com/apache/opennlp/pull/610]
> https://github.com/apache/opennlp/pull/611
>  





[jira] [Commented] (OPENNLP-1575) Update GH action versions

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855881#comment-17855881
 ] 

ASF GitHub Bot commented on OPENNLP-1575:
-

rzo1 merged PR #609:
URL: https://github.com/apache/opennlp/pull/609




> Update GH action versions
> -
>
> Key: OPENNLP-1575
> URL: https://issues.apache.org/jira/browse/OPENNLP-1575
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> [https://github.com/apache/opennlp/pull/609]
> [https://github.com/apache/opennlp/pull/610]
> https://github.com/apache/opennlp/pull/611
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1575) Update GH action versions

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855882#comment-17855882
 ] 

ASF GitHub Bot commented on OPENNLP-1575:
-

rzo1 commented on PR #610:
URL: https://github.com/apache/opennlp/pull/610#issuecomment-2175876905

   @dependabot rebase




> Update GH action versions
> -
>
> Key: OPENNLP-1575
> URL: https://issues.apache.org/jira/browse/OPENNLP-1575
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> [https://github.com/apache/opennlp/pull/609]
> [https://github.com/apache/opennlp/pull/610]
> https://github.com/apache/opennlp/pull/611
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1575) Update GH action versions

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855883#comment-17855883
 ] 

ASF GitHub Bot commented on OPENNLP-1575:
-

rzo1 commented on PR #611:
URL: https://github.com/apache/opennlp/pull/611#issuecomment-2175876989

   @dependabot rebase




> Update GH action versions
> -
>
> Key: OPENNLP-1575
> URL: https://issues.apache.org/jira/browse/OPENNLP-1575
> Project: OpenNLP
>  Issue Type: Dependency upgrade
>Reporter: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> [https://github.com/apache/opennlp/pull/609]
> [https://github.com/apache/opennlp/pull/610]
> https://github.com/apache/opennlp/pull/611
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1569) Enable Dependabot for OpenNLP

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855838#comment-17855838
 ] 

ASF GitHub Bot commented on OPENNLP-1569:
-

rzo1 merged PR #608:
URL: https://github.com/apache/opennlp/pull/608




> Enable Dependabot for OpenNLP
> -
>
> Key: OPENNLP-1569
> URL: https://issues.apache.org/jira/browse/OPENNLP-1569
> Project: OpenNLP
>  Issue Type: Task
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> We still need to create JIRAs for update but it will help us to avoid missing 
> something important



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1566) Array writing error in code example

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855811#comment-17855811
 ] 

ASF GitHub Bot commented on OPENNLP-1566:
-

mawiesne merged PR #605:
URL: https://github.com/apache/opennlp/pull/605




> Array writing error in code example
> ---
>
> Key: OPENNLP-1566
> URL: https://issues.apache.org/jira/browse/OPENNLP-1566
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.3
>Reporter: shellrean
>Priority: Minor
> Fix For: 2.3.4
>
> Attachments: Screenshot 2024-06-10 at 23.31.20.png
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> There was an error in writing the array symbol in the documentation, it 
> should be String[] but what is displayed here is String variable[].
> !Screenshot 2024-06-10 at 23.31.20.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1569) Enable Dependabot for OpenNLP

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855807#comment-17855807
 ] 

ASF GitHub Bot commented on OPENNLP-1569:
-

rzo1 opened a new pull request, #608:
URL: https://github.com/apache/opennlp/pull/608

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [ ] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   
   We still need to create JIRA tickets for updates, but it will help us to 
avoid missing something important.




> Enable Dependabot for OpenNLP
> -
>
> Key: OPENNLP-1569
> URL: https://issues.apache.org/jira/browse/OPENNLP-1569
> Project: OpenNLP
>  Issue Type: Task
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> We still need to create JIRAs for update but it will help us to avoid missing 
> something important



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855805#comment-17855805
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 merged PR #606:
URL: https://github.com/apache/opennlp/pull/606




> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855597#comment-17855597
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

jzonthemtn commented on PR #606:
URL: https://github.com/apache/opennlp/pull/606#issuecomment-2173208163

   Built and tested fine on Windows 11.
   
   ```
   Apache Maven 3.9.8 (36645f6c9b5079805ea5009217e36f2cffd34256)
   Maven home: C:\Program Files\apache-maven-3.9.8
   Java version: 21.0.1, vendor: Amazon.com Inc., runtime: C:\Program Files\Amazon Corretto\jdk21.0.1_12
   Default locale: en_US, platform encoding: UTF-8
   OS name: "windows 11", version: "10.0", arch: "amd64", family: "windows"
   ```




> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855515#comment-17855515
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on PR #606:
URL: https://github.com/apache/opennlp/pull/606#issuecomment-2172501124

   > Tests on a local Windows 10 are ok now (via IDE + Maven). CI is still 
complaining, so requires some more digging in that area ;-)
   
   This is fixed now. Removing DRAFT state.




> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855502#comment-17855502
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on PR #606:
URL: https://github.com/apache/opennlp/pull/606#issuecomment-2172443863

   Tests on a local Windows 10 are ok now (via IDE + Maven). CI is still 
complaining, so requires some more digging in that area ;-)




> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-06-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855447#comment-17855447
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

kinow commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1641987177


##
opennlp-brat-annotator/src/main/bin/brat-annotation-service:
##
@@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.
 
+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
   Thank you @veita 
   
   I asked as I read the code and from a quick read it seemed to have a regex 
that could lead to error in a certain scenario. You mentioned it's well-proven, 
but do you know if it was tested (e.g. with bats, or another unit test, 
covering several cases), or has it just been around for a long time and never 
raised any issues?
   
   It's late here, so my logic might be wrong, sorry. Here's what I tried.
   
   ```bash
   kinow@ranma:/tmp$ mkdir /tmp/opennlp-1568
   kinow@ranma:/tmp$ cd /tmp/opennlp-1568
   kinow@ranma:/tmp/opennlp-1568$ cat test.sh
   #!/bin/bash
   
   PRG="$0"
   
   while [ -h "$PRG" ] ; do
     ls=`ls -ld "$PRG"`
     link=`expr "$ls" : '.*-> \(.*\)$'`
     if expr "$link" : '/.*' > /dev/null; then
       PRG="$link"
     else
       PRG="`dirname "$PRG"`/$link"
     fi
   done
   
   echo "${PRG}"
   kinow@ranma:/tmp/opennlp-1568$ # not a symlink, all good...
   kinow@ranma:/tmp/opennlp-1568$ bash test.sh
   test.sh
   ```
   
   Then if I create a symlink `/tmp/test/opennlp.sh`, it works, and the correct 
path is resolved, same as `realpath`.
   
   ```bash
   kinow@ranma:/tmp/opennlp-1568$ cd ../
   kinow@ranma:/tmp$ mkdir test
   kinow@ranma:/tmp$ cd test
   kinow@ranma:/tmp/test$ ln -s /tmp/opennlp-1568/test.sh opennlp.sh
   kinow@ranma:/tmp/test$ bash opennlp.sh 
   /tmp/opennlp-1568/test.sh
   ```
   
   But if the file is not in `/tmp/opennlp-1568`, but rather in some folder 
like `/tmp/opennlp -> tests/`:
   
   ```bash
   kinow@ranma:/tmp$ mkdir '/tmp/opennlp -> tests/'
   kinow@ranma:/tmp$ cd '/tmp/opennlp -> tests/'
   kinow@ranma:/tmp/opennlp -> tests$ cp ../opennlp-1568/test.sh .
   kinow@ranma:/tmp/opennlp -> tests$ cd /tmp/test/
   kinow@ranma:/tmp/test$ ln -s '/tmp/opennlp -> tests/test.sh' opennlp2.sh
   kinow@ranma:/tmp/test$ bash opennlp2.sh 
   ./tests/test.sh
   ```
   
   I believe it's due to the regex in the `expr` (but again, late here, came 
just to check the score of a soccer :soccer:  match and saw the GH 
notification). Here's the `-eux` output.
   
   ```bash
   kinow@ranma:/tmp/test$ bash -eux opennlp2.sh 
   + PRG=opennlp2.sh
   + '[' -h opennlp2.sh ']'
   ++ ls -ld opennlp2.sh
   + ls='lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh'
   ++ expr 'lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh' : '.*-> \(.*\)$'
   + link=tests/test.sh
   + expr tests/test.sh : '/.*'
   ++ dirname opennlp2.sh
   + PRG=./tests/test.sh
   + '[' -h ./tests/test.sh ']'
   + echo ./tests/test.sh
   ./tests/test.sh
   ```
   
   And `realpath`:
   
   ```bash
   kinow@ranma:/tmp/test$ realpath opennlp.sh
   /tmp/opennlp-1568/test.sh
   kinow@ranma:/tmp/test$ realpath opennlp2.sh
   /tmp/opennlp -> tests/test.sh
   ```
   
   We do not need to use `realpath`. I was honestly just curious about this. We 
don't even need to fix the regex if you/others agree users are not likely to 
create paths with ` -> ` :+1: Other than that, the changes look OK to me (:soccer: 
:wave: )
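
To make the greedy match concrete, here is a minimal sketch that replays the `ls -ld` line from the trace above through the PR's `expr` pattern and through one possible alternative. The helper names `greedy_target` and `anchored_target` are made up for illustration and are not part of the PR; the alternative assumes the link name passed to `ls` is known and strips everything up to the first `<name> -> `.

```shell
# Sample `ls -ld` line copied from the trace above.
ls_line='lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh'

# Pattern used in the PR: the leading `.*` is greedy, so it consumes
# everything up to the LAST "-> " on the line.
greedy_target() {
  expr "$1" : '.*-> \(.*\)$'
}

# Hypothetical alternative: remove the shortest prefix that ends in
# "<name> -> ", where $2 is the name that was passed to `ls -ld`.
# The quotes inside the expansion make the name match literally.
anchored_target() {
  printf '%s\n' "${1#*"$2 -> "}"
}

greedy_target "$ls_line"
anchored_target "$ls_line" "opennlp2.sh"
```

The first call prints `tests/test.sh`, as in the trace; the second prints `/tmp/opennlp -> tests/test.sh`. This has only been checked against the sample line, so treat it as a starting point rather than a drop-in replacement.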





> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.
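
One way to avoid the relative path, sketched here under assumptions: derive the installation directory from the launcher's location and hand Log4j an absolute path. The function names `opennlp_home_of` and `log4j_flag_for` are hypothetical, and the layout (`bin/` next to `conf/`) is inferred from the `../conf/log4j2.xml` path above; this is not necessarily what PR #607 does.

```shell
# Hypothetical helper: $1 is the launcher's path (e.g. /opt/opennlp/bin/opennlp),
# ideally after any symlinks have been resolved. The subshell keeps the
# caller's working directory unchanged.
opennlp_home_of() {
  ( cd "$(dirname "$1")/.." >/dev/null 2>&1 && pwd )
}

# Build the JVM flag with an absolute path instead of ../conf/log4j2.xml.
# `--` stops printf from treating the leading "-D" as an option.
log4j_flag_for() {
  printf -- '-Dlog4j.configurationFile=%s/conf/log4j2.xml\n' "$(opennlp_home_of "$1")"
}
```

A launcher could then call `log4j_flag_for "$PRG"` and add the result to the `java` command line, so the configuration is found regardless of the directory the command is invoked from.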



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-06-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855446#comment-17855446
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

kinow commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1641987177


##
opennlp-brat-annotator/src/main/bin/brat-annotation-service:
##
@@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.
 
+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
   Thank you @veita 
   
   I asked as I read the code and from a quick read it seemed to have a regex 
that could lead to error in a certain scenario. You mentioned it's well-proven, 
but do you know if it was tested (e.g. with bats, or another unit test, 
covering several cases), or has it just been around for a long time and never 
raised any issues?
   
   It's late here, so my logic might be wrong, sorry. Here's what I tried.
   
   ```bash
   kinow@ranma:/tmp$ mkdir /tmp/opennlp-1568
   kinow@ranma:/tmp$ cd /tmp/opennlp-1568
   kinow@ranma:/tmp/opennlp-1568$ cat test.sh
   #!/bin/bash
   
   PRG="$0"
   
   while [ -h "$PRG" ] ; do
     ls=`ls -ld "$PRG"`
     link=`expr "$ls" : '.*-> \(.*\)$'`
     if expr "$link" : '/.*' > /dev/null; then
       PRG="$link"
     else
       PRG="`dirname "$PRG"`/$link"
     fi
   done
   
   echo "${PRG}"
   kinow@ranma:/tmp/opennlp-1568$ # not a symlink, all good...
   kinow@ranma:/tmp/opennlp-1568$ bash test.sh
   test.sh
   ```
   
   Then if I create a symlink `/tmp/test/opennlp.sh`, it works, and the correct 
path is resolved, same as `realpath`.
   
   ```bash
   kinow@ranma:/tmp/opennlp-1568$ cd ../
   kinow@ranma:/tmp$ mkdir test
   kinow@ranma:/tmp$ cd test
   kinow@ranma:/tmp/test$ ln -s /tmp/opennlp-1568/test.sh opennlp.sh
   kinow@ranma:/tmp/test$ bash opennlp.sh 
   /tmp/opennlp-1568/test.sh
   ```
   
   But if the file is not in `/tmp/opennlp-1568`, but rather in some folder 
like `/tmp/opennlp -> tests/`:
   
   ```bash
   kinow@ranma:/tmp$ mkdir '/tmp/opennlp -> tests/'
   kinow@ranma:/tmp$ cd '/tmp/opennlp -> tests/'
   kinow@ranma:/tmp/opennlp -> tests$ cp ../opennlp-1568/test.sh .
   kinow@ranma:/tmp/opennlp -> tests$ cd /tmp/test/
   kinow@ranma:/tmp/test$ ln -s '/tmp/opennlp -> tests/test.sh' opennlp2.sh
   kinow@ranma:/tmp/test$ bash opennlp2.sh 
   ./tests/test.sh
   ```
   
   I believe it's due to the regex in the `expr` (but again, late here, came 
just to check the score of a soccer match and saw the GH notification). Here's 
the `-eux` output.
   
   ```bash
   kinow@ranma:/tmp/test$ bash -eux opennlp2.sh 
   + PRG=opennlp2.sh
   + '[' -h opennlp2.sh ']'
   ++ ls -ld opennlp2.sh
   + ls='lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh'
   ++ expr 'lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh' : '.*-> \(.*\)$'
   + link=tests/test.sh
   + expr tests/test.sh : '/.*'
   ++ dirname opennlp2.sh
   + PRG=./tests/test.sh
   + '[' -h ./tests/test.sh ']'
   + echo ./tests/test.sh
   ./tests/test.sh
   ```
   
   And `realpath`:
   
   ```bash
   kinow@ranma:/tmp/test$ realpath opennlp.sh
   /tmp/opennlp-1568/test.sh
   kinow@ranma:/tmp/test$ realpath opennlp2.sh
   /tmp/opennlp -> tests/test.sh
   ```
   
   We do not need to use `realpath`. I was honestly just curious about this. We 
don't even need to fix the regex if you/others agree users are not likely to 
create paths with ` -> ` :+1: Other than that, the changes look OK to me (:soccer: 
:wave: )





> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-06-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855445#comment-17855445
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

kinow commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1641987177


##
opennlp-brat-annotator/src/main/bin/brat-annotation-service:
##
@@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.
 
+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
   Thank you @veita 
   
   I asked as I read the code and from a quick read it seemed to have a greedy 
regex that could lead to error in a certain scenario. You mentioned it's 
well-proven, but do you know if it was tested (e.g. with bats, or another unit 
test, covering several cases), or has it just been around for a long time and 
never raised any issues?
   
   It's late here, so my logic might be wrong, sorry. Here's what I tried.
   
   ```bash
   kinow@ranma:/tmp$ mkdir /tmp/opennlp-1568
   kinow@ranma:/tmp$ cd /tmp/opennlp-1568
   kinow@ranma:/tmp/opennlp-1568$ cat test.sh
   #!/bin/bash
   
   PRG="$0"
   
   while [ -h "$PRG" ] ; do
     ls=`ls -ld "$PRG"`
     link=`expr "$ls" : '.*-> \(.*\)$'`
     if expr "$link" : '/.*' > /dev/null; then
       PRG="$link"
     else
       PRG="`dirname "$PRG"`/$link"
     fi
   done
   
   echo "${PRG}"
   kinow@ranma:/tmp/opennlp-1568$ # not a symlink, all good...
   kinow@ranma:/tmp/opennlp-1568$ bash test.sh
   test.sh
   ```
   
   Then if I create a symlink `/tmp/test/opennlp.sh`, it works, and the correct 
path is resolved, same as `realpath`.
   
   ```bash
   kinow@ranma:/tmp/opennlp-1568$ cd ../
   kinow@ranma:/tmp$ mkdir test
   kinow@ranma:/tmp$ cd test
   kinow@ranma:/tmp/test$ ln -s /tmp/opennlp-1568/test.sh opennlp.sh
   kinow@ranma:/tmp/test$ bash opennlp.sh 
   /tmp/opennlp-1568/test.sh
   ```
   
   But if the file is not in `/tmp/opennlp-1568`, but rather in some folder 
like `/tmp/opennlp -> tests/`:
   
   ```bash
   kinow@ranma:/tmp$ mkdir '/tmp/opennlp -> tests/'
   kinow@ranma:/tmp$ cd '/tmp/opennlp -> tests/'
   kinow@ranma:/tmp/opennlp -> tests$ cp ../opennlp-1568/test.sh .
   kinow@ranma:/tmp/opennlp -> tests$ cd /tmp/test/
   kinow@ranma:/tmp/test$ ln -s '/tmp/opennlp -> tests/test.sh' opennlp2.sh
   kinow@ranma:/tmp/test$ bash opennlp2.sh 
   ./tests/test.sh
   ```
   
   I believe it's due to the regex in the `expr` (but again, late here, came 
just to check the score of a soccer match and saw the GH notification). Here's 
the `-eux` output.
   
   ```bash
   kinow@ranma:/tmp/test$ bash -eux opennlp2.sh 
   + PRG=opennlp2.sh
   + '[' -h opennlp2.sh ']'
   ++ ls -ld opennlp2.sh
   + ls='lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh'
   ++ expr 'lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh' : '.*-> \(.*\)$'
   + link=tests/test.sh
   + expr tests/test.sh : '/.*'
   ++ dirname opennlp2.sh
   + PRG=./tests/test.sh
   + '[' -h ./tests/test.sh ']'
   + echo ./tests/test.sh
   ./tests/test.sh
   ```
   
   And `realpath`:
   
   ```bash
   kinow@ranma:/tmp/test$ realpath opennlp.sh
   /tmp/opennlp-1568/test.sh
   kinow@ranma:/tmp/test$ realpath opennlp2.sh
   /tmp/opennlp -> tests/test.sh
   ```
   
   We do not need to use `realpath`. I was honestly just curious about this. We 
don't even need to fix the regex if you/others agree users are not likely to 
create paths with ` -> ` :+1: Other than that, the changes look OK to me (:soccer: 
:wave: )





> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-06-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855432#comment-17855432
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

veita commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1641947136


##
opennlp-brat-annotator/src/main/bin/brat-annotation-service:
##
@@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.
 
+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
   This is portable and well-proven code also used by other Apache projects 
like Ant, Maven or Groovy. `realpath` and `pushd` are not portable.





> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-06-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855429#comment-17855429
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

kinow commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1641918499


##
opennlp-brat-annotator/src/main/bin/brat-annotation-service:
##
@@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.
 
+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
   I think `realpath` has been available in Debian/Ubuntu for a while (from GNU 
coreutils, I think). And it should be available on macOS too (at least I 
remember using it on an old Intel Mac).
   
   I think the code above is doing something similar to `realpath`, but using 
`expr` to parse the output of the commands with regex, or use `dirname` 
(although I am not sure if that works if the regex failed but the file is still 
a symlink?).
   
   Is there a reason for not using `realpath` here? Or something else like 
`pushd $PRG ; PRG=pwd -P; popd`  (or in a subshell, etc.)?
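
For comparison, a minimal sketch of the subshell idea mentioned above, assuming a shell whose `pwd` supports `-P` (the option is specified by POSIX). The helper name `physical_dir_of` is made up for illustration. Note the limitation: `cd` plus `pwd -P` resolves symlinked directories in the path, but when the script file itself is a symlink it reports the directory that contains the link, not the directory of the link target, so on its own it would not replace the loop.

```shell
# Hypothetical helper: print the physical (symlink-free) directory that
# contains the path given as $1. Running `cd` inside a subshell leaves the
# caller's working directory untouched, so no pushd/popd is needed.
physical_dir_of() {
  ( cd "$(dirname "$1")" >/dev/null 2>&1 && pwd -P )
}
```

A script could use it as `PRG_DIR=$(physical_dir_of "$PRG")` once `$PRG` no longer points at a symlinked file.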





> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory

2024-06-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855402#comment-17855402
 ] 

ASF GitHub Bot commented on OPENNLP-1568:
-

veita opened a new pull request, #607:
URL: https://github.com/apache/opennlp/pull/607

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x ] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x ] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [ x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x ] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [ x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Command Line Interface
>Affects Versions: 2.3.3
> Environment: Linux/Bash
>Reporter: Alexander Veit
>Priority: Major
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>  
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
>  No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>  
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855133#comment-17855133
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on PR #606:
URL: https://github.com/apache/opennlp/pull/606#issuecomment-2168662142

   Since I don't have a Windows system anymore, this PR needs someone with a 
Windows machine to check the implementation of `SimpleClassPathModelFinder` and 
the related tests ...  ;-)




> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855132#comment-17855132
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1640265283


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Now contains a simple classpath scanning implementation, which doesn't cover 
edge cases but makes the unit tests happy.





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855019#comment-17855019
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639610514


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Switching to draft and will implement a service without classgraph. Once I 
am done, I will ping here again.





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855007#comment-17855007
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639557402


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Yep. Let's change within this PR.





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855000#comment-17855000
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

mawiesne commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639520644


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Providing a "... loader/scan alternative which does not cover all edge cases 
and declare classgraph as an _optional_ dependency in this module."
   
   This is the way. Leaves the choice to the end-user and/or project on top of 
OpenNLP(-models).
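
   A minimal sketch of how the choice could be made at runtime when `classgraph` is declared as an optional dependency. The class name `OptionalDependencyCheck` and the labels `"classgraph"` / `"simple"` are hypothetical and only for illustration; the only name taken from the PR is the `io.github.classgraph.ClassGraph` import shown in the diff above.

```java
/** Hypothetical helper for illustration; not part of OpenNLP. */
final class OptionalDependencyCheck {

  /** Fully qualified class name taken from the import in the diff above. */
  static final String CLASSGRAPH_CLASS = "io.github.classgraph.ClassGraph";

  /** Returns true if the class can be located by the loader, without initializing it. */
  static boolean isPresent(String className, ClassLoader cl) {
    try {
      Class.forName(className, false, cl);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Picks a (hypothetical) finder label depending on whether classgraph is available. */
  static String chooseFinder() {
    ClassLoader cl = OptionalDependencyCheck.class.getClassLoader();
    return isPresent(CLASSGRAPH_CLASS, cl) ? "classgraph" : "simple";
  }

  public static void main(String[] args) {
    System.out.println("Selected finder: " + chooseFinder());
  }
}
```

   With this kind of check, a project that does not add `classgraph` would fall back to the simple scanner, while one that adds it gets the more complete implementation.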





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854999#comment-17854999
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

mawiesne commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639520644


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Providing a "... loader/scan alternative which does not cover all edge cases 
and declare classgraph as an _optional_ dependency in this module."
   
   That is the way. Leaves the choice to the end-user and/or project on top of 
OpenNLP(-models).





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854996#comment-17854996
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639507441


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Relying on the `java.class.path` property or trying to cast to 
`URLClassLoader` (on the current thread context class loader) is also error 
prone, as not all vendors / classloaders extend it anymore or use the 
`Class-Path:` property in manifests for the Java 9+ module system.
   
   Implementing it with pure JDK classes from scratch within OpenNLP and 
covering all edge cases (like it is done in 
[Classgraph](https://stackoverflow.com/a/31785767)) might be cumbersome and 
doesn't add much value just for the sake of reducing transitive libs here. 
   
   What we could do as a compromise is to implement (in an additional PR) some 
sort of hacky loader/scan alternative which does not cover all edge cases and 
declare `classgraph` as an **optional** dependency in this module. People who 
don't want to use `classgraph` can use the hacky alternative (if it covers 
their use-case) via `ucp` / reflections. People who need a more sophisticated 
solution can add `classgraph` as a dependency on their end and have the 
functionality available in all edge cases and environments (such as servlet 
containers, EE containers, Quarkus / Spring Boot envs, etc.)
   





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854995#comment-17854995
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639507441


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Relying on the `java.class.path` property or trying to cast to 
`URLClassLoader` (on the current thread context class loader) is also error 
prone, as not all vendors / classloaders extend it anymore or use the 
`Class-Path:` property in manifests for the Java 9+ module system.
   
   Implementing it with pure JDK classes from scratch within OpenNLP and 
covering all edge cases (like it is done in 
[Classgraph](https://stackoverflow.com/a/31785767)) might be cumbersome and 
doesn't add much value just for the sake of reducing transitive libs here. 
   
   What we could do as a compromise is to implement (in an additional PR) some 
sort of hacky loader/scan alternative which does not cover all edge cases and 
declare `classgraph` as an **optional** dependency in this module. People who 
don't want to use `classgraph` can use the hacky alternative (if it covers 
their use-case), and people who need a more sophisticated solution can add 
`classgraph` as a dependency on their end and have the functionality available 
in all edge cases and environments (such as servlet containers, EE containers, 
Quarkus / Spring Boot envs, etc.)
   





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854985#comment-17854985
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639490884


##
opennlp-tools-models/src/test/java/opennlp/tools/models/ClassPathModelFinderTest.java:
##
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.util.Set;
+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+
+public class ClassPathModelFinderTest {
+
+  @Test
+  public void testFindOpenNLPModels() {
+final ClasspathModelFinder finder = new ClasspathModelFinder();

Review Comment:
   Reduced code duplication by introducing inheritance; found an additional bug 
and fixed it in the latest commit.





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854978#comment-17854978
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639441325


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Not without hacking the ucp (url) classloader via reflection which is error 
prone depending on the runtime context and might lead to issues in future JDK 
versions. For this reason, I prefer to stay with an external dependency here.
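
   The fragility can be seen with a small probe. This is only a sketch under the assumption that one would want to read the classpath URLs from the context class loader; `ClassLoaderProbeSketch` is a hypothetical name. On JDK 9 and later the default application class loader is typically not a `URLClassLoader`, so the cast-based route yields nothing without reflection.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical probe for illustration; not part of OpenNLP. */
final class ClassLoaderProbeSketch {

  /** True only if the loader publicly exposes its URLs via URLClassLoader. */
  static boolean exposesUrls(ClassLoader cl) {
    return cl instanceof URLClassLoader;
  }

  /** Returns the loader's URLs, or an empty list if they are not accessible without reflection. */
  static List<URL> urlsOrEmpty(ClassLoader cl) {
    if (cl instanceof URLClassLoader) {
      return Arrays.asList(((URLClassLoader) cl).getURLs());
    }
    return Collections.emptyList();
  }

  public static void main(String[] args) {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    System.out.println("Loader: " + (cl == null ? "null" : cl.getClass().getName()));
    System.out.println("Exposes URLs: " + exposesUrls(cl));
    System.out.println("URL count: " + urlsOrEmpty(cl).size());
  }
}
```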
   
   





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854977#comment-17854977
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639473953


##
opennlp-tools-models/pom.xml:
##
@@ -0,0 +1,95 @@
+
+
+
+
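+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <parent>
+        <groupId>org.apache.opennlp</groupId>
+        <artifactId>opennlp</artifactId>
+        <version>2.3.4-SNAPSHOT</version>
+    </parent>
+
+    <artifactId>opennlp-tools-models</artifactId>
+    <packaging>jar</packaging>
+    <name>Apache OpenNLP Tools Models</name>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.opennlp</groupId>
+            <artifactId>opennlp-tools</artifactId>
+            <version>${project.version}</version>
+            <scope>provided</scope>
+        </dependency>
+
+        <dependency>
+            <groupId>io.github.classgraph</groupId>
+            <artifactId>classgraph</artifactId>
+            <version>${classgraph.version}</version>
+        </dependency>
+
+        <dependency>
+            <groupId>org.slf4j</groupId>
+            <artifactId>slf4j-api</artifactId>
+        </dependency>
+
+        <dependency>
+            <groupId>org.junit.jupiter</groupId>
+            <artifactId>junit-jupiter-api</artifactId>
+            <scope>test</scope>
+        </dependency>
+
+        <dependency>
+            <groupId>org.junit.jupiter</groupId>
+            <artifactId>junit-jupiter-engine</artifactId>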

Review Comment:
   engine is needed, params can be removed.





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854968#comment-17854968
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639453973


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;
+import io.github.classgraph.ResourceList;
+import io.github.classgraph.ScanResult;
+
+
+/**
+ * Enables the detection of OpenNLP models in the classpath.
+ */
+public class ClasspathModelFinder {
+
+  private static final String OPENNLP_MODEL_JAR_PREFIX = 
"opennlp-models-*.jar";
+  private final String jarModelPrefix;
+  private Set<ClassPathModelEntry> models;
+
+  /**
+   * By default, it scans for "opennlp-models-*.jar".
+   */
+  public ClasspathModelFinder() {
+this(OPENNLP_MODEL_JAR_PREFIX);
+  }
+
+  /**
+   * @param modelJarPrefix The leafnames of the jars that should be scanned 
(e.g. "opennlp.jar").
+   *  May contain a wildcard glob ("opennlp-*.jar"). It 
must not be {@code null}.
+   */
+  public ClasspathModelFinder(String modelJarPrefix) {
+Objects.requireNonNull(modelJarPrefix, "modelJarPrefix must not be null");
+this.jarModelPrefix = modelJarPrefix;
+  }
+
+  /**
+   * Finds OpenNLP models within the classpath.
+   *
+   * @param reloadCache {@code true}, if the internal cache should explicitly 
be reloaded
+   * @return A Set of {@link ClassPathModelEntry ClassPathModelEntries}. It 
might be empty.
+   */
+  public Set<ClassPathModelEntry> findModels(boolean reloadCache) {
+
+if (this.models == null || reloadCache) {
+  try (ScanResult sr = new 
ClassGraph().acceptJars(jarModelPrefix).disableDirScanning().scan()) {

Review Comment:
   This is discussed in the comment above. Resolving this extra conversation.





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854967#comment-17854967
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639441325


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   Not without hacking the ucp (url) classloader via reflection which is error 
prone depending on the runtime context and might lead to issues in future JDK 
versions. 
   
   





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says





[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854961#comment-17854961
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

mawiesne commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1639398105


##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;

Review Comment:
   I'm not against classgraph as a library. 
   
   Nevertheless: is there an easy way to scan/look for model files in the 
classpath _without_ introducing this 3rd party dependency? Aka: Can we realize 
it with plain JDK classes?
   
   wdyt @jzonthemtn @kinow @rzo1 ?
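
   For reference, one plain-JDK option is `ClassLoader.getResources(...)`, sketched below. It only works when the exact resource name is known up front and cannot glob over jar names, which is part of what makes a dependency-free scan harder. The class name `ResourceLookupSketch` and the resource name `model.properties` are assumptions for illustration, not taken from this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch for illustration; not part of OpenNLP. */
final class ResourceLookupSketch {

  /** Returns every classpath location that contains a resource with exactly this name. */
  static List<URL> locate(ClassLoader cl, String resourceName) {
    try {
      return Collections.list(cl.getResources(resourceName));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    if (cl == null) {
      cl = ClassLoader.getSystemClassLoader();
    }
    // "model.properties" is an assumed resource name, used only as an example.
    for (URL url : locate(cl, "model.properties")) {
      System.out.println(url);
    }
  }
}
```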



##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.models;
+
+import java.net.URI;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import io.github.classgraph.ClassGraph;
+import io.github.classgraph.ResourceList;
+import io.github.classgraph.ScanResult;
+
+
+/**
+ * Enables the detection of OpenNLP models in the classpath.
+ */
+public class ClasspathModelFinder {
+
+  private static final String OPENNLP_MODEL_JAR_PREFIX = "opennlp-models-*.jar";
+  private final String jarModelPrefix;
+  private Set<ClassPathModelEntry> models;
+
+  /**
+   * By default, it scans for "opennlp-models-*.jar".
+   */
+  public ClasspathModelFinder() {
+    this(OPENNLP_MODEL_JAR_PREFIX);
+  }
+
+  /**
+   * @param modelJarPrefix The leafnames of the jars that should be scanned (e.g. "opennlp.jar").
+   *                       May contain a wildcard glob ("opennlp-*.jar"). It must not be {@code null}.
+   */
+  public ClasspathModelFinder(String modelJarPrefix) {
+    Objects.requireNonNull(modelJarPrefix, "modelJarPrefix must not be null");
+    this.jarModelPrefix = modelJarPrefix;
+  }
+
+  /**
+   * Finds OpenNLP models within the classpath.
+   *
+   * @param reloadCache {@code true}, if the internal cache should explicitly be reloaded
+   * @return A Set of {@link ClassPathModelEntry ClassPathModelEntries}. It might be empty.
+   */
+  public Set<ClassPathModelEntry> findModels(boolean reloadCache) {
+
+    if (this.models == null || reloadCache) {
+      try (ScanResult sr = new ClassGraph().acceptJars(jarModelPrefix).disableDirScanning().scan()) {

Review Comment:
   See comment above on introducing ClassGraph as a dependency.



##
opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java:
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * 

[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854154#comment-17854154
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

kinow commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1635321056


##
LICENSE:
##
@@ -303,3 +303,27 @@ The following license applies to the SLF4J API:
 LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE,  ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+The following license applies to classgraph:
+
+The MIT License (MIT)
+
+Copyright (c) 2019 Luke Hutchison
+
+Permission is hereby granted, free of charge, to any person obtaining a 
copy
+of this software and associated documentation files (the "Software"), to 
deal
+in the Software without restriction, including without limitation the 
rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in 
all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
THE
+SOFTWARE.

Review Comment:
   I am not sure, but it probably doesn't hurt to have it listed in NOTICE.





> OpenNLP Models: Provide a Finder / Loader Implementation
> 
>
> Key: OPENNLP-1567
> URL: https://issues.apache.org/jira/browse/OPENNLP-1567
> Project: OpenNLP
>  Issue Type: New Feature
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> as the title says



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854112#comment-17854112
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

kinow commented on code in PR #606:
URL: https://github.com/apache/opennlp/pull/606#discussion_r1635163209


##
LICENSE:
##
@@ -303,3 +303,27 @@ The following license applies to the SLF4J API:
 LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE,  ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+The following license applies to classgraph:
+
+The MIT License (MIT)
+
+Copyright (c) 2019 Luke Hutchison
+
+Permission is hereby granted, free of charge, to any person obtaining a 
copy
+of this software and associated documentation files (the "Software"), to 
deal
+in the Software without restriction, including without limitation the 
rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in 
all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
THE
+SOFTWARE.

Review Comment:
   Isn't this change required only in NOTICE?








--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation

2024-06-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853993#comment-17853993
 ] 

ASF GitHub Bot commented on OPENNLP-1567:
-

rzo1 opened a new pull request, #606:
URL: https://github.com/apache/opennlp/pull/606

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [x] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [x] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   
   This adds a proof-of-concept implementation as a separate module (due to the 
additional dependency for classpath scanning) to load models from JAR files. 
   
   If we are fine with the design, we can do a first model release afterwards 
;-)
   







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1566) Array writing error in code example

2024-06-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853736#comment-17853736
 ] 

ASF GitHub Bot commented on OPENNLP-1566:
-

jzonthemtn commented on PR #605:
URL: https://github.com/apache/opennlp/pull/605#issuecomment-2158853268

   Hi @shellrean, thanks for the pull request!




> Array writing error in code example
> ---
>
> Key: OPENNLP-1566
> URL: https://issues.apache.org/jira/browse/OPENNLP-1566
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.3
>Reporter: shellrean
>Priority: Minor
> Fix For: 2.3.4
>
> Attachments: Screenshot 2024-06-10 at 23.31.20.png
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> The array notation in the documentation's code example is wrong: it 
> should be String[], but what is displayed is String variable[].
> !Screenshot 2024-06-10 at 23.31.20.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1566) Array writing error in code example

2024-06-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853732#comment-17853732
 ] 

ASF GitHub Bot commented on OPENNLP-1566:
-

shellrean opened a new pull request, #605:
URL: https://github.com/apache/opennlp/pull/605

   (no comment)







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850818#comment-17850818
 ] 

ASF GitHub Bot commented on OPENNLP-1564:
-

rzo1 commented on PR #604:
URL: https://github.com/apache/opennlp/pull/604#issuecomment-2140460193

   Thx @jzonthemtn - the next automatic eval run should be fine again. Sadly, 
we cannot trigger one manually at the moment (due to the Jenkins outage).




> Fix Evaluation Tests after POSFormat Change
> ---
>
> Key: OPENNLP-1564
> URL: https://issues.apache.org/jira/browse/OPENNLP-1564
> Project: OpenNLP
>  Issue Type: Task
>Reporter: Richard Zowalla
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.3.4, 2.4.0
>
>
> [ERROR] Failures: 
> [ERROR]   ConllXPosTaggerEval.evalDanishMaxentGis:133->eval:79 expected: <0.9504442925495558> but was: <0.0>
> [ERROR]   ConllXPosTaggerEval.evalDanishMaxentQn:144->eval:79 expected: <0.9564251537935748> but was: <0.0>
> [ERROR]   ConllXPosTaggerEval.evalSwedishMaxentGis:200->eval:79 expected: <0.9248585572842999> but was: <0.0>
> [ERROR]   ConllXPosTaggerEval.evalSwedishMaxentQn:211->eval:79 expected: <0.9377652050919377> but was: <0.0>
> [ERROR]   OntoNotes4NameFinderEval.evalAllTypesWithPOSNameFinder:152 expected: <0.8070226153653437> but was: <0.8060021642728011>
> [ERROR]   OntoNotes4PosTaggerEval.evalEnglishMaxentTagger:79->crossEval:65 expected: <0.969345319453096> but was: <1.5460660461489513E-4>
> [ERROR]   SourceForgeModelEval.evalChunkerModel:344 expected: <304922886851384639120257052245406261332> but was: <85416056838725341441074840387786758951>
> [ERROR]   SourceForgeModelEval.evalMaxentModel:376->evalPosModel:368 expected: <231995214522232523777090597594904492687> but was: <90112530006040278703441476599716290769>
> [ERROR]   SourceForgeModelEval.evalPerceptronModel:384->evalPosModel:368 expected: <209440430718727101220960491543652921728> but was: <256369615778494816584749939613105809001>
> [INFO] 
> [ERROR] Tests run: 1040, Failures: 9, Errors: 0, Skipped: 3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850819#comment-17850819
 ] 

ASF GitHub Bot commented on OPENNLP-1564:
-

rzo1 merged PR #604:
URL: https://github.com/apache/opennlp/pull/604







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change

2024-05-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850817#comment-17850817
 ] 

ASF GitHub Bot commented on OPENNLP-1564:
-

rzo1 opened a new pull request, #604:
URL: https://github.com/apache/opennlp/pull/604

   This fixes 
   
   ```bash
   [ERROR] opennlp.tools.eval.SourceForgeModelEval.evalChunkerModel 
   ```




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change

2024-05-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850571#comment-17850571
 ] 

ASF GitHub Bot commented on OPENNLP-1564:
-

rzo1 merged PR #603:
URL: https://github.com/apache/opennlp/pull/603







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change

2024-05-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850570#comment-17850570
 ] 

ASF GitHub Bot commented on OPENNLP-1564:
-

rzo1 commented on PR #603:
URL: https://github.com/apache/opennlp/pull/603#issuecomment-2138684864

   Thanks for the review. I will merge, so it gets picked up in today's scheduled 
eval run. I will close the Jira issue if the run is green (otherwise, a debug session follows).







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change

2024-05-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850471#comment-17850471
 ] 

ASF GitHub Bot commented on OPENNLP-1564:
-

rzo1 opened a new pull request, #603:
URL: https://github.com/apache/opennlp/pull/603

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   
   Updates some missed `PENN` format cases in the evaluation test data. In 
addition, if `POSTaggerNameFeatureGenerator` is used for NameFinder (i.e. 
defined via XML), we only get a POSModel and need to guess the format of it 
before creating the `POSTagger`. Therefore, I updated the mapper with a static 
method for guessing this type based on a given POSModel. This might be useful 
for other cases as well.
   
   Sadly, ASF Jenkins is currently broken, so we cannot trigger a manual run, 
see https://issues.apache.org/jira/browse/INFRA-25828
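
   To illustrate the "guess the format" idea from the note above, here is a minimal, self-contained sketch. It is not the method added in this PR: the class `TagFormatGuessSketch`, the enum `GuessedFormat`, and the simple majority heuristic are assumptions for illustration, and how the tag set is read from a `POSModel` is deliberately left out.

```java
import java.util.Set;

// Hypothetical sketch: guesses a tag format from the set of tags a model can emit.
final class TagFormatGuessSketch {

  enum GuessedFormat { UD, PENN, UNKNOWN }

  // The 17 universal POS tags (UPOS) defined by Universal Dependencies.
  private static final Set<String> UD_TAGS = Set.of(
      "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
      "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X");

  // A selection of Penn Treebank tags that do not occur in the UD tag set.
  private static final Set<String> PENN_TAGS = Set.of(
      "NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
      "JJ", "JJR", "JJS", "RB", "RBR", "RBS", "DT", "IN", "PRP", "CC");

  /** Returns the format whose known tags occur more often in the given tag set. */
  static GuessedFormat guess(Set<String> modelTags) {
    long udHits = modelTags.stream().filter(UD_TAGS::contains).count();
    long pennHits = modelTags.stream().filter(PENN_TAGS::contains).count();
    if (udHits == pennHits) {
      return GuessedFormat.UNKNOWN; // includes the case of no hits at all
    }
    return udHits > pennHits ? GuessedFormat.UD : GuessedFormat.PENN;
  }

  public static void main(String[] args) {
    System.out.println(guess(Set.of("NOUN", "VERB", "PUNCT")));
    System.out.println(guess(Set.of("NN", "VBZ", "DT")));
  }
}
```

   A heuristic like this can only be a guess: a model trained on a custom tag set would end up as `UNKNOWN` and would need an explicit setting.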







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850265#comment-17850265
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

mawiesne merged PR #601:
URL: https://github.com/apache/opennlp/pull/601




> Introduce parameter for POSTaggerME to configure output POS tag format
> --
>
> Key: OPENNLP-1539
> URL: https://issues.apache.org/jira/browse/OPENNLP-1539
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: POS Tagger
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Martin Wiesner
>Assignee: Richard Zowalla
>Priority: Major
> Fix For: 2.4.0
>
>
> [Classic (legacy) POS models|https://opennlp.sourceforge.net/models-1.5/] 
> output tags in the [PENN Treebank POS 
> tag|https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html]
>  format.
> The modern UD-based models, however, differ in the [longer output 
> format|https://universaldependencies.org/u/pos/], e.g. "VB" (Penn) vs. "VERB" 
> (UD). Extended (UD) word features are covered here: 
> https://universaldependencies.org/u/feat/index.html
> This difference results in mismatches and will cause existing IT / tests to 
> fail, if executed. Luckily, a mapping table is found here: 
> https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
> To provide compatibility for existing applications and/or use-cases, we need 
> to provide a way to retrieve both POS formats. 
> Aims:
> - Introduce a constructor parameter for POSTaggerME to configure tag format / 
> style: Penn or UD style
> - Implement a mapping between both POS tag formats: UD <==> Penn
> - Update the OpenNLP Manual to explain differences of POS tag format and 
> configuration parameter
> Conceptual idea:
> - {{new POSTaggerME("en")}}  => by _default_: UD format "as is"
> - {{new POSTaggerME("en", POSTagFormat.PENN)}} => by _intention_, here: Penn 
> style
> Benefit: 
> 1. It should be explicit so devs / user see what they will get via 
> {{POSTagFormat}}. Enum values: POSTagFormat.UD, POSTagFormat.PENN, 
> POSTagFormat.DEFAULT
> 2. IT tests can now be formulated to work on both modern and legacy models.
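
The mapping aim described above can be sketched as a simple lookup table. This is an illustration only, not the OpenNLP implementation: the class `PosTagMappingSketch`, the method `toUd`, the small excerpt of tag pairs, and the fallback to the UD tag "X" for unknown input are assumptions.

```java
import java.util.Map;

// Hypothetical sketch of a Penn Treebank -> UD (UPOS) tag lookup.
final class PosTagMappingSketch {

  // Small excerpt only; a real table would cover the full Penn tag set.
  private static final Map<String, String> PENN_TO_UD = Map.of(
      "NN", "NOUN",
      "NNS", "NOUN",
      "NNP", "PROPN",
      "VB", "VERB",
      "VBD", "VERB",
      "JJ", "ADJ",
      "RB", "ADV",
      "DT", "DET",
      "IN", "ADP",
      "PRP", "PRON");

  /** Maps a Penn tag to its UD tag; unknown tags fall back to "X" (assumption of this sketch). */
  static String toUd(String pennTag) {
    return PENN_TO_UD.getOrDefault(pennTag, "X");
  }

  public static void main(String[] args) {
    System.out.println(toUd("VB"));
    System.out.println(toUd("NNP"));
  }
}
```

Note that the direction Penn => UD is many-to-one ("NN" and "NNS" both become "NOUN"), so the reverse direction UD => Penn cannot be a plain inverse lookup and needs additional information, such as the extended UD word features mentioned above.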



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850101#comment-17850101
 ] 

ASF GitHub Bot commented on OPENNLP-1563:
-

jzonthemtn commented on PR #602:
URL: https://github.com/apache/opennlp/pull/602#issuecomment-2135699051

   Thanks @demq! 




> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -
>
> Key: OPENNLP-1563
> URL: https://issues.apache.org/jira/browse/OPENNLP-1563
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Tokenizer
>Affects Versions: 2.3.3
>Reporter: Hrayr Matevosyan
>Priority: Major
> Fix For: 2.3.4
>
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes 
> words containing non-spacing letters. For example, the Arabic word "طُوّر" 
> gets tokenized to individual letters ["ط", "ُ", "و", "ّ", "ر"].
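
One way to avoid this kind of split is to let a non-spacing mark (Unicode general category Mn) inherit the character class of the letter or digit it follows. The following is a minimal sketch of that idea and not necessarily the fix merged for this issue: the class `MarkAwareTokenizerSketch` and its character classes are hypothetical and only loosely modeled on a character-class tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a character-class tokenizer that keeps combining marks attached.
final class MarkAwareTokenizerSketch {

  private enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

  private static CharClass classify(char c, CharClass previous) {
    if (Character.isWhitespace(c)) {
      return CharClass.WHITESPACE;
    }
    // A non-spacing mark (e.g. Arabic Damma, U+064F) stays with the preceding letter or digit.
    if (Character.getType(c) == Character.NON_SPACING_MARK
        && (previous == CharClass.ALPHABETIC || previous == CharClass.NUMERIC)) {
      return previous;
    }
    if (Character.isLetter(c)) {
      return CharClass.ALPHABETIC;
    }
    if (Character.isDigit(c)) {
      return CharClass.NUMERIC;
    }
    return CharClass.OTHER;
  }

  /** Splits on whitespace and on character-class changes; OTHER characters are single tokens. */
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    CharClass previous = CharClass.WHITESPACE;
    int start = -1;
    for (int i = 0; i < text.length(); i++) {
      CharClass current = classify(text.charAt(i), previous);
      if (current == CharClass.WHITESPACE) {
        if (start >= 0) {
          tokens.add(text.substring(start, i));
          start = -1;
        }
      } else if (start < 0) {
        start = i;
      } else if (current != previous || current == CharClass.OTHER) {
        tokens.add(text.substring(start, i));
        start = i;
      }
      previous = current;
    }
    if (start >= 0) {
      tokens.add(text.substring(start));
    }
    return tokens;
  }

  public static void main(String[] args) {
    // The Arabic word from the report, written with Unicode escapes.
    System.out.println(tokenize("\u0637\u064F\u0648\u0651\u0631").size());
  }
}
```

With this rule the word from the report stays a single token, because both marks (U+064F and U+0651) are classified like the letters they modify.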



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849994#comment-17849994
 ] 

ASF GitHub Bot commented on OPENNLP-1563:
-

rzo1 merged PR #602:
URL: https://github.com/apache/opennlp/pull/602







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849993#comment-17849993
 ] 

ASF GitHub Bot commented on OPENNLP-1563:
-

mawiesne commented on PR #602:
URL: https://github.com/apache/opennlp/pull/602#issuecomment-2134979949

   Thx @demq  !







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849906#comment-17849906
 ] 

ASF GitHub Bot commented on OPENNLP-1563:
-

demq commented on code in PR #602:
URL: https://github.com/apache/opennlp/pull/602#discussion_r1616698074


##
opennlp-tools/src/test/java/opennlp/tools/tokenize/SimpleTokenizerTest.java:
##
@@ -128,4 +128,18 @@ void testTokenizationOfStringWithWindowsNewLineTokens() {
 Assertions.assertArrayEquals(new String[] {"a", "\r", "\n", "\r", "\n", 
"b", "\r", "\n", "\r", "\n", "c"},
 tokenizer.tokenize("a\r\n\r\n b\r\n\r\n c"));
   }
+
+  /**
+   * Tests if it can tokenize a word containing a non-spacing character
+   * like Arabic Damma Unicode Character “◌ُ” (U+064F)
+   */
+  @Test
+  void testNonSpacingLetters() {
+String text = "طُوّر";

Review Comment:
   I have just pushed an update with a full sentence.







[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849899#comment-17849899
 ] 

ASF GitHub Bot commented on OPENNLP-1563:
-

rzo1 commented on code in PR #602:
URL: https://github.com/apache/opennlp/pull/602#discussion_r1616664137


##
opennlp-tools/src/test/java/opennlp/tools/tokenize/SimpleTokenizerTest.java:
##
@@ -128,4 +128,18 @@ void testTokenizationOfStringWithWindowsNewLineTokens() {
 Assertions.assertArrayEquals(new String[] {"a", "\r", "\n", "\r", "\n", 
"b", "\r", "\n", "\r", "\n", "c"},
 tokenizer.tokenize("a\r\n\r\n b\r\n\r\n c"));
   }
+
+  /**
+   * Tests if it can tokenize a word containing a non-spacing character
+   * like Arabic Damma Unicode Character “◌ُ” (U+064F)
+   */
+  @Test
+  void testNonSpacingLetters() {
+String text = "طُوّر";

Review Comment:
   @demq Can we have a full sentence example here?







[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

2024-05-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849892#comment-17849892
 ] 

ASF GitHub Bot commented on OPENNLP-1563:
-

demq opened a new pull request, #602:
URL: https://github.com/apache/opennlp/pull/602

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   






[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849223#comment-17849223
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

rzo1 commented on PR #601:
URL: https://github.com/apache/opennlp/pull/601#issuecomment-2128979449

   > @rzo1 Did you run into any weirdness with the older Sourceforge models?
   
   Added tests. They look good.




> Introduce parameter for POSTaggerME to configure output POS tag format
> --
>
> Key: OPENNLP-1539
> URL: https://issues.apache.org/jira/browse/OPENNLP-1539
> Project: OpenNLP
> Issue Type: Improvement
> Components: POS Tagger
> Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Reporter: Martin Wiesner
> Assignee: Richard Zowalla
> Priority: Major
> Fix For: 2.4.0
>
>
> [Classic (legacy) POS models|https://opennlp.sourceforge.net/models-1.5/] 
> output tags in the [PENN Treebank POS 
> tag|https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html]
>  format.
> The modern UD-based models, however, differ in the [longer output 
> format|https://universaldependencies.org/u/pos/], e.g. "VB" (Penn) vs. "VERB" 
> (UD). Extended (UD) word features are covered here: 
> https://universaldependencies.org/u/feat/index.html
> This difference results in mismatches and will cause existing IT / tests to 
> fail, if executed. Luckily, a mapping table is found here: 
> https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
> To provide compatibility for existing applications and/or use-cases, we need 
> to provide a way to retrieve both POS formats. 
> Aims:
> - Introduce a constructor parameter for POSTaggerME to configure tag format / 
> style: Penn or UD style
> - Implement a mapping between both POS tag formats: UD <==> Penn
> - Update the OpenNLP Manual to explain differences of POS tag format and 
> configuration parameter
> Conceptual idea:
> - {{new POSTaggerME("en")}}  => by _default_: UD format "as is"
> - {{new POSTaggerME("en", POSTagFormat.PENN)}} => by _intention_, here: Penn 
> style
> Benefit: 
> 1. It should be explicit so devs / users see what they will get via 
> {{POSTagFormat}}. Enum values: POSTagFormat.UD, POSTagFormat.PENN, 
> POSTagFormat.DEFAULT
> 2. IT tests can now be formulated to work on both modern and legacy models.
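
The conceptual idea in the issue can be sketched roughly as follows. This is an illustration only: `PosTagFormatSketch` and its `convert` method are hypothetical names, the enum merely mirrors the values listed in the issue, and the tables contain just three entries each. The PENN to UD entries appear in the conversion table linked above; the reverse entries for VERB and NOUN are assumptions made for the example. The signature of the real POSTaggerME constructor is not shown in this thread and may differ.

```java
import java.util.Map;

/** Hypothetical sketch; names and signatures are illustrative, not OpenNLP's API. */
final class PosTagFormatSketch {

  /** Enum values as listed in the issue; the real enum may differ. */
  enum POSTagFormat { UD, PENN, DEFAULT }

  // PENN -> UD entries, taken from the UD conversion table.
  private static final Map<String, String> PENN_TO_UD =
      Map.of("VB", "VERB", "NN", "NOUN", "JJ", "ADJ");

  // UD -> PENN entries: an assumed, simplified inverse for this example.
  private static final Map<String, String> UD_TO_PENN =
      Map.of("VERB", "VB", "NOUN", "NN", "ADJ", "JJ");

  /** Converts tags to the requested format; unknown tags pass through unchanged. */
  static String[] convert(String[] tags, POSTagFormat target) {
    String[] out = new String[tags.length];
    for (int i = 0; i < tags.length; i++) {
      switch (target) {
        case UD:
          out[i] = PENN_TO_UD.getOrDefault(tags[i], tags[i]);
          break;
        case PENN:
          out[i] = UD_TO_PENN.getOrDefault(tags[i], tags[i]);
          break;
        default:
          // DEFAULT: return the model output "as is".
          out[i] = tags[i];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String[] udTags = {"NOUN", "VERB", "ADJ"};
    System.out.println(String.join(" ", convert(udTags, POSTagFormat.PENN)));
  }
}
```

Making the target format an explicit argument keeps the default path untouched, which matches the stated goal that callers see what they will get.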





[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849194#comment-17849194
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

mawiesne commented on code in PR #601:
URL: https://github.com/apache/opennlp/pull/601#discussion_r1612947718


##
opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java:
##
@@ -0,0 +1,207 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.postag;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A mapping implementation for converting between different POS tag formats.
+ * This class supports conversion between Penn Treebank (PENN) and Universal Dependencies (UD) formats.
+ * The conversion is based on the <a href="https://universaldependencies.org/tagset-conversion/en-penn-uposf.html">Universal
+ * Dependencies conversion table</a>.
+ * Please note that when converting from UD to Penn format, there may be ambiguity in some cases.
+ */
+public class POSTagFormatMapper {
+
+  private static final Logger logger = LoggerFactory.getLogger(POSTagFormatMapper.class);
+
+  private static final Map<String, String> CONVERSION_TABLE_PENN_TO_UD = new HashMap<>();
+  private static final Map<String, String> CONVERSION_TABLE_UD_TO_PENN = new HashMap<>();
+
+  static {
+/*
+ * This is a conversion table to convert PENN to UD format as described in
+ * https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
+ */
+CONVERSION_TABLE_PENN_TO_UD.put("#", "SYM");
+CONVERSION_TABLE_PENN_TO_UD.put("$", "SYM");
+CONVERSION_TABLE_PENN_TO_UD.put("''", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put(",", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("-LRB-", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("-RRB-", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put(".", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put(":", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("AFX", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("CC", "CCONJ");
+CONVERSION_TABLE_PENN_TO_UD.put("CD", "NUM");
+CONVERSION_TABLE_PENN_TO_UD.put("DT", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("EX", "PRON");
+CONVERSION_TABLE_PENN_TO_UD.put("FW", "X");
+CONVERSION_TABLE_PENN_TO_UD.put("HYPH", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("IN", "ADP");
+CONVERSION_TABLE_PENN_TO_UD.put("JJ", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("JJR", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("JJS", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("LS", "X");
+CONVERSION_TABLE_PENN_TO_UD.put("MD", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("NIL", "X");
+CONVERSION_TABLE_PENN_TO_UD.put("NN", "NOUN");
+CONVERSION_TABLE_PENN_TO_UD.put("NNP", "PROPN");
+CONVERSION_TABLE_PENN_TO_UD.put("NNPS", "PROPN");
+CONVERSION_TABLE_PENN_TO_UD.put("NNS", "NOUN");
+CONVERSION_TABLE_PENN_TO_UD.put("PDT", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("POS", "PART");
+CONVERSION_TABLE_PENN_TO_UD.put("PRP", "PRON");
+CONVERSION_TABLE_PENN_TO_UD.put("PRP$", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("RB", "ADV");
+CONVERSION_TABLE_PENN_TO_UD.put("RBR", "ADV");
+CONVERSION_TABLE_PENN_TO_UD.put("RBS", "ADV");
+CONVERSION_TABLE_PENN_TO_UD.put("RP", "ADP");
+CONVERSION_TABLE_PENN_TO_UD.put("SYM", "SYM");
+CONVERSION_TABLE_PENN_TO_UD.put("TO", "PART");
+CONVERSION_TABLE_PENN_TO_UD.put("UH", "INTJ");
+CONVERSION_TABLE_PENN_TO_UD.put("VB", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBD", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBG", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBN", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBP", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBZ", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("WDT", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("WP", "PRON");
+CONVERSION_TABLE_PENN_TO_UD.put("WP$", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("WRB", "ADV");
+
+/*
+ * Note: The back conversion might lose information.
+ */
+CONVERSION_TABLE_UD_TO_PENN.put("ADJ", "JJ");
+
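
The note about the back conversion losing information can be made concrete with the entries visible in the excerpt above. The sketch below is not part of the pull request; `RoundTripSketch` and `lossyTags` are hypothetical names. It copies only the four PENN tags that map to ADJ and the single UD to PENN entry shown so far, then lists the PENN tags that do not survive a PENN to UD to PENN round trip.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the information loss in a PENN -> UD -> PENN round trip. */
final class RoundTripSketch {

  // Insertion-ordered so the result is deterministic.
  static final Map<String, String> PENN_TO_UD = new LinkedHashMap<>();
  static final Map<String, String> UD_TO_PENN = new LinkedHashMap<>();

  static {
    // Entries copied from the excerpt: four PENN tags collapse into one UD tag.
    PENN_TO_UD.put("JJ", "ADJ");
    PENN_TO_UD.put("JJR", "ADJ");
    PENN_TO_UD.put("JJS", "ADJ");
    PENN_TO_UD.put("AFX", "ADJ");
    // The back conversion can pick only one of them.
    UD_TO_PENN.put("ADJ", "JJ");
  }

  /** Returns the PENN tags that change when converted to UD and back. */
  static List<String> lossyTags() {
    List<String> lossy = new ArrayList<>();
    for (Map.Entry<String, String> entry : PENN_TO_UD.entrySet()) {
      String roundTripped = UD_TO_PENN.get(entry.getValue());
      if (!entry.getKey().equals(roundTripped)) {
        lossy.add(entry.getKey());
      }
    }
    return lossy;
  }

  public static void main(String[] args) {
    System.out.println(lossyTags()); // prints [JJR, JJS, AFX]
  }
}
```

Because the PENN to UD direction is many-to-one, any single-valued reverse table has to drop distinctions such as comparative (JJR) versus superlative (JJS).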

[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849174#comment-17849174
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

rzo1 commented on code in PR #601:
URL: https://github.com/apache/opennlp/pull/601#discussion_r1612885912



[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849172#comment-17849172
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

rzo1 commented on code in PR #601:
URL: https://github.com/apache/opennlp/pull/601#discussion_r1612883541



[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849158#comment-17849158
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

mawiesne commented on PR #601:
URL: https://github.com/apache/opennlp/pull/601#issuecomment-2128552371

   Thx @rzo1 - I left some comments to improve clarity of the doc for the 
changes in the API.






[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849157#comment-17849157
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

mawiesne commented on code in PR #601:
URL: https://github.com/apache/opennlp/pull/601#discussion_r1612745224



[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849055#comment-17849055
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

jzonthemtn commented on code in PR #601:
URL: https://github.com/apache/opennlp/pull/601#discussion_r1612105341


##
opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java:
##
@@ -0,0 +1,207 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.postag;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A mapping implementation for converting between different POS tag formats.
+ * This class supports conversion between Penn Treebank (PENN) and Universal Dependencies (UD) formats.
+ * The conversion is based on the <a href="https://universaldependencies.org/tagset-conversion/en-penn-uposf.html">Universal
+ * Dependencies conversion table</a>.
+ * Please note that when converting from UD to Penn format, there may be ambiguity in some cases.
+ */
+public class POSTagFormatMapper {
+
+  private static final Logger logger = LoggerFactory.getLogger(POSTagFormatMapper.class);
+
+  private static final Map<String, String> CONVERSION_TABLE_PENN_TO_UD = new HashMap<>();
+  private static final Map<String, String> CONVERSION_TABLE_UD_TO_PENN = new HashMap<>();
+
+  static {
+/*
+ * This is a conversion table to convert PENN to UD format as described in
+ * https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
+ */
+CONVERSION_TABLE_PENN_TO_UD.put("#", "SYM");
+CONVERSION_TABLE_PENN_TO_UD.put("$", "SYM");
+CONVERSION_TABLE_PENN_TO_UD.put("''", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put(",", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("-LRB-", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("-RRB-", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put(".", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put(":", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("AFX", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("CC", "CCONJ");
+CONVERSION_TABLE_PENN_TO_UD.put("CD", "NUM");
+CONVERSION_TABLE_PENN_TO_UD.put("DT", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("EX", "PRON");
+CONVERSION_TABLE_PENN_TO_UD.put("FW", "X");
+CONVERSION_TABLE_PENN_TO_UD.put("HYPH", "PUNCT");
+CONVERSION_TABLE_PENN_TO_UD.put("IN", "ADP");
+CONVERSION_TABLE_PENN_TO_UD.put("JJ", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("JJR", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("JJS", "ADJ");
+CONVERSION_TABLE_PENN_TO_UD.put("LS", "X");
+CONVERSION_TABLE_PENN_TO_UD.put("MD", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("NIL", "X");
+CONVERSION_TABLE_PENN_TO_UD.put("NN", "NOUN");
+CONVERSION_TABLE_PENN_TO_UD.put("NNP", "PROPN");
+CONVERSION_TABLE_PENN_TO_UD.put("NNPS", "PROPN");
+CONVERSION_TABLE_PENN_TO_UD.put("NNS", "NOUN");
+CONVERSION_TABLE_PENN_TO_UD.put("PDT", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("POS", "PART");
+CONVERSION_TABLE_PENN_TO_UD.put("PRP", "PRON");
+CONVERSION_TABLE_PENN_TO_UD.put("PRP$", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("RB", "ADV");
+CONVERSION_TABLE_PENN_TO_UD.put("RBR", "ADV");
+CONVERSION_TABLE_PENN_TO_UD.put("RBS", "ADV");
+CONVERSION_TABLE_PENN_TO_UD.put("RP", "ADP");
+CONVERSION_TABLE_PENN_TO_UD.put("SYM", "SYM");
+CONVERSION_TABLE_PENN_TO_UD.put("TO", "PART");
+CONVERSION_TABLE_PENN_TO_UD.put("UH", "INTJ");
+CONVERSION_TABLE_PENN_TO_UD.put("VB", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBD", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBG", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBN", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBP", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("VBZ", "VERB");
+CONVERSION_TABLE_PENN_TO_UD.put("WDT", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("WP", "PRON");
+CONVERSION_TABLE_PENN_TO_UD.put("WP$", "DET");
+CONVERSION_TABLE_PENN_TO_UD.put("WRB", "ADV");
+
+/*
+ * Note: The back conversion might lose information.
+ */
+CONVERSION_TABLE_UD_TO_PENN.put("ADJ", "JJ");
+

[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849016#comment-17849016
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

rzo1 commented on code in PR #601:
URL: https://github.com/apache/opennlp/pull/601#discussion_r1611936275


##
opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java:
##

[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849008#comment-17849008
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

rzo1 commented on PR #601:
URL: https://github.com/apache/opennlp/pull/601#issuecomment-2127421991

   Need to add a test for it ;-) 




> Introduce parameter for POSTaggerME to configure output POS tag format
> --
>
> Key: OPENNLP-1539
> URL: https://issues.apache.org/jira/browse/OPENNLP-1539
> Project: OpenNLP
> Issue Type: Improvement
> Components: POS Tagger
> Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Reporter: Martin Wiesner
> Assignee: Richard Zowalla
> Priority: Major
> Fix For: 2.4.0
>
>
> [Classic (legacy) POS models|https://opennlp.sourceforge.net/models-1.5/] 
> output tags in the [PENN Treebank POS 
> tag|https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html]
>  format.
> The modern UD-based models, however, differ in the [longer output 
> format|https://universaldependencies.org/u/pos/], e.g. "VB" (Penn) vs. "VERB" 
> (UD). Extended (UD) word features are covered here: 
> https://universaldependencies.org/u/feat/index.html
> This difference results in mismatches and will cause existing IT / tests to 
> fail, if executed. Luckily, a mapping table is found here: 
> https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
> To provide compatibility for existing applications and/or use-cases, we need 
> to provide a way to retrieve both POS formats. 
> Aims:
> - Introduce a constructor parameter for POSTaggerME to configure tag format / 
> style: Penn or UD style
> - Implement a mapping between both POS tag formats: UD <==> Penn
> - Update the OpenNLP Manual to explain differences of POS tag format and 
> configuration parameter
> Conceptual idea:
> - {{new POSTaggerME("en")}}  => by _default_: UD format "as is"
> - {{new POSTaggerME("en", POSTagFormat.PENN)}} => by _intention_, here: Penn 
> style
> Benefit: 
> 1. It should be explicit so devs / users see what they will get via 
> {{POSTagFormat}}. Enum values: POSTagFormat.UD, POSTagFormat.PENN, 
> POSTagFormat.DEFAULT
> 2. IT tests can now be formulated to work on both modern and legacy models.
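The mapping idea in the issue description can be illustrated with a minimal sketch. This is not the OpenNLP implementation: the class `TagFormatSketch`, its `Format` enum, and the `convert` method are hypothetical names, the tables are a short excerpt, and the `NOUN` and `VERB` back-mappings as well as the pass-through of unknown tags are assumptions made for the example.

```java
import java.util.Map;

/** Hypothetical sketch of a Penn <-> UD tag mapping; not the OpenNLP API. */
final class TagFormatSketch {

  /** Illustrative stand-in for the POSTagFormat values named in the issue. */
  enum Format { UD, PENN }

  // Excerpt of the Penn -> UD direction: several Penn tags share one UD tag.
  private static final Map<String, String> PENN_TO_UD = Map.of(
      "JJ", "ADJ", "JJR", "ADJ", "JJS", "ADJ",
      "NN", "NOUN", "NNS", "NOUN",
      "VB", "VERB", "VBD", "VERB");

  // Back conversion is lossy: only one representative Penn tag per UD tag.
  // ADJ -> JJ appears in the PR diff; NOUN -> NN and VERB -> VB are assumed.
  private static final Map<String, String> UD_TO_PENN = Map.of(
      "ADJ", "JJ", "NOUN", "NN", "VERB", "VB");

  /** Converts a tag into the target format. */
  static String convert(String tag, Format target) {
    Map<String, String> table = (target == Format.UD) ? PENN_TO_UD : UD_TO_PENN;
    // Assumption of this sketch: unknown tags pass through unchanged.
    return table.getOrDefault(tag, tag);
  }

  public static void main(String[] args) {
    String ud = convert("JJR", Format.UD);
    System.out.println(ud);                       // prints "ADJ"
    System.out.println(convert(ud, Format.PENN)); // prints "JJ", not "JJR"
  }
}
```

The round trip shows why the UD to Penn direction is ambiguous: the comparative information carried by "JJR" cannot be recovered from "ADJ" alone.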



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849006#comment-17849006
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

jzonthemtn commented on PR #601:
URL: https://github.com/apache/opennlp/pull/601#issuecomment-2127416229

   @rzo1 Did you run into any weirdness with the older Sourceforge models?






[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849000#comment-17849000
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

jzonthemtn commented on code in PR #601:
URL: https://github.com/apache/opennlp/pull/601#discussion_r1611890953


##
opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java:
##

[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848997#comment-17848997
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

rzo1 commented on PR #601:
URL: https://github.com/apache/opennlp/pull/601#issuecomment-2127381263

   Note: The currently failing tests are unrelated to the actual change:
   
   ```bash
   Error:ChunkerModelLoaderTest.initResources:43->lambda$initResources$0:47 Runtime java.io.IOException: Server returned HTTP response code: 503 for URL: https://opennlp.sourceforge.net/models-1.5/en-chunker.bin
   Error:TokenNameFinderModelLoaderTest.initResources:43->lambda$initResources$0:47 Runtime java.io.IOException: Server returned HTTP response code: 503 for URL: https://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
   Error:TokenNameFinderModelTest.testNERWithPOSModelV15:122->AbstractModelLoaderTest.downloadVersion15Model:41->AbstractModelLoaderTest.downloadModel:57 » IO Server returned HTTP response code: 503 for URL: https://opennlp.sourceforge.net/models-1.5/pt-pos-perceptron.bin
   ```






[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format

2024-05-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848994#comment-17848994
 ] 

ASF GitHub Bot commented on OPENNLP-1539:
-

rzo1 opened a new pull request, #601:
URL: https://github.com/apache/opennlp/pull/601

   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   
   This is a **Draft** open for feedback on implementing compatibility with 
older PENN-based POS models from Sourceforge. 
   
   






[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844159#comment-17844159
 ] 

ASF GitHub Bot commented on OPENNLP-1556:
-

mawiesne merged PR #600:
URL: https://github.com/apache/opennlp/pull/600




> Improve speed of checksum computation in TwoPassDataIndexer
> ---
>
> Key: OPENNLP-1556
> URL: https://issues.apache.org/jira/browse/OPENNLP-1556
> Project: OpenNLP
> Issue Type: Improvement
> Components: Machine Learning
> Affects Versions: 1.9.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Major
> Fix For: 2.3.4
>
>
> For training ML models, all observations (Events) are indexed via 
> {{TwoPassDataIndexer#index(ObjectStream<Event> eventStream)}}. 
> When #index(..) is run, a tmp file is written and read in again. For the 
> purpose of checksum validation, instances of HashSumEventStream are used to 
> validate the content processed. 
> Based on a rather slow toString() implementation in Event, a cryptographic 
> (MD5) message digest is computed. This, however, is much slower than simply 
> computing a checksum (such as a CRC32c value) for both directions 
> (write/read). The (slowing) effect is more problematic when larger training 
> corpora are (pre-)processed, that is, indexed in advance. 
> Aims:
> - Speed up the (IO-bound) indexing part prior to the actual CPU-bound training 
> phase.
> - Switch from MD5 to CRC32c, as there is *no* need for a cryptographic hash 
> function here; it's simply a checksum that is required to decide whether all 
> bytes written are the same bytes that are read.
> - Remove the untested class HashSumEventStream which is just a wrapper for 
> calling a slow toString() in Event to get some bytes to use for the 
> computation of a checksum / md.
> - Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream that 
> makes use of the faster CRC32c checksum computation, avoiding cryptographic 
> hash functions such as MD5.
> - Make sure all existing tests hold.
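The write/read comparison described above can be sketched with the JDK's `java.util.zip.CRC32C` class (available since Java 9). This is only an illustration of the checksum idea, not the code from the PR: the class `ChecksumSketch`, the `checksum` helper, and the sample event strings are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

/** Hypothetical sketch: compare data written and data read via CRC32c. */
final class ChecksumSketch {

  /** Feeds the bytes of every event into one running CRC32c value. */
  static long checksum(List<String> events) {
    Checksum crc = new CRC32C();
    for (String event : events) {
      byte[] bytes = event.getBytes(StandardCharsets.UTF_8);
      crc.update(bytes, 0, bytes.length);
    }
    return crc.getValue();
  }

  public static void main(String[] args) {
    // Stand-ins for the events written to and read back from the tmp file.
    List<String> written = List.of("outcome=a f1 f2", "outcome=b f3");
    List<String> readBack = List.of("outcome=a f1 f2", "outcome=b f3");
    // Equal checksums indicate the bytes read match the bytes written.
    System.out.println(checksum(written) == checksum(readBack)); // prints "true"
  }
}
```

A non-cryptographic checksum is sufficient here because the goal is to detect accidental differences between the two passes, not to resist deliberate tampering.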





[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer

2024-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843560#comment-17843560
 ] 

ASF GitHub Bot commented on OPENNLP-1556:
-

jzonthemtn commented on PR #600:
URL: https://github.com/apache/opennlp/pull/600#issuecomment-2094769379

   Looks awesome! Thanks @mawiesne 






[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer

2024-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843539#comment-17843539
 ] 

ASF GitHub Bot commented on OPENNLP-1556:
-

mawiesne commented on PR #600:
URL: https://github.com/apache/opennlp/pull/600#issuecomment-2094703523

   Note: Added better JavaDoc, code: unchanged.






[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer

2024-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843537#comment-17843537
 ] 

ASF GitHub Bot commented on OPENNLP-1556:
-

rzo1 commented on PR #600:
URL: https://github.com/apache/opennlp/pull/600#issuecomment-2094698256

   https://ci-builds.apache.org/job/OPENnLP/job/eval-tests-configurable/18/




> Improve speed of checksum computation in TwoPassDataIndexer
> ---
>
> Key: OPENNLP-1556
> URL: https://issues.apache.org/jira/browse/OPENNLP-1556
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Machine Learning
>Affects Versions: 1.9.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.3.4
>
>
> For training ML models, all observations (Events) are indexed via 
> {{TwoPassDataIndexer#index(ObjectStream eventStream)}}. 
> When #index(..) is run, a tmp file is written and read in again. For the 
> purpose of checksum validation, instances of HashSumEventStream are used to 
> validate the content processed. 
> Based on a rather slow toString() implementation in Event, a cryptographic 
> (MD5) message digest is computed. This, however, is much slower than simply 
> computing a checksum (such as a CRC32c value) for both directions 
> (write/read). The (slowing) effect is more problematic when larger training 
> corpora are (pre-)processed, that is, indexed in advance. 
> Aims:
> - Speed up the (IO-bound) indexing part prior to the actual CPU-bound training 
> phase.
> - Switch from MD5 to CRC32c, as there is *no* need for a cryptographic hash 
> function here; it's simply a checksum that is required to decide whether all 
> bytes written are the same bytes that are read.
> - Remove the untested class HashSumEventStream which is just a wrapper for 
> calling a slow toString() in Event to get some bytes to use for the 
> computation of a checksum / md.
> - Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream that 
> makes use of the faster CRC32c checksum computation, avoiding cryptographic 
> hash functions such as MD5.
> - Make sure all existing tests hold.





[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer

2024-05-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843533#comment-17843533
 ] 

ASF GitHub Bot commented on OPENNLP-1556:
-

mawiesne opened a new pull request, #600:
URL: https://github.com/apache/opennlp/pull/600

   Change
   -
   - adjusts TwoPassDataIndexer to make use of JDK's built-in 
`CheckedOutputStream` / `CheckedInputStream` for checksum 
([CRC32c](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/zip/CRC32C.html))
 computations
   - removes untested class `HashSumEventStream` which is just a wrapper for 
calling a slow toString() in Event to get some bytes to use for the computation 
of a checksum
   - provides a HashSumEventStream replacement: `ChecksumEventStream` which 
makes use of the faster CRC32c checksum computation, avoiding cryptographic 
hash functions such as MD5
   - adds JUnit tests for ChecksumEventStream
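
   The write/read validation described in the first bullet can be sketched with the JDK classes named there. This is a minimal illustration under stated assumptions, not the code of the PR: the class name `CheckedRoundTrip`, its helper methods, and the use of `writeUTF`/`readUTF` as the record format are all hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.CRC32C;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

// Hypothetical demo class, not the TwoPassDataIndexer implementation.
public class CheckedRoundTrip {

  /** Writes all lines to the file and returns the CRC32C of the bytes written. */
  static long write(Path file, List<String> lines) throws IOException {
    try (CheckedOutputStream cos =
             new CheckedOutputStream(Files.newOutputStream(file), new CRC32C());
         DataOutputStream out = new DataOutputStream(cos)) {
      for (String line : lines) {
        out.writeUTF(line);
      }
      out.flush();
      return cos.getChecksum().getValue();
    }
  }

  /** Reads 'count' records back and returns the CRC32C of the bytes read. */
  static long read(Path file, int count) throws IOException {
    try (CheckedInputStream cis =
             new CheckedInputStream(Files.newInputStream(file), new CRC32C());
         DataInputStream in = new DataInputStream(cis)) {
      for (int i = 0; i < count; i++) {
        in.readUTF();
      }
      return cis.getChecksum().getValue();
    }
  }

  /** Returns true if the checksum after reading equals the checksum after writing. */
  static boolean roundTripMatches(List<String> lines) {
    try {
      Path tmp = Files.createTempFile("events", ".tmp");
      try {
        long written = write(tmp, lines);
        long readBack = read(tmp, lines.size());
        return written == readBack;
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTripMatches(List.of("outcome a=1 b=2", "other c=3")));
  }
}
```

   Because the checked streams update the checksum as bytes pass through them, no separate pass over the data (and no `toString()` call per event) is needed in this sketch.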
   
   Note(s)
   -
   1. A _full_ OpenNLP build is consistently about 3-4s faster on my local 
machine (61s vs. 57s => ~ -7%).
   2. The effect should be greater when processing large(r) training corpora. 
   3. We should see a difference in a bare metal EVAL build run.
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Improve speed of checksum computation in TwoPassDataIndexer
> ---
>
> Key: OPENNLP-1556
> URL: https://issues.apache.org/jira/browse/OPENNLP-1556
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Machine Learning
>Affects Versions: 1.9.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.3.4
>
>
> For training ML models, all observations (Events) are indexed via 
> {{TwoPassDataIndexer#index(ObjectStream eventStream)}}. 
> When #index(..) is run, a tmp file is written and read in again. For the 
> purpose of checksum validation, instances of HashSumEventStream are used to 
> validate the content processed. 
> Based on a rather slow toString() implementation in Event, a cryptographic 
> (MD5) message digest is computed. This, however, is much slower than simply 
> computing a checksum (such as a CRC32c value) for both directions 
> (write/read). The (slowing) effect is more problematic when larger training 
> corpora are (pre-)processed, that is, indexed in advance. 
> Aims:
> - Speed up the (IO-bound) indexing part prior to the actual CPU-bound training 
> phase.
> - Switch from MD5 to CRC32c, as there is *no* need for a cryptographic hash 
> function here; it's simply a checksum that is required to decide whether all 
> bytes written are the same bytes that are read.
> - Remove the untested class HashSumEventStream which is just a wrapper for 
> calling a slow toString() in Event to get some bytes to use for the 
> computation of a checksum / md.
> - Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream that 
> makes use of the faster CRC32c checksum computation, avoiding cryptographic 
> hash functions such as MD5.
> - Make sure all existing tests hold.





[jira] [Commented] (OPENNLP-1555) TokenizerME should detect multi-dot abbreviations

2024-05-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842864#comment-17842864
 ] 

ASF GitHub Bot commented on OPENNLP-1555:
-

mawiesne merged PR #599:
URL: https://github.com/apache/opennlp/pull/599




> TokenizerME should detect multi-dot abbreviations
> -
>
> Key: OPENNLP-1555
> URL: https://issues.apache.org/jira/browse/OPENNLP-1555
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Tokenizer
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.3.4
>
>
> TokenizerME should detect and handle multi-dot abbreviations correctly. 
> Currently, this is not handled correctly. For instance,
> German: "z.B." = "zum Beispiel" (for example) or, 
> Dutch: "e.v." = "en volgende" (and following)
> are not tokenized correctly and extra tokens are returned. NOTE: no 
> whitespaces in between the dots in the above examples.
> Aims:
>  * Fix the detection / handling of abbreviations for multi-dot abbreviations
>  * Provide test cases that cover these cases
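
The problem described above can be reproduced with a toy tokenizer. The following is a minimal sketch, assuming a simple rule-based splitter that treats every '.' as its own token unless the whole whitespace-delimited chunk is listed in an abbreviation set. It is not how TokenizerME works internally (TokenizerME is model-based); the class `MultiDotAbbreviationSketch` and its method are hypothetical and only show why a multi-dot abbreviation yields extra tokens when it is not recognized as a unit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of OpenNLP.
public class MultiDotAbbreviationSketch {

  /**
   * Splits text on whitespace, then splits each chunk at every '.',
   * unless the complete chunk is a known abbreviation.
   */
  static List<String> tokenize(String text, Set<String> abbreviations) {
    List<String> tokens = new ArrayList<>();
    for (String chunk : text.trim().split("\\s+")) {
      if (chunk.isEmpty()) {
        continue;
      }
      if (abbreviations.contains(chunk)) {
        tokens.add(chunk); // keep e.g. "z.B." as one token
        continue;
      }
      StringBuilder word = new StringBuilder();
      for (char c : chunk.toCharArray()) {
        if (c == '.') {
          if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
          }
          tokens.add(".");
        } else {
          word.append(c);
        }
      }
      if (word.length() > 0) {
        tokens.add(word.toString());
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    String text = "Das gilt z.B. hier.";
    // Without abbreviation knowledge, "z.B." falls apart into four tokens.
    System.out.println(tokenize(text, Set.of()));
    // With the abbreviation listed, it survives as a single token.
    System.out.println(tokenize(text, Set.of("z.B.")));
  }
}
```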





[jira] [Commented] (OPENNLP-1555) TokenizerME should detect multi-dot abbreviations

2024-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841646#comment-17841646
 ] 

ASF GitHub Bot commented on OPENNLP-1555:
-

mawiesne opened a new pull request, #599:
URL: https://github.com/apache/opennlp/pull/599

   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> TokenizerME should detect multi-dot abbreviations
> -
>
> Key: OPENNLP-1555
> URL: https://issues.apache.org/jira/browse/OPENNLP-1555
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Tokenizer
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.3.4
>
>
> TokenizerME should detect and handle multi-dot abbreviations correctly. 
> Currently, this is not handled correctly. For instance,
> German: "z.B." = "zum Beispiel" (for example) or, 
> Dutch: "e.v." = "en volgende" (and following)
> are not tokenized correctly and extra tokens are returned. NOTE: no 
> whitespaces in between the dots in the above examples.
> Aims:
>  * Fix the detection / handling of abbreviations for multi-dot abbreviations
>  * Provide test cases that cover these cases





[jira] [Commented] (OPENNLP-1554) Add Dutch abbreviation dictionary

2024-04-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839321#comment-17839321
 ] 

ASF GitHub Bot commented on OPENNLP-1554:
-

mawiesne merged PR #597:
URL: https://github.com/apache/opennlp/pull/597




> Add Dutch abbreviation dictionary
> -
>
> Key: OPENNLP-1554
> URL: https://issues.apache.org/jira/browse/OPENNLP-1554
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector, Tokenizer
>Affects Versions: 2.3.2
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.3.3
>
>
> Similar to the addition in OPENNLP-1526, an abbreviation dictionary for Dutch 
> sentence detection and tokenisation might be beneficial.
> Aims:
> Create and add a new file {{abb_NL.xml}} in {{opennlp-tools/lang/nl}}
> Add basic set of test cases





[jira] [Commented] (OPENNLP-1554) Add Dutch abbreviation dictionary

2024-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839131#comment-17839131
 ] 

ASF GitHub Bot commented on OPENNLP-1554:
-

kinow commented on code in PR #597:
URL: https://github.com/apache/opennlp/pull/597#discussion_r1573009751


##
opennlp-tools/src/test/java/opennlp/tools/sentdetect/AbstractSentenceDetectorTest.java:
##
@@ -30,13 +30,16 @@
 
 public abstract class AbstractSentenceDetectorTest {
 
-  protected static final Locale LOCALE_SPANISH = new Locale("es");
+  protected static final Locale LOCALE_DUTCH = new Locale("nl");
   protected static final Locale LOCALE_POLISH = new Locale("pl");
   protected static final Locale LOCALE_PORTUGUESE = new Locale("pt");
+  protected static final Locale LOCALE_SPANISH = new Locale("es");

Review Comment:
   The small details that normally pass unnoticed! Thanks for sorting 
alphabetically :clap: 



##
opennlp-tools/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEDutchTest.java:
##
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.sentdetect;
+
+import java.io.IOException;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.dictionary.Dictionary;
+
+/**
+ * Tests for the {@link SentenceDetectorME} class.
+ * 
+ * Demonstrates OPENNLP-1554.
+ * 
+ * In this context, well-known Dutch (nl_NL) abbreviations must be respected,
+ * so that words abbreviated with one or more '.' characters do not
+ * result in incorrect sentence boundaries.
+ * 
+ * See:
+ * <a href="https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-1554">OPENNLP-1554</a>
+ */
+public class SentenceDetectorMEDutchTest extends AbstractSentenceDetectorTest {
+
+  private static final char[] EOS_CHARS = {'.', '?', '!'};
+  
+  private static SentenceModel sentdetectModel;
+
+  @BeforeAll
+  public static void prepareResources() throws IOException {
+Dictionary abbreviationDict = loadAbbDictionary(LOCALE_DUTCH);
+SentenceDetectorFactory factory = new SentenceDetectorFactory(
+"dut", true, abbreviationDict, EOS_CHARS);
+sentdetectModel = train(factory, LOCALE_DUTCH);
+Assertions.assertNotNull(sentdetectModel);
+Assertions.assertEquals("dut", sentdetectModel.getLanguage());
+  }
+
+  // Example taken from 'Sentences_NL.txt'
+  @Test
+  void testSentDetectWithInlineAbbreviationsEx1() {
+final String sent1 = "Een droom, tot de vorming waarvan een bijzonder sterke compressie " +
+"heeft bijgedragen, zal het meest gunstige materiaal zijn voor dit onderzoek.";
+// Here we have one abbreviation: "p." => pagina (page)
+final String sent2 = "Ik kies voor de droom van de botanische monografie die " +
+"op p. 183 en volgende wordt beschreven.";
+
+SentenceDetectorME sentDetect = new SentenceDetectorME(sentdetectModel);
+String sampleSentences = sent1 + " " + sent2;
+String[] sents = sentDetect.sentDetect(sampleSentences);
+Assertions.assertEquals(2, sents.length);
+Assertions.assertEquals(sent1, sents[0]);
+Assertions.assertEquals(sent2, sents[1]);
+double[] probs = sentDetect.getSentenceProbabilities();
+Assertions.assertEquals(2, probs.length);
+  }
+
+  // Reduced example taken from 'Sentences_NL.txt'
+  @Test
+  void testSentDetectWithInlineAbbreviationsEx2() {
+// Here we have one abbreviation: "d.w.z." = dat wil zeggen (eng.: that is to say)
+final String sent1 = "Met het oog op de overvloed aan ideeën die de analyse op elk " +
+"afzonderlijk element van de droominhoud brengt, zullen sommige lezers twijfels " +
+"hebben over het principe of alles wat later tijdens de analyse in je opkomt, " +
+"tot de droomgedachten gerekend mag worden, d.w.z. of aangenomen mag worden " +
+"dat al deze gedachten al tijdens de slaaptoestand actief waren en bijdroegen " +
+"aan de vorming van de droom?";

Review Comment:
   I was reviewing a pull request in Jena today, and noticed the `"""` for 

[jira] [Commented] (OPENNLP-1554) Add Dutch abbreviation dictionary

2024-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839093#comment-17839093
 ] 

ASF GitHub Bot commented on OPENNLP-1554:
-

mawiesne opened a new pull request, #597:
URL: https://github.com/apache/opennlp/pull/597

   Change
   -
   - adds abb_NL.xml to opennlp-tools/lang for the Dutch language, by using 
these resources: https://jedutchy.com/resources/abbreviations, 
https://script.byu.edu/dutch-handwriting/tools/abbreviations, 
https://en.wikipedia.org/wiki/Date_and_time_notation_in_the_Netherlands
   - adds new test cases for the DUT/NLD localization
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
   
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically main)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn 
clean install at the root opennlp folder?
   - [x] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for 
build issues and submit an update to your PR as soon as possible.
   




> Add Dutch abbreviation dictionary
> -
>
> Key: OPENNLP-1554
> URL: https://issues.apache.org/jira/browse/OPENNLP-1554
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Sentence Detector, Tokenizer
>Affects Versions: 2.3.2
>Reporter: Martin Wiesner
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.3.3
>
>
> Similar to the addition in OPENNLP-1526, an abbreviation dictionary for Dutch 
> sentence detection and tokenisation might be beneficial.
> Aims:
> Create and add a new file {{abb_NL.xml}} in {{opennlp-tools/lang/nl}}
> Add basic set of test cases





[jira] [Commented] (OPENNLP-589) Text format of Events inconsistent across different implementations of EventStreamReaders

2024-04-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838651#comment-17838651
 ] 

ASF GitHub Bot commented on OPENNLP-589:


mawiesne merged PR #596:
URL: https://github.com/apache/opennlp/pull/596




> Text format of Events inconsistent across different implementations of 
> EventStreamReaders
> -
>
> Key: OPENNLP-589
> URL: https://issues.apache.org/jira/browse/OPENNLP-589
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Machine Learning
>Affects Versions: maxent-3.0.3, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Marcin Junczys-Dowmunt
>Assignee: Martin Wiesner
>Priority: Minor
> Fix For: 2.3.3
>
>
> BasicEventStream expects events to be written to text files as:
> context1 context2 context3 ... outcome
> FileEventStream expects events to be written to text files as:
> outcome context1 context2 context3 ...
> toString() of Event creates:
> outcome [context1 context2 context3 ...] (note the square brackets, which are 
> part of context predicates when breaking on spaces).
> This is highly confusing and took me some time to understand. I guess this 
> should be unified? 
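
The mismatch between the three layouts can be made concrete with a small sketch. The class `EventTextFormats`, its nested `Event` holder, and the parser methods are hypothetical stand-ins written for illustration; they are not the OpenNLP `Event`, `BasicEventStream`, or `FileEventStream` classes, and only mirror the layouts as described in the report.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of OpenNLP.
public class EventTextFormats {

  /** Minimal stand-in for an event: one outcome plus context predicates. */
  static final class Event {
    final String outcome;
    final List<String> context;

    Event(String outcome, List<String> context) {
      this.outcome = outcome;
      this.context = context;
    }
  }

  /** Outcome-last layout: "context1 context2 ... outcome". */
  static Event parseOutcomeLast(String line) {
    String[] parts = line.trim().split("\\s+");
    return new Event(parts[parts.length - 1],
        List.of(Arrays.copyOfRange(parts, 0, parts.length - 1)));
  }

  /** Outcome-first layout: "outcome context1 context2 ...". */
  static Event parseOutcomeFirst(String line) {
    String[] parts = line.trim().split("\\s+");
    return new Event(parts[0], List.of(Arrays.copyOfRange(parts, 1, parts.length)));
  }

  /** Bracketed layout as reported for toString(): "outcome [context1 context2 ...]". */
  static String toBracketed(Event e) {
    return e.outcome + " [" + String.join(" ", e.context) + "]";
  }

  public static void main(String[] args) {
    Event original = new Event("T", List.of("a=1", "b=2"));
    String bracketed = toBracketed(original);

    // Breaking the bracketed text on spaces leaks the brackets into the predicates.
    Event reparsed = parseOutcomeFirst(bracketed);
    System.out.println(bracketed);
    System.out.println(reparsed.context);

    // Reading an outcome-first line with an outcome-last parser swaps the roles.
    System.out.println(parseOutcomeLast("T a=1 b=2").outcome);
  }
}
```

In this sketch the first and last context predicates come back as `[a=1` and `b=2]`, which is the confusion the report describes.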





[jira] [Commented] (OPENNLP-1546) NER training code example in documentation requires update

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837137#comment-17837137
 ] 

ASF GitHub Bot commented on OPENNLP-1546:
-

mawiesne merged PR #595:
URL: https://github.com/apache/opennlp/pull/595




> NER training code example in documentation requires update
> --
>
> Key: OPENNLP-1546
> URL: https://issues.apache.org/jira/browse/OPENNLP-1546
> Project: OpenNLP
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Jeff Zemerick
>Assignee: Martin Wiesner
>Priority: Major
> Fix For: 2.3.3
>
>
> The NER training code example needs to be updated.
> [https://opennlp.apache.org/docs/2.3.2/manual/opennlp.html#tools.namefind.training.api]
>  * The `TokenNameFinderFactory nameFinderFactory` part won't compile.
>  * The `model.serialize(...)` part won't compile.
>  * This code might be outdated in general.
> {code:java}
> ObjectStream lineStream =
>   new PlainTextByLineStream(new 
> MarkableFileInputStreamFactory(new File("en-ner-person.train")), 
> StandardCharsets.UTF_8);
> TokenNameFinderModel model;
> try (ObjectStream sampleStream = new 
> NameSampleDataStream(lineStream)) {
>   model = NameFinderME.train("eng", "person", sampleStream, 
> TrainingParameters.defaultParams(), nameFinderFactory);
> }
> try (ObjectStream modelOut = new BufferedOutputStream(new 
> FileOutputStream(modelFile)){
>   model.serialize(modelOut);
> }
>{code}
> For reference (but not tested):
> {code:java}
>         final InputStreamFactory in = new 
> MarkableFileInputStreamFactory(convertedTrainingFile);
>         final ObjectStream sampleStream = new 
> NameSampleDataStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));
>         final TokenNameFinderModel nameFinderModel = NameFinderME.train("en", 
> null, sampleStream, TrainingParameters.defaultParams(), 
> TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new 
> BioCodec())); {code}
>  
>  




