[jira] [Commented] (OPENNLP-1590) Clear open TODO in GenericFactoryTest
[ https://issues.apache.org/jira/browse/OPENNLP-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863516#comment-17863516 ]

ASF GitHub Bot commented on OPENNLP-1590:
-----------------------------------------

rzo1 commented on code in PR #633:
URL: https://github.com/apache/opennlp/pull/633#discussion_r1667397156

## opennlp-tools/src/main/java/opennlp/tools/util/featuregen/DictionaryFeatureGeneratorFactory.java:
## @@ -31,30 +33,39 @@ public class DictionaryFeatureGeneratorFactory
     extends GeneratorFactory.AbstractXmlFeatureGeneratorFactory {
 
+  private static final String DICT = "dict";
+
   public DictionaryFeatureGeneratorFactory() {
     super();
   }
 
   @Override
   public AdaptiveFeatureGenerator create() throws InvalidFormatException {
-    // if resourceManager is null, we don't instantiate
-    if (resourceManager == null) {
-      return null;
-    }
-
-    String dictResourceKey = getStr("dict");
-    Object dictResource = resourceManager.getResource(dictResourceKey);
-    if (!(dictResource instanceof Dictionary)) {
-      throw new InvalidFormatException("No dictionary resource for key: " + dictResourceKey);
+    Dictionary dict;
+    if (resourceManager == null) { // load the dictionary directly
+      String dictResourcePath = getStr(DICT);
+      ClassLoader cl = Thread.currentThread().getContextClassLoader();
+      try (InputStream is = cl.getResourceAsStream(dictResourcePath)) {

Review Comment:
`is` can be null after this call (if `dictResourcePath` isn't found). I didn't look into how create(...) handles that case.
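The concern in the review comment above can be illustrated with a small, self-contained sketch. `ClassLoader#getResourceAsStream` returns null when the resource is not found, and try-with-resources accepts a null resource without complaint, so the problem would only surface later when the stream is read. The helper below (`ResourceStreams.openOrFail`, a hypothetical name that is not part of OpenNLP) shows one possible way to fail early; it is not necessarily how the PR resolved the comment.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of OpenNLP: it only illustrates the null check.
final class ResourceStreams {

  private ResourceStreams() {
  }

  /**
   * Opens a classpath resource and throws an IOException when it is missing,
   * instead of handing a null stream to the caller.
   */
  static InputStream openOrFail(ClassLoader cl, String path) throws IOException {
    InputStream is = cl.getResourceAsStream(path);
    if (is == null) {
      throw new IOException("No resource found at: " + path);
    }
    return is;
  }

  public static void main(String[] args) {
    ClassLoader cl = ResourceStreams.class.getClassLoader();
    // The path below is made up and is expected to be absent from the classpath.
    try (InputStream is = openOrFail(cl, "no/such/dictionary.xml")) {
      System.out.println("Loaded " + is.available() + " bytes");
    } catch (IOException e) {
      System.out.println("Failed: " + e.getMessage());
    }
  }
}
```

With a check like this, a missing dictionary would end up in the existing `catch (IOException e)` branch of the diff and be reported as an `InvalidFormatException`, rather than as a failure inside `create(is)`.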
## opennlp-tools/src/main/java/opennlp/tools/util/featuregen/DictionaryFeatureGeneratorFactory.java:
## @@ -31,30 +33,39 @@ public class DictionaryFeatureGeneratorFactory
     extends GeneratorFactory.AbstractXmlFeatureGeneratorFactory {
 
+  private static final String DICT = "dict";
+
   public DictionaryFeatureGeneratorFactory() {
     super();
   }
 
   @Override
   public AdaptiveFeatureGenerator create() throws InvalidFormatException {
-    // if resourceManager is null, we don't instantiate
-    if (resourceManager == null) {
-      return null;
-    }
-
-    String dictResourceKey = getStr("dict");
-    Object dictResource = resourceManager.getResource(dictResourceKey);
-    if (!(dictResource instanceof Dictionary)) {
-      throw new InvalidFormatException("No dictionary resource for key: " + dictResourceKey);
+    Dictionary dict;
+    if (resourceManager == null) { // load the dictionary directly
+      String dictResourcePath = getStr(DICT);
+      ClassLoader cl = Thread.currentThread().getContextClassLoader();
+      try (InputStream is = cl.getResourceAsStream(dictResourcePath)) {
+        dict = ((DictionarySerializer) getArtifactSerializerMapping().get(dictResourcePath)).create(is);
+      } catch (IOException e) {
+        throw new InvalidFormatException("No dictionary resource at: " + dictResourcePath, e);
+      }
+    } else { // get the dictionary via a resourceManager lookup
+      String dictResourceKey = getStr(DICT);
+      Object dictResource = resourceManager.getResource(dictResourceKey);
+      if (dictResource instanceof Dictionary) {

Review Comment:
Since we are on Java 17, we can omit the cast in the subsequent line by using pattern matching in the instanceof check.
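The suggestion above refers to pattern matching for `instanceof`, which is a standard language feature as of Java 16. A minimal sketch of the idiom follows; the class `InstanceofPatternDemo` and its use of `Map` as a stand-in for a dictionary resource are assumptions for illustration and do not mirror the OpenNLP types.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: Map stands in for the Dictionary type from the diff.
final class InstanceofPatternDemo {

  private InstanceofPatternDemo() {
  }

  static String describe(Object resource) {
    // The type test and the binding of 'dict' happen in one step,
    // so no explicit cast is needed on the following line.
    if (resource instanceof Map<?, ?> dict) {
      return "dictionary with " + dict.size() + " entries";
    }
    throw new IllegalArgumentException("No dictionary resource: " + resource);
  }

  public static void main(String[] args) {
    Map<String, String> entries = new HashMap<>();
    entries.put("key", "value");
    System.out.println(describe(entries));
  }
}
```

The binding variable is only in scope where the test is known to have succeeded, which keeps the happy path free of casts.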
> Clear open TODO in GenericFactoryTest > - > > Key: OPENNLP-1590 > URL: https://issues.apache.org/jira/browse/OPENNLP-1590 > Project: OpenNLP > Issue Type: Test >Affects Versions: 2.3.3 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.4.0 > > > The test case _testDictionaryArtifactToSerializerMappingExtraction()_ in > _GeneratorFactoryTest_ has a TODO note that can be cleared by adjusting the > test case slightly, as well as the related inconsistent implementation > class(es). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1590) Clear open TODO in GenericFactoryTest
[ https://issues.apache.org/jira/browse/OPENNLP-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863515#comment-17863515 ]

ASF GitHub Bot commented on OPENNLP-1590:
-----------------------------------------

mawiesne opened a new pull request, #633:
URL: https://github.com/apache/opennlp/pull/633

Change
------
- adjusts `DictionaryFeatureGeneratorFactory` to handle situations without a ResourceManager instance at runtime
- fixes a missing Exception type in the catch block of `GeneratorFactory#buildGenerator`, which wasn't handled correctly
- clears TODO in GeneratorFactoryTest
- adds another test case to demonstrate that a descriptively declared dictionary is loaded for the creation of a `DictionaryFeatureGeneratorFactory`
- adds a related test resource

Tasks
-----
Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically main)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
- [x] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.
[jira] [Commented] (OPENNLP-1588) Clean out deprecated code marked for removal
[ https://issues.apache.org/jira/browse/OPENNLP-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863456#comment-17863456 ] ASF GitHub Bot commented on OPENNLP-1588: - mawiesne merged PR #632: URL: https://github.com/apache/opennlp/pull/632 > Clean out deprecated code marked for removal > - > > Key: OPENNLP-1588 > URL: https://issues.apache.org/jira/browse/OPENNLP-1588 > Project: OpenNLP > Issue Type: Task >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.4.0 > > > Some record classes (still) carry convenience getters around, which are > marked for removal for more than 1y now. Some other deprecated code fragments > can also be removed now, since those are deprecated much longer and marked > for removal. -- This message was sent by Atlassian Jira (v8.20.10#820010)
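The kind of cleanup described in the issue above can be pictured with a small sketch. The record `TokenSpan` and its `getStart()` method are hypothetical and do not correspond to a specific OpenNLP class; the sketch only shows a bean-style convenience getter that duplicates a record accessor and is flagged with `@Deprecated(forRemoval = true)` before being deleted in a later release.

```java
// Hypothetical record for illustration; not an OpenNLP type.
record TokenSpan(int start, int end) {

  /**
   * Bean-style convenience getter that duplicates the generated accessor.
   *
   * @deprecated use {@link #start()} instead.
   */
  @Deprecated(forRemoval = true)
  int getStart() {
    return start;
  }

  public static void main(String[] args) {
    TokenSpan span = new TokenSpan(2, 5);
    // Preferred: the accessor that the record generates automatically.
    System.out.println(span.start());
    // Still works, but callers outside this class get a removal warning from javac.
    System.out.println(span.getStart());
  }
}
```

Once such a getter has been flagged for a sufficient period, removing it leaves `start()` as the only accessor, which is the state this ticket aims for.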
[jira] [Commented] (OPENNLP-1588) Clean out deprecated code marked for removal
[ https://issues.apache.org/jira/browse/OPENNLP-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862884#comment-17862884 ]

ASF GitHub Bot commented on OPENNLP-1588:
-----------------------------------------

mawiesne opened a new pull request, #632:
URL: https://github.com/apache/opennlp/pull/632

Change
------
- removes deprecated getters in several records
- removes a long-deprecated constructor of TokenizerME
- marks a constructor of ADNameSampleStream as "for removal" to indicate actual removal is more likely in upcoming releases
- marks deprecated 'NameFinderEventStream#generateOutcomes(..)' as "for removal" to indicate actual removal is more likely in upcoming releases

Tasks
-----
Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically main)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.
[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
[ https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862524#comment-17862524 ]

ASF GitHub Bot commented on OPENNLP-1568:
-----------------------------------------

rzo1 commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1661109712

## opennlp-brat-annotator/src/main/bin/brat-annotation-service:
## @@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.
 
+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
Yep, specific to Linux.

> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---------------------------------------------------------------------------
>
>                 Key: OPENNLP-1568
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1568
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Command Line Interface
>    Affects Versions: 2.3.3
>         Environment: Linux/Bash
>            Reporter: Alexander Veit
>            Priority: Major
>             Fix For: 2.3.4
>
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
> No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
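The issue description above attributes the failure to the relative path `../conf/log4j2.xml`. The following sketch shows why such a path depends on the directory it is resolved against. The directory names (`/opt/opennlp/bin`, `/home/user/project`) and the helper `RelativeConfigPath.resolveFrom` are assumptions chosen for illustration, and the sketch only performs path arithmetic; it does not reproduce what Log4j does internally.

```java
import java.nio.file.Path;

// Illustrative only: resolves the same relative path against two base directories.
final class RelativeConfigPath {

  private RelativeConfigPath() {
  }

  /** Resolves 'relative' against 'baseDir' and collapses any ".." segments. */
  static Path resolveFrom(Path baseDir, String relative) {
    return baseDir.resolve(relative).normalize();
  }

  public static void main(String[] args) {
    String relative = "../conf/log4j2.xml";
    // Started from the bin directory: the path points into the installation.
    System.out.println(resolveFrom(Path.of("/opt/opennlp/bin"), relative));
    // Started from anywhere else: the path points to an unrelated location.
    System.out.println(resolveFrom(Path.of("/home/user/project"), relative));
  }
}
```

Deriving the installation directory from the script location, as the shell snippet in the review does, removes this dependency on the caller's working directory.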
[jira] [Commented] (OPENNLP-1583) Reduce compiler warnings in opennlp-tools
[ https://issues.apache.org/jira/browse/OPENNLP-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862518#comment-17862518 ] ASF GitHub Bot commented on OPENNLP-1583: - rzo1 merged PR #626: URL: https://github.com/apache/opennlp/pull/626 > Reduce compiler warnings in opennlp-tools > - > > Key: OPENNLP-1583 > URL: https://issues.apache.org/jira/browse/OPENNLP-1583 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: 2.3.3 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.3.4 > > > We still have a bunch of compiler warnings. Those should be reduced to and > kept at a minimum. > Aims: > - Get rid of most of them, as best as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1584) FeatureGeneratorUtil shall detect German umlauts with dot as 'cp'
[ https://issues.apache.org/jira/browse/OPENNLP-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862521#comment-17862521 ] ASF GitHub Bot commented on OPENNLP-1584: - rzo1 merged PR #628: URL: https://github.com/apache/opennlp/pull/628 > FeatureGeneratorUtil shall detect German umlauts with dot as 'cp' > - > > Key: OPENNLP-1584 > URL: https://issues.apache.org/jira/browse/OPENNLP-1584 > Project: OpenNLP > Issue Type: Improvement > Components: Name Finder >Affects Versions: 2.3.3 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.3.4 > > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > German names, such as Änne, Özlem, or Ümit, should be recognized in their > abbreviated short form (Ä., Ü., Ö.) by the FeatureGeneratorUtil class. > Atm, recognition fails, as the Pattern "capPeriod" only takes regular, > capitalized letters into account. This can be fixed easily. -- This message was sent by Atlassian Jira (v8.20.10#820010)
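The behaviour described in the issue above can be sketched with two regular expressions. The exact definition of `capPeriod` in `FeatureGeneratorUtil` is not quoted in this thread, so the first pattern below is an assumed shape (one ASCII capital followed by a period), and the second is just one possible way to also accept Ä, Ö and Ü; the merged fix may differ. Unicode escapes are used so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Illustrative only: both patterns are assumptions, not copied from OpenNLP.
final class CapPeriodDemo {

  // Assumed original shape: a single ASCII capital letter followed by a period.
  static final Pattern ASCII_ONLY = Pattern.compile("^[A-Z]\\.$");

  // Possible fix: also accept capital A-umlaut, O-umlaut and U-umlaut.
  static final Pattern WITH_UMLAUTS = Pattern.compile("^[A-Z\u00C4\u00D6\u00DC]\\.$");

  private CapPeriodDemo() {
  }

  static boolean isCapPeriod(Pattern pattern, String token) {
    return pattern.matcher(token).matches();
  }

  public static void main(String[] args) {
    String abbreviated = "\u00C4."; // capital A-umlaut followed by a period
    System.out.println(isCapPeriod(ASCII_ONLY, abbreviated));
    System.out.println(isCapPeriod(WITH_UMLAUTS, abbreviated));
  }
}
```

An alternative would be the Unicode property class `\p{Lu}`, which covers uppercase letters of all scripts; whether that broader match is desirable for the feature generator is a separate design question.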
[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
[ https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862520#comment-17862520 ] ASF GitHub Bot commented on OPENNLP-1568: - rzo1 merged PR #607: URL: https://github.com/apache/opennlp/pull/607
[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
[ https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17862519#comment-17862519 ] ASF GitHub Bot commented on OPENNLP-1568: - rzo1 commented on PR #607: URL: https://github.com/apache/opennlp/pull/607#issuecomment-2200245765 Thanks @veita @kinow I merged it now because it is a valid improvement. We can discuss / adjust edge cases with realpath etc. in another iteration, if needed.
[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13
[ https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855902#comment-17855902 ] ASF GitHub Bot commented on OPENNLP-1577: - rzo1 merged PR #616: URL: https://github.com/apache/opennlp/pull/616 > SLF4J 2.0.13 > > > Key: OPENNLP-1577 > URL: https://issues.apache.org/jira/browse/OPENNLP-1577 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator
[ https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855903#comment-17855903 ] ASF GitHub Bot commented on OPENNLP-1578: - rzo1 merged PR #621: URL: https://github.com/apache/opennlp/pull/621 > Jersey 3.1.7 in Brat-Annotator > -- > > Key: OPENNLP-1578 > URL: https://issues.apache.org/jira/browse/OPENNLP-1578 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator
[ https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855899#comment-17855899 ] ASF GitHub Bot commented on OPENNLP-1578: - dependabot[bot] commented on PR #621: URL: https://github.com/apache/opennlp/pull/621#issuecomment-2175893444 Looks like this PR has been edited by someone other than Dependabot. That means Dependabot can't rebase it - sorry! If you're happy for Dependabot to recreate it from scratch, overwriting any edits, you can request `@dependabot recreate`.
[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator
[ https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855898#comment-17855898 ] ASF GitHub Bot commented on OPENNLP-1578: - rzo1 commented on PR #621: URL: https://github.com/apache/opennlp/pull/621#issuecomment-2175893335 @dependabot rebase
[jira] [Commented] (OPENNLP-1578) Jersey 3.1.7 in Brat-Annotator
[ https://issues.apache.org/jira/browse/OPENNLP-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855897#comment-17855897 ] ASF GitHub Bot commented on OPENNLP-1578: - rzo1 commented on PR #621: URL: https://github.com/apache/opennlp/pull/621#issuecomment-2175893253 The upgrade to the Jakarta namespace should be safe here since it is only used for providing a RESTful service within this tool.
[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13
[ https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855896#comment-17855896 ] ASF GitHub Bot commented on OPENNLP-1577: - rzo1 commented on PR #616: URL: https://github.com/apache/opennlp/pull/616#issuecomment-2175891974 Note: SLF4J2 is now much more common than it was a few months ago. It should be safe to upgrade here.
[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13
[ https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855894#comment-17855894 ] ASF GitHub Bot commented on OPENNLP-1577: - rzo1 commented on PR #616: URL: https://github.com/apache/opennlp/pull/616#issuecomment-2175891354 @dependabot rebase
[jira] [Commented] (OPENNLP-1577) SLF4J 2.0.13
[ https://issues.apache.org/jira/browse/OPENNLP-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855895#comment-17855895 ] ASF GitHub Bot commented on OPENNLP-1577: - dependabot[bot] commented on PR #616: URL: https://github.com/apache/opennlp/pull/616#issuecomment-2175891449 Looks like this PR has been edited by someone other than Dependabot. That means Dependabot can't rebase it - sorry! If you're happy for Dependabot to recreate it from scratch, overwriting any edits, you can request `@dependabot recreate`.
[jira] [Commented] (OPENNLP-1573) Forbidden APIS 3.7
[ https://issues.apache.org/jira/browse/OPENNLP-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855892#comment-17855892 ] ASF GitHub Bot commented on OPENNLP-1573: - rzo1 merged PR #615: URL: https://github.com/apache/opennlp/pull/615 > Forbidden APIS 3.7 > -- > > Key: OPENNLP-1573 > URL: https://issues.apache.org/jira/browse/OPENNLP-1573 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1570) ONNX Runtime 1.18.0
[ https://issues.apache.org/jira/browse/OPENNLP-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855893#comment-17855893 ] ASF GitHub Bot commented on OPENNLP-1570: - rzo1 merged PR #614: URL: https://github.com/apache/opennlp/pull/614 > ONNX Runtime 1.18.0 > --- > > Key: OPENNLP-1570 > URL: https://issues.apache.org/jira/browse/OPENNLP-1570 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1576) Update build related dependencies (Checkstyle, Docbkx, Build Helper)
[ https://issues.apache.org/jira/browse/OPENNLP-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855890#comment-17855890 ] ASF GitHub Bot commented on OPENNLP-1576: - rzo1 merged PR #620: URL: https://github.com/apache/opennlp/pull/620 > Update build related dependencies (Checkstyle, Docbkx, Build Helper) > > > Key: OPENNLP-1576 > URL: https://issues.apache.org/jira/browse/OPENNLP-1576 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1576) Update build related dependencies (Checkstyle, Docbkx, Build Helper)
[ https://issues.apache.org/jira/browse/OPENNLP-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855889#comment-17855889 ] ASF GitHub Bot commented on OPENNLP-1576: - rzo1 merged PR #619: URL: https://github.com/apache/opennlp/pull/619
[jira] [Commented] (OPENNLP-1574) Jacoco 0.8.12
[ https://issues.apache.org/jira/browse/OPENNLP-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855891#comment-17855891 ] ASF GitHub Bot commented on OPENNLP-1574: - rzo1 merged PR #618: URL: https://github.com/apache/opennlp/pull/618 > Jacoco 0.8.12 > - > > Key: OPENNLP-1574 > URL: https://issues.apache.org/jira/browse/OPENNLP-1574 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1572) JUnit 5.10.2
[ https://issues.apache.org/jira/browse/OPENNLP-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855886#comment-17855886 ] ASF GitHub Bot commented on OPENNLP-1572: - rzo1 merged PR #612: URL: https://github.com/apache/opennlp/pull/612 > JUnit 5.10.2 > > > Key: OPENNLP-1572 > URL: https://issues.apache.org/jira/browse/OPENNLP-1572 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1576) Update build related dependencies (Checkstyle, Docbkx, Build Helper)
[ https://issues.apache.org/jira/browse/OPENNLP-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855888#comment-17855888 ] ASF GitHub Bot commented on OPENNLP-1576: - rzo1 merged PR #617: URL: https://github.com/apache/opennlp/pull/617
[jira] [Commented] (OPENNLP-1571) Jackson 2.17.1
[ https://issues.apache.org/jira/browse/OPENNLP-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855887#comment-17855887 ] ASF GitHub Bot commented on OPENNLP-1571: - rzo1 merged PR #613: URL: https://github.com/apache/opennlp/pull/613 > Jackson 2.17.1 > -- > > Key: OPENNLP-1571 > URL: https://issues.apache.org/jira/browse/OPENNLP-1571 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1575) Update GH action versions
[ https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855885#comment-17855885 ] ASF GitHub Bot commented on OPENNLP-1575: - rzo1 merged PR #610: URL: https://github.com/apache/opennlp/pull/610 > Update GH action versions > - > > Key: OPENNLP-1575 > URL: https://issues.apache.org/jira/browse/OPENNLP-1575 > Project: OpenNLP > Issue Type: Dependency upgrade >Reporter: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > [https://github.com/apache/opennlp/pull/609] > [https://github.com/apache/opennlp/pull/610] > https://github.com/apache/opennlp/pull/611 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1575) Update GH action versions
[ https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855884#comment-17855884 ] ASF GitHub Bot commented on OPENNLP-1575: - rzo1 merged PR #611: URL: https://github.com/apache/opennlp/pull/611
[jira] [Commented] (OPENNLP-1575) Update GH action versions
[ https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855881#comment-17855881 ] ASF GitHub Bot commented on OPENNLP-1575: - rzo1 merged PR #609: URL: https://github.com/apache/opennlp/pull/609
[jira] [Commented] (OPENNLP-1575) Update GH action versions
[ https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855882#comment-17855882 ] ASF GitHub Bot commented on OPENNLP-1575: - rzo1 commented on PR #610: URL: https://github.com/apache/opennlp/pull/610#issuecomment-2175876905 @dependabot rebase
[jira] [Commented] (OPENNLP-1575) Update GH action versions
[ https://issues.apache.org/jira/browse/OPENNLP-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855883#comment-17855883 ] ASF GitHub Bot commented on OPENNLP-1575: - rzo1 commented on PR #611: URL: https://github.com/apache/opennlp/pull/611#issuecomment-2175876989 @dependabot rebase
[jira] [Commented] (OPENNLP-1569) Enable Dependabot for OpenNLP
[ https://issues.apache.org/jira/browse/OPENNLP-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855838#comment-17855838 ] ASF GitHub Bot commented on OPENNLP-1569: - rzo1 merged PR #608: URL: https://github.com/apache/opennlp/pull/608 > Enable Dependabot for OpenNLP > - > > Key: OPENNLP-1569 > URL: https://issues.apache.org/jira/browse/OPENNLP-1569 > Project: OpenNLP > Issue Type: Task >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > We still need to create JIRAs for update but it will help us to avoid missing > something important -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1566) Array writing error in code example
[ https://issues.apache.org/jira/browse/OPENNLP-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855811#comment-17855811 ] ASF GitHub Bot commented on OPENNLP-1566: - mawiesne merged PR #605: URL: https://github.com/apache/opennlp/pull/605 > Array writing error in code example > --- > > Key: OPENNLP-1566 > URL: https://issues.apache.org/jira/browse/OPENNLP-1566 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.3 >Reporter: shellrean >Priority: Minor > Fix For: 2.3.4 > > Attachments: Screenshot 2024-06-10 at 23.31.20.png > > Original Estimate: 12h > Remaining Estimate: 12h > > There was an error in writing the array symbol in the documentation, it > should be String[] but what is displayed here is String variable[]. > !Screenshot 2024-06-10 at 23.31.20.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1569) Enable Dependabot for OpenNLP
[ https://issues.apache.org/jira/browse/OPENNLP-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855807#comment-17855807 ]

ASF GitHub Bot commented on OPENNLP-1569:
-

rzo1 opened a new pull request, #608:
URL: https://github.com/apache/opennlp/pull/608

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically main)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
We still need to create JIRA tickets for updates, but it will help us to avoid missing something important.
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855805#comment-17855805 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 merged PR #606: URL: https://github.com/apache/opennlp/pull/606 > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855597#comment-17855597 ] ASF GitHub Bot commented on OPENNLP-1567: - jzonthemtn commented on PR #606: URL: https://github.com/apache/opennlp/pull/606#issuecomment-2173208163 Built and tested fine on Windows 11. ``` Apache Maven 3.9.8 (36645f6c9b5079805ea5009217e36f2cffd34256) Maven home: C:\Program Files\apache-maven-3.9.8 Java version: 21.0.1, vendor: Amazon.com Inc., runtime: C:\Program Files\Amazon Corretto\jdk21.0.1_12 Default locale: en_US, platform encoding: UTF-8 OS name: "windows 11", version: "10.0", arch: "amd64", family: "windows" ``` > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855515#comment-17855515 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on PR #606: URL: https://github.com/apache/opennlp/pull/606#issuecomment-2172501124 > Tests on a local Windows 10 are ok now (via IDE + Maven). CI is still complaining, so requires some more digging in that area ;-) This is fixed now. Removing DRAFT state. > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855502#comment-17855502 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on PR #606: URL: https://github.com/apache/opennlp/pull/606#issuecomment-2172443863 Tests on a local Windows 10 are ok now (via IDE + Maven). CI is still complaining, so requires some more digging in that area ;-) > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
[ https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855447#comment-17855447 ]

ASF GitHub Bot commented on OPENNLP-1568:
-

kinow commented on code in PR #607:
URL: https://github.com/apache/opennlp/pull/607#discussion_r1641987177

## opennlp-brat-annotator/src/main/bin/brat-annotation-service: ##
@@ -21,6 +21,28 @@
 #may be inadvertantly placed in any output files if
 #output redirection is used.

+# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG="`dirname "$PRG"`/$link"
+  fi
+done

Review Comment:
Thank you @veita

I asked because, from a quick read of the code, it seemed to have a regex that could lead to an error in a certain scenario. You mentioned it's well-proven, but do you know if it was tested (e.g. with bats, or another unit test, covering several cases), or has it just been around for a long time and never raised any issues?

It's late here, so my logic might be wrong, sorry. Here's what I tried.

```bash
kinow@ranma:/tmp$ mkdir /tmp/opennlp-1568
kinow@ranma:/tmp$ cd /tmp/opennlp-1568
kinow@ranma:/tmp/opennlp-1568$ cat test.sh
#!/bin/bash
PRG="$0"

while [ -h "$PRG" ] ; do
  ls=`ls -ld "$PRG"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '/.*' > /dev/null; then
    PRG="$link"
  else
    PRG="`dirname "$PRG"`/$link"
  fi
done

echo "${PRG}"
kinow@ranma:/tmp/opennlp-1568$ # not a symlink, all good...
kinow@ranma:/tmp/opennlp-1568$ bash test.sh
test.sh
```

Then if I create a symlink `/tmp/test/opennlp.sh`, it works, and the correct path is resolved, same as `realpath`.

```bash
kinow@ranma:/tmp/opennlp-1568$ cd ../
kinow@ranma:/tmp$ mkdir test
kinow@ranma:/tmp$ cd test
kinow@ranma:/tmp/test$ ln -s /tmp/opennlp-1568/test.sh opennlp.sh
kinow@ranma:/tmp/test$ bash opennlp.sh
/tmp/opennlp-1568/test.sh
```

But if the file is not in `/tmp/opennlp-1568`, but rather in some folder like `/tmp/opennlp -> tests/`:

```bash
kinow@ranma:/tmp$ mkdir '/tmp/opennlp -> tests/'
kinow@ranma:/tmp$ cd '/tmp/opennlp -> tests/'
kinow@ranma:/tmp/opennlp -> tests$ cp ../opennlp-1568/test.sh .
kinow@ranma:/tmp/opennlp -> tests$ cd /tmp/test/
kinow@ranma:/tmp/test$ ln -s '/tmp/opennlp -> tests/test.sh' opennlp2.sh
kinow@ranma:/tmp/test$ bash opennlp2.sh
./tests/test.sh
```

I believe it's due to the regex in the `expr` (but again, late here, came just to check the score of a soccer :soccer: match and saw the GH notification). Here's the `-eux` output.

```bash
kinow@ranma:/tmp/test$ bash -eux opennlp2.sh
+ PRG=opennlp2.sh
+ '[' -h opennlp2.sh ']'
++ ls -ld opennlp2.sh
+ ls='lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh'
++ expr 'lrwxrwxrwx 1 kinow kinow 29 Jun 16 23:36 opennlp2.sh -> /tmp/opennlp -> tests/test.sh' : '.*-> \(.*\)$'
+ link=tests/test.sh
+ expr tests/test.sh : '/.*'
++ dirname opennlp2.sh
+ PRG=./tests/test.sh
+ '[' -h ./tests/test.sh ']'
+ echo ./tests/test.sh
./tests/test.sh
```

And `realpath`:

```bash
kinow@ranma:/tmp/test$ realpath opennlp.sh
/tmp/opennlp-1568/test.sh
kinow@ranma:/tmp/test$ realpath opennlp2.sh
/tmp/opennlp -> tests/test.sh
```

We do not need to use `realpath`. I was honestly just curious about this. We don't even need to fix the regex if you/others agree users are not likely to create paths with ` -> ` :+1:

Other than that, the changes look OK to me (:soccer: :wave: )

> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
> ---
>
> Key: OPENNLP-1568
> URL: https://issues.apache.org/jira/browse/OPENNLP-1568
> Project: OpenNLP
> Issue Type: Bug
> Components: Command Line Interface
> Affects Versions: 2.3.3
> Environment: Linux/Bash
> Reporter: Alexander Veit
> Priority: Major
>
> Try to run the opennlp command from outside $OPENNLP_HOME/bin directory.
> It fails with an error message similar to
>
> {noformat}
> 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed:
> No configuration found for '4f2410ac' at 'null' in 'null'{noformat}
>
> The error is caused by the relative path in
> {code:java}
> -Dlog4j.configurationFile=../conf/log4j2.xml {code}
> of the opennlp script.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
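For the ` -> ` case in the comment above, here is a minimal sketch (not part of PR #607) of one way to avoid the greedy `.*-> ` match while staying within POSIX sh. It assumes that `ls -ld` prints the link name exactly as it was passed, followed by ` -> ` and the target, and it strips the shortest prefix ending in `<name> -> ` using parameter expansion. The function name `resolve_link` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not the code from PR #607):
# follow "$1" through symlinks and print the final path.
resolve_link() {
  prg="$1"
  while [ -h "$prg" ]; do
    ls_out=$(ls -ld "$prg")
    # Remove the SHORTEST prefix that ends in "<link name> -> ".
    # The quoted part is matched literally, so a target that itself
    # contains " -> " is kept whole.
    link=${ls_out#*"$prg -> "}
    case "$link" in
      /*) prg="$link" ;;                      # absolute target
      *)  prg="$(dirname "$prg")/$link" ;;    # target relative to the link
    esac
  done
  printf '%s\n' "$prg"
}

# Usage: PRG=$(resolve_link "$0")
```

With the `opennlp2.sh` link from the transcript, this should print the full target including the ` -> ` directory rather than `./tests/test.sh`. Names containing newlines and cyclic links are still not handled, the same as in the original loop.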
[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
[ https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855432#comment-17855432 ] ASF GitHub Bot commented on OPENNLP-1568: - veita commented on code in PR #607: URL: https://github.com/apache/opennlp/pull/607#discussion_r1641947136 ## opennlp-brat-annotator/src/main/bin/brat-annotation-service: ## @@ -21,6 +21,28 @@ #may be inadvertantly placed in any output files if #output redirection is used. +# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home +PRG="$0" + +while [ -h "$PRG" ] ; do + ls=`ls -ld "$PRG"` + link=`expr "$ls" : '.*-> \(.*\)$'` + if expr "$link" : '/.*' > /dev/null; then +PRG="$link" + else +PRG="`dirname "$PRG"`/$link" + fi +done Review Comment: This is portable and well-proven code also used by other Apache projects like Ant, Maven or Groovy. `realpath` and `pushd` are not portable. > opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory > --- > > Key: OPENNLP-1568 > URL: https://issues.apache.org/jira/browse/OPENNLP-1568 > Project: OpenNLP > Issue Type: Bug > Components: Command Line Interface >Affects Versions: 2.3.3 > Environment: Linux/Bash >Reporter: Alexander Veit >Priority: Major > > Try to run the opennlp command from outside $OPENNLP_HOME/bin directory. > It fails with an error message similar to > > {noformat} > 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed: > No configuration found for '4f2410ac' at 'null' in 'null'{noformat} > > The error is caused by the relative path in > {code:java} > -Dlog4j.configurationFile=../conf/log4j2.xml {code} > of the opennlp script. -- This message was sent by Atlassian Jira (v8.20.10#820010)
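The issue description quoted above attributes the failure to the relative `../conf/log4j2.xml` path. As an illustration only (the actual change is in PR #607 and may differ), the sketch below derives an absolute home directory from the already resolved launcher path and builds the log4j option from it. The function names `opennlp_home_of` and `log4j_option`, and the `<home>/bin` plus `<home>/conf` layout, are assumptions taken from the issue text.

```shell
#!/bin/sh
# Hypothetical helpers, assuming the launcher lives in <home>/bin and the
# configuration in <home>/conf, as the issue description suggests.

# Print the absolute path of the parent of the directory containing "$1".
opennlp_home_of() {
  ( CDPATH= cd "$(dirname "$1")/.." && pwd )
}

# Print a log4j2 option that no longer depends on the caller's working directory.
log4j_option() {
  printf '%s\n' "-Dlog4j.configurationFile=$(opennlp_home_of "$1")/conf/log4j2.xml"
}

# Usage: java "$(log4j_option "$PRG")" ...
```

Because the home directory is computed from the script location instead of the current directory, the option should stay valid when the launcher is invoked from anywhere.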
[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
[ https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855429#comment-17855429 ] ASF GitHub Bot commented on OPENNLP-1568: - kinow commented on code in PR #607: URL: https://github.com/apache/opennlp/pull/607#discussion_r1641918499 ## opennlp-brat-annotator/src/main/bin/brat-annotation-service: ## @@ -21,6 +21,28 @@ #may be inadvertantly placed in any output files if #output redirection is used. +# determine OPENNLP_HOME - $0 may be a symlink to OpenNLP's home +PRG="$0" + +while [ -h "$PRG" ] ; do + ls=`ls -ld "$PRG"` + link=`expr "$ls" : '.*-> \(.*\)$'` + if expr "$link" : '/.*' > /dev/null; then +PRG="$link" + else +PRG="`dirname "$PRG"`/$link" + fi +done Review Comment: I think `realpath` has been available in Debian/Ubuntu for a while (from GNU coreutils, I think). And it should be available on macOS too (at least I remember using it on an old Intel Mac). I think the code above is doing something similar to `realpath`, but using `expr` to parse the output of the commands with a regex, or `dirname` (although I am not sure whether that works if the regex fails but the file is still a symlink?). Is there a reason for not using `realpath` here? Or something else like `pushd $PRG ; PRG=pwd -P; popd` (or in a subshell, etc.)? > opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory > --- > > Key: OPENNLP-1568 > URL: https://issues.apache.org/jira/browse/OPENNLP-1568 > Project: OpenNLP > Issue Type: Bug > Components: Command Line Interface >Affects Versions: 2.3.3 > Environment: Linux/Bash >Reporter: Alexander Veit >Priority: Major > > Try to run the opennlp command from outside $OPENNLP_HOME/bin directory. 
> It fails with an error message similar to > > {noformat} > 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed: > No configuration found for '4f2410ac' at 'null' in 'null'{noformat} > > The error is caused by the relative path in > {code:java} > -Dlog4j.configurationFile=../conf/log4j2.xml {code} > of the opennlp script. -- This message was sent by Atlassian Jira (v8.20.10#820010)
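The `pushd` / `pwd -P` idea raised in the review comment can be sketched without `pushd`, which is a shell extension rather than a POSIX utility: a `cd` inside a subshell followed by `pwd -P` prints the physical directory and leaves the caller's working directory untouched. This is a sketch under assumptions. `physical_dir` is a hypothetical name, and the approach only resolves symlinked directories on the way to the script; it does not follow a symlink that is the script file itself.

```shell
#!/bin/sh
# Hypothetical helper: print the physical (symlink-free) directory that
# contains "$1". Runs in a subshell, so the caller's cwd is unchanged.
physical_dir() {
  ( CDPATH= cd "$(dirname "$1")" && pwd -P )
}

# Usage: BIN_DIR=$(physical_dir "$0")
```

Both `cd` and `pwd -P` are specified by POSIX, which may address the portability concern raised about `realpath` and `pushd`, though the file-level symlink case would still need a loop like the one in the PR.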
[jira] [Commented] (OPENNLP-1568) opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory
[ https://issues.apache.org/jira/browse/OPENNLP-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855402#comment-17855402 ]

ASF GitHub Bot commented on OPENNLP-1568:
-

veita opened a new pull request, #607:
URL: https://github.com/apache/opennlp/pull/607

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically main)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.
> opennlp command fails when invoked from outside $OPENNLP_HOME/bin directory > --- > > Key: OPENNLP-1568 > URL: https://issues.apache.org/jira/browse/OPENNLP-1568 > Project: OpenNLP > Issue Type: Bug > Components: Command Line Interface >Affects Versions: 2.3.3 > Environment: Linux/Bash >Reporter: Alexander Veit >Priority: Major > > Try to run the opennlp command from outside $OPENNLP_HOME/bin directory. > It fails with an error message similar to > > {noformat} > 2024-06-15T22:44:04.900344345Z main ERROR Reconfiguration failed: > No configuration found for '4f2410ac' at 'null' in 'null'{noformat} > > The error is caused by the relative path in > {code:java} > -Dlog4j.configurationFile=../conf/log4j2.xml {code} > of the opennlp script. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855133#comment-17855133 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on PR #606: URL: https://github.com/apache/opennlp/pull/606#issuecomment-2168662142 Since I don't have a Windows system anymore, this PR needs someone with a Windows machine to check the implementation of `SimpleClassPathModelFinder` and the related tests ... ;-) > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855132#comment-17855132 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1640265283 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Now contain a simple classpath scanning implementation which doesn't cover edge cases but makes the unit tests happy. > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855019#comment-17855019 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639610514 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Switching to draft and will implement a service without classgraph. Once I am done, I will ping here again. > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855007#comment-17855007 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639557402 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Yep. Let's change within this PR. > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855000#comment-17855000 ] ASF GitHub Bot commented on OPENNLP-1567: - mawiesne commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639520644 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Providing a "... loader/scan alternative which does not cover all edge cases and declare classgraph as an _optional_ dependency in this module." This is the way. Leaves the choice to the end-user and/or project on top of OpenNLP(-models). 
> OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
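A minimal sketch of how an _optional_ `classgraph` dependency could be detected at runtime, so that a caller can fall back to a simpler scan when the library is absent. The class `OptionalDependencyCheck` and its methods are hypothetical names used for illustration only and are not part of OpenNLP; only the class name `io.github.classgraph.ClassGraph` is taken from the import in the diff above.

```java
/** Hypothetical helper; the class and method names are illustrative only. */
final class OptionalDependencyCheck {

  /** Fully qualified class name as imported in the quoted diff. */
  static final String CLASSGRAPH_CLASS = "io.github.classgraph.ClassGraph";

  private OptionalDependencyCheck() {
  }

  /**
   * Probes whether a class can be loaded by the given class loader.
   * Passing 'false' avoids running static initializers just for the probe.
   */
  static boolean isPresent(String className, ClassLoader loader) {
    try {
      Class.forName(className, false, loader);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Returns a label for the scan strategy a caller might pick (assumed labels). */
  static String chooseFinder(ClassLoader loader) {
    return isPresent(CLASSGRAPH_CLASS, loader) ? "classgraph" : "simple";
  }
}
```

With such a probe the choice stays with the end user: adding `classgraph` to the classpath would switch the strategy, while leaving it out would select the simpler fallback.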
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854999#comment-17854999 ] ASF GitHub Bot commented on OPENNLP-1567: - mawiesne commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639520644 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Providing a "... loader/scan alternative which does not cover all edge cases and declare classgraph as an _optional_ dependency in this module." That is the way. Leaves the choice to the end-user and/or project on top of OpenNLP(-models). 
> OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854996#comment-17854996 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639507441 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Relying on the `java.class.path` property or trying to cast to `URLClassLoader` (on the current thread context class loader) is also error-prone, as not all vendors / class loaders extend it anymore or use the `Class-Path:` attribute in manifests under the Java 9+ module system. Implementing it with pure JDK classes from scratch within OpenNLP and covering all edge cases (as is done in [Classgraph](https://stackoverflow.com/a/31785767)) might be cumbersome and doesn't add much value just for the sake of reducing transitive libs here. 
What we could do as a compromise is to implement (in an additional PR) some sort of hacky loader/scan alternative which does not cover all edge cases and declare `classgraph` as an **optional** dependency in this module. People who don't want to use `classgraph` can use the hacky alternative (if it covers their use case) via `ucp` / reflection. People who need a more sophisticated solution can add `classgraph` as a dependency on their end and have the functionality available in all edge cases and environments (such as servlet containers, EE containers, Quarkus / Spring Boot environments, etc.) > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
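A rough sketch of what such a "hacky" plain-JDK alternative might look like, under the assumption that the model jars appear as ordinary entries in a classpath string such as the `java.class.path` system property. The class `SimpleClasspathJarFinder` is a hypothetical name and not part of OpenNLP. As the comment above warns, this deliberately ignores edge cases (custom class loaders, manifest `Class-Path:` entries, containers, the module path).

```java
import java.io.File;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of OpenNLP and not covering all edge cases. */
final class SimpleClasspathJarFinder {

  private SimpleClasspathJarFinder() {
  }

  /**
   * Returns the classpath entries whose leaf name matches the given glob,
   * e.g. "opennlp-models-*.jar". The classpath is passed in explicitly so
   * the method can be used with System.getProperty("java.class.path")
   * or with any other separator-delimited list of paths.
   */
  static List<Path> findJars(String classPath, String jarGlob) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + jarGlob);
    List<Path> hits = new ArrayList<>();
    for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
      if (entry.isEmpty()) {
        continue; // skip empty segments such as a trailing separator
      }
      Path path = Paths.get(entry);
      Path leaf = path.getFileName(); // null for a root such as "/"
      if (leaf != null && matcher.matches(leaf)) {
        hits.add(path);
      }
    }
    return hits;
  }
}
```

A caller might invoke it as `SimpleClasspathJarFinder.findJars(System.getProperty("java.class.path"), "opennlp-models-*.jar")` and then open each hit with `java.util.jar.JarFile` to look for model files. Whether this finds anything depends entirely on the runtime environment, which is the limitation discussed in this thread.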
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854995#comment-17854995 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639507441 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Relying on the `java.class.path` property or trying to cast to `URLClassLoader` (on the current thread context class loader) is also error-prone, as not all vendors / class loaders extend it anymore or use the `Class-Path:` attribute in manifests under the Java 9+ module system. Implementing it with pure JDK classes from scratch within OpenNLP and covering all edge cases (as is done in [Classgraph](https://stackoverflow.com/a/31785767)) might be cumbersome and doesn't add much value just for the sake of reducing transitive libs here. 
What we could do as a compromise is to implement (in an additional PR) some sort of hacky loader/scan alternative which does not cover all edge cases and declare `classgraph` as an **optional** dependency in this module. People who don't want to use `classgraph` can use the hacky alternative (if it covers their use case), and people who need a more sophisticated solution can add `classgraph` as a dependency on their end and have the functionality available in all edge cases and environments (such as servlet containers, EE containers, Quarkus / Spring Boot environments, etc.) > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854985#comment-17854985 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639490884 ## opennlp-tools-models/src/test/java/opennlp/tools/models/ClassPathModelFinderTest.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.util.Set; + +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertNotNull; + +public class ClassPathModelFinderTest { + + @Test + public void testFindOpenNLPModels() { +final ClasspathModelFinder finder = new ClasspathModelFinder(); Review Comment: Reduced code duplication by introducing inheritance, found an additional bug and fixed in latest commit. 
> OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854978#comment-17854978 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639441325 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Not without hacking the ucp (url) classloader via reflection which is error prone depending on the runtime context and might lead to issues in future JDK versions. For this reason, I prefer to stay with an external dependency here. 
> OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
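To illustrate why depending on a URL-based class loader is fragile, here is a small sketch that only reads URLs when the loader really is a `java.net.URLClassLoader` and otherwise reports that nothing is available. Since Java 9 the application class loader is no longer a `URLClassLoader`, so on current JDKs the empty branch is the common case unless reflection into JDK internals is used. The class `ClassLoaderUrls` is a hypothetical name, not part of OpenNLP.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; illustrates the limitation, not a recommended API. */
final class ClassLoaderUrls {

  private ClassLoaderUrls() {
  }

  /**
   * Returns the URLs of the given loader if it is a URLClassLoader,
   * or an empty Optional if the loader does not expose its search path.
   */
  static Optional<List<URL>> urlsOf(ClassLoader loader) {
    if (loader instanceof URLClassLoader) {
      return Optional.of(Arrays.asList(((URLClassLoader) loader).getURLs()));
    }
    // No public JDK API exposes the search path of other loaders.
    return Optional.empty();
  }
}
```

Calling `ClassLoaderUrls.urlsOf(Thread.currentThread().getContextClassLoader())` is therefore not guaranteed to yield a result, which matches the concern raised in the comment above.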
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854977#comment-17854977 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639473953 ## opennlp-tools-models/pom.xml: ## @@ -0,0 +1,95 @@ + + + + +http://maven.apache.org/POM/4.0.0; + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd;> +4.0.0 + +org.apache.opennlp +opennlp +2.3.4-SNAPSHOT + + +opennlp-tools-models +jar +Apache OpenNLP Tools Models + + + +org.apache.opennlp +opennlp-tools +${project.version} +provided + + + +io.github.classgraph +classgraph +${classgraph.version} + + + +org.slf4j +slf4j-api + + + +org.junit.jupiter +junit-jupiter-api +test + + + +org.junit.jupiter +junit-jupiter-engine Review Comment: engine is needed, params can be removed. > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854968#comment-17854968 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639453973 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; +import io.github.classgraph.ResourceList; +import io.github.classgraph.ScanResult; + + +/** + * Enables the detection of OpenNLP models in the classpath. + */ +public class ClasspathModelFinder { + + private static final String OPENNLP_MODEL_JAR_PREFIX = "opennlp-models-*.jar"; + private final String jarModelPrefix; + private Set models; + + /** + * By default, it scans for "opennlp-models-*.jar". 
+ */ + public ClasspathModelFinder() { +this(OPENNLP_MODEL_JAR_PREFIX); + } + + /** + * @param modelJarPrefix The leafnames of the jars that should be canned (e.g. "opennlp.jar"). + * May contain a wildcard glob ("opennlp-*.jar"). It must not be {@code null}. + */ + public ClasspathModelFinder(String modelJarPrefix) { +Objects.requireNonNull(modelJarPrefix, "modelJarPrefix must not be null"); +this.jarModelPrefix = modelJarPrefix; + } + + /** + * Finds OpenNLP models within the classpath. + * + * @param reloadCache {@code true}, if the internal cache should explicitly be reloaded + * @return A Set of {@link ClassPathModelEntry ClassPathModelEntries}. It might be empty. + */ + public Set findModels(boolean reloadCache) { + +if (this.models == null || reloadCache) { + try (ScanResult sr = new ClassGraph().acceptJars(jarModelPrefix).disableDirScanning().scan()) { Review Comment: This is discussed in the comment above. Resolving this extra conversation. > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854967#comment-17854967 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639441325 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: Not without hacking the ucp (url) classloader via reflection which is error prone depending on the runtime context and might lead to issues in future JDK versions. > OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854961#comment-17854961 ] ASF GitHub Bot commented on OPENNLP-1567: - mawiesne commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1639398105 ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; Review Comment: I'm not against classgraph as a library. Nevertheless: is there an easy way to scan/look for model files in the classpath _without_ introducing this 3rd party dependency? Aka: Can we realize it with plain JDK classes? wdyt @jzonthemtn @kinow @rzo1 ? ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.models; + +import java.net.URI; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; + +import io.github.classgraph.ClassGraph; +import io.github.classgraph.ResourceList; +import io.github.classgraph.ScanResult; + + +/** + * Enables the detection of OpenNLP models in the classpath. + */ +public class ClasspathModelFinder { + + private static final String OPENNLP_MODEL_JAR_PREFIX = "opennlp-models-*.jar"; + private final String jarModelPrefix; + private Set models; + + /** + * By default, it scans for "opennlp-models-*.jar". + */ + public ClasspathModelFinder() { +this(OPENNLP_MODEL_JAR_PREFIX); + } + + /** + * @param modelJarPrefix The leafnames of the jars that should be canned (e.g. "opennlp.jar"). + * May contain a wildcard glob ("opennlp-*.jar"). It must not be {@code null}. + */ + public ClasspathModelFinder(String modelJarPrefix) { +Objects.requireNonNull(modelJarPrefix, "modelJarPrefix must not be null"); +this.jarModelPrefix = modelJarPrefix; + } + + /** + * Finds OpenNLP models within the classpath. + * + * @param reloadCache {@code true}, if the internal cache should explicitly be reloaded + * @return A Set of {@link ClassPathModelEntry ClassPathModelEntries}. 
It might be empty. + */ + public Set findModels(boolean reloadCache) { + +if (this.models == null || reloadCache) { + try (ScanResult sr = new ClassGraph().acceptJars(jarModelPrefix).disableDirScanning().scan()) { Review Comment: See comment above on introducing ClassGraph as a dependency. ## opennlp-tools-models/src/main/java/opennlp/tools/models/ClasspathModelFinder.java: ## @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854154#comment-17854154 ] ASF GitHub Bot commented on OPENNLP-1567: - kinow commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1635321056 ## LICENSE: ## @@ -303,3 +303,27 @@ The following license applies to the SLF4J API: LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +The following license applies to classgraph: + +The MIT License (MIT) + +Copyright (c) 2019 Luke Hutchison + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. Review Comment: I am not sure, but probably doesn't hurt having it listed in NOTICE. 
> OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854112#comment-17854112 ] ASF GitHub Bot commented on OPENNLP-1567: - kinow commented on code in PR #606: URL: https://github.com/apache/opennlp/pull/606#discussion_r1635163209 ## LICENSE: ## @@ -303,3 +303,27 @@ The following license applies to the SLF4J API: LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +The following license applies to classgraph: + +The MIT License (MIT) + +Copyright (c) 2019 Luke Hutchison + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. Review Comment: Isn't this change required only in NOTICE? 
> OpenNLP Models: Provide a Finder / Loader Implementation > > > Key: OPENNLP-1567 > URL: https://issues.apache.org/jira/browse/OPENNLP-1567 > Project: OpenNLP > Issue Type: New Feature >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > as the title says -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1567) OpenNLP Models: Provide a Finder / Loader Implementation
[ https://issues.apache.org/jira/browse/OPENNLP-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853993#comment-17853993 ] ASF GitHub Bot commented on OPENNLP-1567: - rzo1 opened a new pull request, #606: URL: https://github.com/apache/opennlp/pull/606 Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [x] Have you written or updated unit tests to verify your changes? - [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [x] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [x] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: This adds a proof-of-concept implementation as a separate module (due to the additional dependency for classpath scanning) to load models from JAR files. 
If we are fine with the design, we can do a first model release afterwards ;-)
[jira] [Commented] (OPENNLP-1566) Array writing error in code example
[ https://issues.apache.org/jira/browse/OPENNLP-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853736#comment-17853736 ] ASF GitHub Bot commented on OPENNLP-1566: - jzonthemtn commented on PR #605: URL: https://github.com/apache/opennlp/pull/605#issuecomment-2158853268 Hi @shellrean, thanks for the pull request! > Array writing error in code example > --- > > Key: OPENNLP-1566 > URL: https://issues.apache.org/jira/browse/OPENNLP-1566 > Project: OpenNLP > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.3 >Reporter: shellrean >Priority: Minor > Fix For: 2.3.4 > > Attachments: Screenshot 2024-06-10 at 23.31.20.png > > Original Estimate: 12h > Remaining Estimate: 12h > > There was an error in writing the array symbol in the documentation, it > should be String[] but what is displayed here is String variable[]. > !Screenshot 2024-06-10 at 23.31.20.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1566) Array writing error in code example
[ https://issues.apache.org/jira/browse/OPENNLP-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853732#comment-17853732 ] ASF GitHub Bot commented on OPENNLP-1566: - shellrean opened a new pull request, #605: URL: https://github.com/apache/opennlp/pull/605 (no comment)
[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change
[ https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850818#comment-17850818 ] ASF GitHub Bot commented on OPENNLP-1564: - rzo1 commented on PR #604: URL: https://github.com/apache/opennlp/pull/604#issuecomment-2140460193 Thx @jzonthemtn - next automatic eval run should be fine again. Sadly, cannot manually trigger atm (due to the jenkins outage) > Fix Evaluation Tests after POSFormat Change > --- > > Key: OPENNLP-1564 > URL: https://issues.apache.org/jira/browse/OPENNLP-1564 > Project: OpenNLP > Issue Type: Task >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > ERROR] Failures: > [ERROR] ConllXPosTaggerEval.evalDanishMaxentGis:133->eval:79 expected: > <0.9504442925495558> but was: <0.0> > [ERROR] ConllXPosTaggerEval.evalDanishMaxentQn:144->eval:79 expected: > <0.9564251537935748> but was: <0.0> > [ERROR] ConllXPosTaggerEval.evalSwedishMaxentGis:200->eval:79 expected: > <0.9248585572842999> but was: <0.0> > [ERROR] ConllXPosTaggerEval.evalSwedishMaxentQn:211->eval:79 expected: > <0.9377652050919377> but was: <0.0> > [ERROR] OntoNotes4NameFinderEval.evalAllTypesWithPOSNameFinder:152 > expected: <0.8070226153653437> but was: <0.8060021642728011> > [ERROR] OntoNotes4PosTaggerEval.evalEnglishMaxentTagger:79->crossEval:65 > expected: <0.969345319453096> but was: <1.5460660461489513E-4> > [ERROR] SourceForgeModelEval.evalChunkerModel:344 expected: > <304922886851384639120257052245406261332> but was: > <85416056838725341441074840387786758951> > [ERROR] SourceForgeModelEval.evalMaxentModel:376->evalPosModel:368 > expected: <231995214522232523777090597594904492687> but was: > <90112530006040278703441476599716290769> > [ERROR] SourceForgeModelEval.evalPerceptronModel:384->evalPosModel:368 > expected: <209440430718727101220960491543652921728> but was: > <256369615778494816584749939613105809001> > [INFO] > [ERROR] Tests run: 1040, Failures: 9, Errors: 0, 
Skipped: 3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change
[ https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850819#comment-17850819 ] ASF GitHub Bot commented on OPENNLP-1564: - rzo1 merged PR #604: URL: https://github.com/apache/opennlp/pull/604
[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change
[ https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850817#comment-17850817 ] ASF GitHub Bot commented on OPENNLP-1564: - rzo1 opened a new pull request, #604: URL: https://github.com/apache/opennlp/pull/604 This fixes ```bash [ERROR] opennlp.tools.eval.SourceForgeModelEval.evalChunkerModel ```
[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change
[ https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850571#comment-17850571 ] ASF GitHub Bot commented on OPENNLP-1564: - rzo1 merged PR #603: URL: https://github.com/apache/opennlp/pull/603
[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change
[ https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850570#comment-17850570 ] ASF GitHub Bot commented on OPENNLP-1564: - rzo1 commented on PR #603: URL: https://github.com/apache/opennlp/pull/603#issuecomment-2138684864 Thanks for review. I will merge, so it gets picked up in today's scheduled eval run. Will close Jira, If the run is green (otherwise, debug session )
[jira] [Commented] (OPENNLP-1564) Fix Evaluation Tests after POSFormat Change
[ https://issues.apache.org/jira/browse/OPENNLP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850471#comment-17850471 ] ASF GitHub Bot commented on OPENNLP-1564: - rzo1 opened a new pull request, #603: URL: https://github.com/apache/opennlp/pull/603 Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [ ] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [x] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Updates some missed `PENN` format cases in the evaluation test data. In addition, if `POSTaggerNameFeatureGenerator` is used for NameFinder (i.e. defined via XML), we only get a POSModel and need to guess the format of it before creating the `POSTagger`. Therefore, I updated the mapper with a static method for guessing this type based on a given POSModel. 
This might be useful for other cases as well. Sadly, ASF Jenkins is currently broken, so we cannot trigger a manual run, see https://issues.apache.org/jira/browse/INFRA-25828 > Fix Evaluation Tests after POSFormat Change > --- > > Key: OPENNLP-1564 > URL: https://issues.apache.org/jira/browse/OPENNLP-1564 > Project: OpenNLP > Issue Type: Task >Reporter: Richard Zowalla >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.3.4, 2.4.0 > > > ERROR] Failures: > [ERROR] ConllXPosTaggerEval.evalDanishMaxentGis:133->eval:79 expected: > <0.9504442925495558> but was: <0.0> > [ERROR] ConllXPosTaggerEval.evalDanishMaxentQn:144->eval:79 expected: > <0.9564251537935748> but was: <0.0> > [ERROR] ConllXPosTaggerEval.evalSwedishMaxentGis:200->eval:79 expected: > <0.9248585572842999> but was: <0.0> > [ERROR] ConllXPosTaggerEval.evalSwedishMaxentQn:211->eval:79 expected: > <0.9377652050919377> but was: <0.0> > [ERROR] OntoNotes4NameFinderEval.evalAllTypesWithPOSNameFinder:152 > expected: <0.8070226153653437> but was: <0.8060021642728011> > [ERROR] OntoNotes4PosTaggerEval.evalEnglishMaxentTagger:79->crossEval:65 > expected: <0.969345319453096> but was: <1.5460660461489513E-4> > [ERROR] SourceForgeModelEval.evalChunkerModel:344 expected: > <304922886851384639120257052245406261332> but was: > <85416056838725341441074840387786758951> > [ERROR] SourceForgeModelEval.evalMaxentModel:376->evalPosModel:368 > expected: <231995214522232523777090597594904492687> but was: > <90112530006040278703441476599716290769> > [ERROR] SourceForgeModelEval.evalPerceptronModel:384->evalPosModel:368 > expected: <209440430718727101220960491543652921728> but was: > <256369615778494816584749939613105809001> > [INFO] > [ERROR] Tests run: 1040, Failures: 9, Errors: 0, Skipped: 3 -- This message was sent by Atlassian Jira (v8.20.10#820010)
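The format guessing mentioned in the PR note above can be pictured with a small, self-contained sketch. This is not the method added in PR #603; the class name `TagFormatGuessSketch`, the enum, the Penn subset, and the heuristic itself are assumptions made for illustration. It only assumes that a model's output tags are available as a set of strings.

```java
import java.util.Set;

// Illustrative sketch only: a hypothetical heuristic, not the OpenNLP implementation.
public class TagFormatGuessSketch {

  enum Format { UD, PENN, UNKNOWN }

  // The 17 universal POS tags defined by Universal Dependencies.
  static final Set<String> UD_TAGS = Set.of(
      "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
      "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X");

  // A subset of Penn Treebank tags that do not occur in the UD tag set (assumed marker tags).
  static final Set<String> PENN_ONLY_TAGS = Set.of(
      "NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
      "JJ", "JJR", "JJS", "RB", "RBR", "RBS", "DT", "IN", "CC", "CD", "PRP", "MD");

  // Guess the tag format from the set of tags a model can emit.
  static Format guess(Set<String> modelTags) {
    if (modelTags.isEmpty()) {
      return Format.UNKNOWN;
    }
    if (UD_TAGS.containsAll(modelTags)) {
      return Format.UD;
    }
    for (String tag : modelTags) {
      if (PENN_ONLY_TAGS.contains(tag)) {
        return Format.PENN;
      }
    }
    return Format.UNKNOWN;
  }

  public static void main(String[] args) {
    System.out.println(guess(Set.of("NOUN", "VERB", "PUNCT")));
    System.out.println(guess(Set.of("NN", "VBZ", ".")));
  }
}
```

A tag such as `SYM` exists in both formats, which is why the sketch checks the whole tag set rather than a single tag.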
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850265#comment-17850265 ] ASF GitHub Bot commented on OPENNLP-1539: - mawiesne merged PR #601: URL: https://github.com/apache/opennlp/pull/601 > Introduce parameter for POSTaggerME to configure output POS tag format > -- > > Key: OPENNLP-1539 > URL: https://issues.apache.org/jira/browse/OPENNLP-1539 > Project: OpenNLP > Issue Type: Improvement > Components: POS Tagger >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Martin Wiesner >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.4.0 > > > [Classic (legacy) POS models|https://opennlp.sourceforge.net/models-1.5/] > output tags in the [PENN Treebank POS > tag|https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html] > format. > The modern UD-based models, however, differ in the [longer output > format|https://universaldependencies.org/u/pos/], e.g. "VB" (Penn) vs. "VERB" > (UD). Extended (UD) word features are covered here: > https://universaldependencies.org/u/feat/index.html > This difference results in mismatches and will cause existing IT / tests to > fail, if executed. Luckily, a mapping table is found here: > https://universaldependencies.org/tagset-conversion/en-penn-uposf.html > To provide compatibility for existing applications and/or use-cases, we need > to provide a way to retrieve both POS formats. > Aims: > - Introduce a constructor parameter for POSTaggerME to configure tag format / > style: Penn or UD style > - Implement a mapping between both POS tag formats: UD <==> Penn > - Update the OpenNLP Manual to explain differences of POS tag format and > configuration parameter > Conceptual idea: > - {{new POSTaggerME("en")}} => by _default_: UD format "as is" > - {{new POSTaggerME("en", POSTagFormat.PENN)}} => by _intention_, here: Penn > style > Benefit: > 1. It should be explicit so devs / user see what they will get via > {{POSTagFormat}}. 
Enum values: POSTagFormat.UD, POSTagFormat.PENN, > POSTagFormat.DEFAULT > 2. IT tests can now be formulated to work on both modern and legacy models. -- This message was sent by Atlassian Jira (v8.20.10#820010)
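To make the "UD <==> Penn" mapping aim in the issue above concrete, here is a minimal sketch of a Penn-to-UD lookup and of why the reverse direction is ambiguous. It is not the `POSTaggerME` or `POSTagFormat` API; the class `TagFormatSketch`, its methods, and the fallback to `X` for unknown tags are assumptions for illustration. The handful of entries follow the conversion table linked in the issue.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: a hypothetical class, not part of OpenNLP.
public class TagFormatSketch {

  // Small excerpt of a Penn -> UD table; the real table has many more entries.
  static final Map<String, String> PENN_TO_UD = new LinkedHashMap<>();

  static {
    PENN_TO_UD.put("VB", "VERB");
    PENN_TO_UD.put("VBD", "VERB");
    PENN_TO_UD.put("VBZ", "VERB");
    PENN_TO_UD.put("NN", "NOUN");
    PENN_TO_UD.put("NNP", "PROPN");
    PENN_TO_UD.put("JJ", "ADJ");
  }

  // Convert one Penn tag to UD; unknown tags fall back to "X" (an assumption of this sketch).
  static String toUd(String pennTag) {
    return PENN_TO_UD.getOrDefault(pennTag, "X");
  }

  // Group Penn tags by their UD tag to show that the back conversion is not unique.
  static Map<String, List<String>> udToPennCandidates() {
    return PENN_TO_UD.entrySet().stream()
        .collect(Collectors.groupingBy(
            Map.Entry::getValue,
            LinkedHashMap::new,
            Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
  }

  public static void main(String[] args) {
    System.out.println(toUd("VBD"));
    System.out.println(udToPennCandidates().get("VERB"));
  }
}
```

Several Penn tags collapse onto one UD tag, so a UD-to-Penn mapping has to pick a single representative and loses the finer distinction.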
[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850101#comment-17850101 ] ASF GitHub Bot commented on OPENNLP-1563: - jzonthemtn commented on PR #602: URL: https://github.com/apache/opennlp/pull/602#issuecomment-2135699051 Thanks @demq! > SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters > - > > Key: OPENNLP-1563 > URL: https://issues.apache.org/jira/browse/OPENNLP-1563 > Project: OpenNLP > Issue Type: Bug > Components: Tokenizer >Affects Versions: 2.3.3 >Reporter: Hrayr Matevosyan >Priority: Major > Fix For: 2.3.4 > > > The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes > words containing non-spacing letters. For example, the Arabic word "طُوّر" > gets tokenized to individual letters ["ط", "ُ", "و", "ّ", "ر"]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
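The bug described above comes down to character classification: combining marks such as Arabic Damma (U+064F) are not letters, so a tokenizer that groups characters by "letter or not" will cut a word at each mark. The sketch below is not the `SimpleTokenizer` code or the fix from PR #602; `NonSpacingMarkDemo` and its helpers are hypothetical, and the tokenizer is deliberately simplified (letters and marks form words, every other non-whitespace character becomes its own token).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a simplified tokenizer, not the OpenNLP implementation.
public class NonSpacingMarkDemo {

  // Treat letters and Unicode combining marks as belonging to the same word.
  static boolean isWordChar(char c) {
    int type = Character.getType(c);
    return Character.isLetter(c)
        || type == Character.NON_SPACING_MARK
        || type == Character.ENCLOSING_MARK
        || type == Character.COMBINING_SPACING_MARK;
  }

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (isWordChar(c)) {
        current.append(c);
      } else {
        // A non-word character ends the current word, if any.
        if (current.length() > 0) {
          tokens.add(current.toString());
          current.setLength(0);
        }
        // Whitespace is dropped; anything else becomes a single-character token.
        if (!Character.isWhitespace(c)) {
          tokens.add(String.valueOf(c));
        }
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // The word from the issue, written with escapes: U+0637 U+064F U+0648 U+0651 U+0631.
    List<String> tokens = tokenize("\u0637\u064F\u0648\u0651\u0631");
    System.out.println(tokens.size());
  }
}
```

With the mark categories included in `isWordChar`, the word from the issue stays in one piece; dropping those three checks reproduces the letter-by-letter split reported in the ticket.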
[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849994#comment-17849994 ] ASF GitHub Bot commented on OPENNLP-1563: - rzo1 merged PR #602: URL: https://github.com/apache/opennlp/pull/602
[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849993#comment-17849993 ] ASF GitHub Bot commented on OPENNLP-1563: - mawiesne commented on PR #602: URL: https://github.com/apache/opennlp/pull/602#issuecomment-2134979949 Thx @demq !
[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849906#comment-17849906 ] ASF GitHub Bot commented on OPENNLP-1563: - demq commented on code in PR #602: URL: https://github.com/apache/opennlp/pull/602#discussion_r1616698074 ## opennlp-tools/src/test/java/opennlp/tools/tokenize/SimpleTokenizerTest.java: ## @@ -128,4 +128,18 @@ void testTokenizationOfStringWithWindowsNewLineTokens() { Assertions.assertArrayEquals(new String[] {"a", "\r", "\n", "\r", "\n", "b", "\r", "\n", "\r", "\n", "c"}, tokenizer.tokenize("a\r\n\r\n b\r\n\r\n c")); } + + /** + * Tests if it can tokenize a word containing a non-spacing character + * like Arabic Damma Unicode Character “◌ُ” (U+064F) + */ + @Test + void testNonSpacingLetters() { +String text = "طُوّر"; Review Comment: I have just pushed an update with a full sentence.
[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849899#comment-17849899 ] ASF GitHub Bot commented on OPENNLP-1563: - rzo1 commented on code in PR #602: URL: https://github.com/apache/opennlp/pull/602#discussion_r1616664137 ## opennlp-tools/src/test/java/opennlp/tools/tokenize/SimpleTokenizerTest.java: ## @@ -128,4 +128,18 @@ void testTokenizationOfStringWithWindowsNewLineTokens() { Assertions.assertArrayEquals(new String[] {"a", "\r", "\n", "\r", "\n", "b", "\r", "\n", "\r", "\n", "c"}, tokenizer.tokenize("a\r\n\r\n b\r\n\r\n c")); } + + /** + * Tests if it can tokenize a word containing a non-spacing character + * like Arabic Damma Unicode Character “◌ُ” (U+064F) + */ + @Test + void testNonSpacingLetters() { +String text = "طُوّر"; Review Comment: @demq Can we have a full sentence example here?
[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849892#comment-17849892 ] ASF GitHub Bot commented on OPENNLP-1563: - demq opened a new pull request, #602: URL: https://github.com/apache/opennlp/pull/602 Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP- where is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [x] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [x] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible. 
> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters > - > > Key: OPENNLP-1563 > URL: https://issues.apache.org/jira/browse/OPENNLP-1563 > Project: OpenNLP > Issue Type: Bug > Components: Tokenizer >Affects Versions: 2.3.3 >Reporter: Hrayr Matevosyan >Priority: Major > > The tokenizerPos implementation of SimpleTokenizer incorrectly tokenizes > words containing non-spacing letters. For example, the Arabic word "طُوّر" > gets tokenized to individual letters ["ط", "ُ", "و", "ّ", "ر"]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849223#comment-17849223 ] ASF GitHub Bot commented on OPENNLP-1539: - rzo1 commented on PR #601: URL: https://github.com/apache/opennlp/pull/601#issuecomment-2128979449 > @rzo1 Did you run into any weirdness with the older Sourceforge models? Added tests. They look good. > Introduce parameter for POSTaggerME to configure output POS tag format > -- > > Key: OPENNLP-1539 > URL: https://issues.apache.org/jira/browse/OPENNLP-1539 > Project: OpenNLP > Issue Type: Improvement > Components: POS Tagger >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Martin Wiesner >Assignee: Richard Zowalla >Priority: Major > Fix For: 2.4.0 > > > [Classic (legacy) POS models|https://opennlp.sourceforge.net/models-1.5/] > output tags in the [PENN Treebank POS > tag|https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html] > format. > The modern UD-based models, however, differ in the [longer output > format|https://universaldependencies.org/u/pos/], e.g. "VB" (Penn) vs. "VERB" > (UD). Extended (UD) word features are covered here: > https://universaldependencies.org/u/feat/index.html > This difference results in mismatches and will cause existing IT / tests to > fail, if executed. Luckily, a mapping table is found here: > https://universaldependencies.org/tagset-conversion/en-penn-uposf.html > To provide compatibility for existing applications and/or use-cases, we need > to provide a way to retrieve both POS formats. 
> Aims: > - Introduce a constructor parameter for POSTaggerME to configure tag format / > style: Penn or UD style > - Implement a mapping between both POS tag formats: UD <==> Penn > - Update the OpenNLP Manual to explain differences of POS tag format and > configuration parameter > Conceptual idea: > - {{new POSTaggerME("en")}} => by _default_: UD format "as is" > - {{new POSTaggerME("en", POSTagFormat.PENN)}} => by _intention_, here: Penn > style > Benefit: > 1. It should be explicit so devs / user see what they will get via > {{POSTagFormat}}. Enum values: POSTagFormat.UD, POSTagFormat.PENN, > POSTagFormat.DEFAULT > 2. IT tests can now be formulated to work on both modern and legacy models. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849194#comment-17849194 ] ASF GitHub Bot commented on OPENNLP-1539: - mawiesne commented on code in PR #601: URL: https://github.com/apache/opennlp/pull/601#discussion_r1612947718 ## opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java: ## @@ -0,0 +1,207 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.postag;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A mapping implementation for converting between different POS tag formats.
+ * This class supports conversion between Penn Treebank (PENN) and Universal Dependencies (UD) formats.
+ * The conversion is based on the <a href="https://universaldependencies.org/tagset-conversion/en-penn-uposf.html">Universal Dependencies conversion table</a>.
+ * Please note that when converting from UD to Penn format, there may be ambiguity in some cases.
+ */
+public class POSTagFormatMapper {
+
+  private static final Logger logger = LoggerFactory.getLogger(POSTagFormatMapper.class);
+
+  private static final Map<String, String> CONVERSION_TABLE_PENN_TO_UD = new HashMap<>();
+  private static final Map<String, String> CONVERSION_TABLE_UD_TO_PENN = new HashMap<>();
+
+  static {
+    /*
+     * This is a conversion table to convert PENN to UD format as described in
+     * https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
+     */
+    CONVERSION_TABLE_PENN_TO_UD.put("#", "SYM");
+    CONVERSION_TABLE_PENN_TO_UD.put("$", "SYM");
+    CONVERSION_TABLE_PENN_TO_UD.put("''", "PUNCT");
+    CONVERSION_TABLE_PENN_TO_UD.put(",", "PUNCT");
+    CONVERSION_TABLE_PENN_TO_UD.put("-LRB-", "PUNCT");
+    CONVERSION_TABLE_PENN_TO_UD.put("-RRB-", "PUNCT");
+    CONVERSION_TABLE_PENN_TO_UD.put(".", "PUNCT");
+    CONVERSION_TABLE_PENN_TO_UD.put(":", "PUNCT");
+    CONVERSION_TABLE_PENN_TO_UD.put("AFX", "ADJ");
+    CONVERSION_TABLE_PENN_TO_UD.put("CC", "CCONJ");
+    CONVERSION_TABLE_PENN_TO_UD.put("CD", "NUM");
+    CONVERSION_TABLE_PENN_TO_UD.put("DT", "DET");
+    CONVERSION_TABLE_PENN_TO_UD.put("EX", "PRON");
+    CONVERSION_TABLE_PENN_TO_UD.put("FW", "X");
+    CONVERSION_TABLE_PENN_TO_UD.put("HYPH", "PUNCT");
+    CONVERSION_TABLE_PENN_TO_UD.put("IN", "ADP");
+    CONVERSION_TABLE_PENN_TO_UD.put("JJ", "ADJ");
+    CONVERSION_TABLE_PENN_TO_UD.put("JJR", "ADJ");
+    CONVERSION_TABLE_PENN_TO_UD.put("JJS", "ADJ");
+    CONVERSION_TABLE_PENN_TO_UD.put("LS", "X");
+    CONVERSION_TABLE_PENN_TO_UD.put("MD", "VERB");
+    CONVERSION_TABLE_PENN_TO_UD.put("NIL", "X");
+    CONVERSION_TABLE_PENN_TO_UD.put("NN", "NOUN");
+    CONVERSION_TABLE_PENN_TO_UD.put("NNP", "PROPN");
+    CONVERSION_TABLE_PENN_TO_UD.put("NNPS", "PROPN");
+    CONVERSION_TABLE_PENN_TO_UD.put("NNS", "NOUN");
+    CONVERSION_TABLE_PENN_TO_UD.put("PDT", "DET");
+    CONVERSION_TABLE_PENN_TO_UD.put("POS", "PART");
+    CONVERSION_TABLE_PENN_TO_UD.put("PRP", "PRON");
+    CONVERSION_TABLE_PENN_TO_UD.put("PRP$", "DET");
+    CONVERSION_TABLE_PENN_TO_UD.put("RB", "ADV");
+    CONVERSION_TABLE_PENN_TO_UD.put("RBR", "ADV");
+    CONVERSION_TABLE_PENN_TO_UD.put("RBS", "ADV");
+    CONVERSION_TABLE_PENN_TO_UD.put("RP", "ADP");
+    CONVERSION_TABLE_PENN_TO_UD.put("SYM", "SYM");
+    CONVERSION_TABLE_PENN_TO_UD.put("TO", "PART");
+    CONVERSION_TABLE_PENN_TO_UD.put("UH", "INTJ");
+    CONVERSION_TABLE_PENN_TO_UD.put("VB", "VERB");
+    CONVERSION_TABLE_PENN_TO_UD.put("VBD", "VERB");
+    CONVERSION_TABLE_PENN_TO_UD.put("VBG", "VERB");
+    CONVERSION_TABLE_PENN_TO_UD.put("VBN", "VERB");
+    CONVERSION_TABLE_PENN_TO_UD.put("VBP", "VERB");
+    CONVERSION_TABLE_PENN_TO_UD.put("VBZ", "VERB");
+    CONVERSION_TABLE_PENN_TO_UD.put("WDT", "DET");
+    CONVERSION_TABLE_PENN_TO_UD.put("WP", "PRON");
+    CONVERSION_TABLE_PENN_TO_UD.put("WP$", "DET");
+    CONVERSION_TABLE_PENN_TO_UD.put("WRB", "ADV");
+
+    /*
+     * Note: The back conversion might lose information.
+     */
+    CONVERSION_TABLE_UD_TO_PENN.put("ADJ", "JJ");
+
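The hunk's note that the back conversion might lose information follows from the PENN-to-UD table being many-to-one. A minimal sketch of that point, not the PR's code: the class name `BackConversionSketch` and the `invert` helper are hypothetical, and only a handful of entries copied from the table in the hunk are used. Inverting the map shows how many Penn candidates a single UD tag can have.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of OpenNLP.
final class BackConversionSketch {

  // A small excerpt of the PENN -> UD entries quoted in the hunk.
  static final Map<String, String> PENN_TO_UD = new LinkedHashMap<>();

  static {
    PENN_TO_UD.put("AFX", "ADJ");
    PENN_TO_UD.put("JJ", "ADJ");
    PENN_TO_UD.put("JJR", "ADJ");
    PENN_TO_UD.put("JJS", "ADJ");
    PENN_TO_UD.put("NN", "NOUN");
    PENN_TO_UD.put("NNS", "NOUN");
    PENN_TO_UD.put("UH", "INTJ");
  }

  // Groups Penn tags by the UD tag they map to. Any UD tag with more than
  // one candidate cannot be converted back without choosing a representative.
  static Map<String, List<String>> invert(Map<String, String> pennToUd) {
    Map<String, List<String>> candidates = new TreeMap<>();
    for (Map.Entry<String, String> e : pennToUd.entrySet()) {
      candidates.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
    }
    return candidates;
  }

  public static void main(String[] args) {
    invert(PENN_TO_UD).forEach((ud, penn) -> System.out.println(ud + " <- " + penn));
  }
}
```

A UD-to-Penn table such as the one the hunk starts to fill in has to pick one of these candidates per UD tag (the hunk picks "JJ" for "ADJ"), so comparative and superlative distinctions like JJR and JJS are not recoverable after a round trip.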
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849174#comment-17849174 ] ASF GitHub Bot commented on OPENNLP-1539: - rzo1 commented on code in PR #601: URL: https://github.com/apache/opennlp/pull/601#discussion_r1612885912 ## opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java: ## @@ -0,0 +1,207 @@
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849172#comment-17849172 ] ASF GitHub Bot commented on OPENNLP-1539: - rzo1 commented on code in PR #601: URL: https://github.com/apache/opennlp/pull/601#discussion_r1612883541 ## opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java: ## @@ -0,0 +1,207 @@
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849158#comment-17849158 ] ASF GitHub Bot commented on OPENNLP-1539: - mawiesne commented on PR #601: URL: https://github.com/apache/opennlp/pull/601#issuecomment-2128552371
Thx @rzo1 - I left some comments to improve clarity of the doc for the changes in the API.

> Introduce parameter for POSTaggerME to configure output POS tag format
> --
>
> Key: OPENNLP-1539
> URL: https://issues.apache.org/jira/browse/OPENNLP-1539
> Project: OpenNLP
> Issue Type: Improvement
> Components: POS Tagger
> Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Reporter: Martin Wiesner
> Assignee: Richard Zowalla
> Priority: Major
> Fix For: 2.4.0
>
>
> [Classic (legacy) POS models|https://opennlp.sourceforge.net/models-1.5/]
> output tags in the [PENN Treebank POS
> tag|https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html]
> format.
> The modern UD-based models, however, differ in the [longer output
> format|https://universaldependencies.org/u/pos/], e.g. "VB" (Penn) vs. "VERB"
> (UD). Extended (UD) word features are covered here:
> https://universaldependencies.org/u/feat/index.html
> This difference results in mismatches and will cause existing IT / tests to
> fail, if executed. Luckily, a mapping table is found here:
> https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
> To provide compatibility for existing applications and/or use-cases, we need
> to provide a way to retrieve both POS formats.
> Aims:
> - Introduce a constructor parameter for POSTaggerME to configure tag format /
> style: Penn or UD style
> - Implement a mapping between both POS tag formats: UD <==> Penn
> - Update the OpenNLP Manual to explain differences of POS tag format and
> configuration parameter
> Conceptual idea:
> - {{new POSTaggerME("en")}} => by _default_: UD format "as is"
> - {{new POSTaggerME("en", POSTagFormat.PENN)}} => by _intention_, here: Penn
> style
> Benefit:
> 1. It should be explicit so devs / user see what they will get via
> {{POSTagFormat}}. Enum values: POSTagFormat.UD, POSTagFormat.PENN,
> POSTagFormat.DEFAULT
> 2. IT tests can now be formulated to work on both modern and legacy models.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
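The conceptual idea in the issue description (one tagger, with the output format chosen explicitly by the caller) could look roughly like the following. This is only a sketch under stated assumptions, not the API introduced by the PR: `TagFormatSketch`, `OutputFormat` and `present` are hypothetical names, the input is assumed to come from a legacy model that emits Penn tags, the table holds only a few pairs taken from the mapping quoted in the review, and leaving unknown tags unchanged is an arbitrary choice made for the illustration.

```java
import java.util.Map;

// Hypothetical illustration class; it is not part of OpenNLP.
final class TagFormatSketch {

  // Stand-in for the format switch described in the issue.
  enum OutputFormat { UD, PENN }

  // A few PENN -> UD pairs taken from the conversion table quoted in the review.
  private static final Map<String, String> PENN_TO_UD = Map.of(
      "DT", "DET",
      "JJ", "ADJ",
      "NN", "NOUN",
      "VB", "VERB",
      "VBZ", "VERB");

  // Returns the tags of an (assumed) Penn-style model in the requested format.
  // Assumption: tags missing from the table are passed through unchanged.
  static String[] present(String[] pennTags, OutputFormat wanted) {
    String[] out = pennTags.clone();
    if (wanted == OutputFormat.UD) {
      for (int i = 0; i < out.length; i++) {
        out[i] = PENN_TO_UD.getOrDefault(out[i], out[i]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String[] legacy = {"DT", "NN", "VBZ"};
    System.out.println(String.join(" ", present(legacy, OutputFormat.UD)));
    System.out.println(String.join(" ", present(legacy, OutputFormat.PENN)));
  }
}
```

Making the format an explicit argument is what lets the same test be written against both legacy and modern models, which is the second benefit the description lists.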
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849157#comment-17849157 ] ASF GitHub Bot commented on OPENNLP-1539: - mawiesne commented on code in PR #601: URL: https://github.com/apache/opennlp/pull/601#discussion_r1612745224 ## opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java: ## @@ -0,0 +1,207 @@
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849055#comment-17849055 ] ASF GitHub Bot commented on OPENNLP-1539: - jzonthemtn commented on code in PR #601: URL: https://github.com/apache/opennlp/pull/601#discussion_r1612105341 ## opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java: ## @@ -0,0 +1,207 @@
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849016#comment-17849016 ] ASF GitHub Bot commented on OPENNLP-1539: - rzo1 commented on code in PR #601: URL: https://github.com/apache/opennlp/pull/601#discussion_r1611936275 ## opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java: ## @@ -0,0 +1,207 @@
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849008#comment-17849008 ] ASF GitHub Bot commented on OPENNLP-1539: - rzo1 commented on PR #601: URL: https://github.com/apache/opennlp/pull/601#issuecomment-2127421991 Need to add a test for it ;-)
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849006#comment-17849006 ] ASF GitHub Bot commented on OPENNLP-1539: - jzonthemtn commented on PR #601: URL: https://github.com/apache/opennlp/pull/601#issuecomment-2127416229 @rzo1 Did you run into any weirdness with the older Sourceforge models?
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849000#comment-17849000 ] ASF GitHub Bot commented on OPENNLP-1539: - jzonthemtn commented on code in PR #601: URL: https://github.com/apache/opennlp/pull/601#discussion_r1611890953 ## opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java: ## @@ -0,0 +1,207 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package opennlp.tools.postag; + +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Objects; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * A mapping implementation for converting between different POS tag formats. + * This class supports conversion between Penn Treebank (PENN) and Universal Dependencies (UD) formats. + * The conversion is based on the <a href="https://universaldependencies.org/tagset-conversion/en-penn-uposf.html">Universal Dependencies conversion table</a>. + * Please note that when converting from UD to Penn format, there may be ambiguity in some cases.
+ */ +public class POSTagFormatMapper { + + private static final Logger logger = LoggerFactory.getLogger(POSTagFormatMapper.class); + + private static final Map<String, String> CONVERSION_TABLE_PENN_TO_UD = new HashMap<>(); + private static final Map<String, String> CONVERSION_TABLE_UD_TO_PENN = new HashMap<>(); + + static { +/* + * This is a conversion table to convert PENN to UD format as described in + * https://universaldependencies.org/tagset-conversion/en-penn-uposf.html + */ +CONVERSION_TABLE_PENN_TO_UD.put("#", "SYM"); +CONVERSION_TABLE_PENN_TO_UD.put("$", "SYM"); +CONVERSION_TABLE_PENN_TO_UD.put("''", "PUNCT"); +CONVERSION_TABLE_PENN_TO_UD.put(",", "PUNCT"); +CONVERSION_TABLE_PENN_TO_UD.put("-LRB-", "PUNCT"); +CONVERSION_TABLE_PENN_TO_UD.put("-RRB-", "PUNCT"); +CONVERSION_TABLE_PENN_TO_UD.put(".", "PUNCT"); +CONVERSION_TABLE_PENN_TO_UD.put(":", "PUNCT"); +CONVERSION_TABLE_PENN_TO_UD.put("AFX", "ADJ"); +CONVERSION_TABLE_PENN_TO_UD.put("CC", "CCONJ"); +CONVERSION_TABLE_PENN_TO_UD.put("CD", "NUM"); +CONVERSION_TABLE_PENN_TO_UD.put("DT", "DET"); +CONVERSION_TABLE_PENN_TO_UD.put("EX", "PRON"); +CONVERSION_TABLE_PENN_TO_UD.put("FW", "X"); +CONVERSION_TABLE_PENN_TO_UD.put("HYPH", "PUNCT"); +CONVERSION_TABLE_PENN_TO_UD.put("IN", "ADP"); +CONVERSION_TABLE_PENN_TO_UD.put("JJ", "ADJ"); +CONVERSION_TABLE_PENN_TO_UD.put("JJR", "ADJ"); +CONVERSION_TABLE_PENN_TO_UD.put("JJS", "ADJ"); +CONVERSION_TABLE_PENN_TO_UD.put("LS", "X"); +CONVERSION_TABLE_PENN_TO_UD.put("MD", "VERB"); +CONVERSION_TABLE_PENN_TO_UD.put("NIL", "X"); +CONVERSION_TABLE_PENN_TO_UD.put("NN", "NOUN"); +CONVERSION_TABLE_PENN_TO_UD.put("NNP", "PROPN"); +CONVERSION_TABLE_PENN_TO_UD.put("NNPS", "PROPN"); +CONVERSION_TABLE_PENN_TO_UD.put("NNS", "NOUN"); +CONVERSION_TABLE_PENN_TO_UD.put("PDT", "DET"); +CONVERSION_TABLE_PENN_TO_UD.put("POS", "PART"); +CONVERSION_TABLE_PENN_TO_UD.put("PRP", "PRON"); +CONVERSION_TABLE_PENN_TO_UD.put("PRP$", "DET"); +CONVERSION_TABLE_PENN_TO_UD.put("RB", "ADV"); +CONVERSION_TABLE_PENN_TO_UD.put("RBR", "ADV"); 
+CONVERSION_TABLE_PENN_TO_UD.put("RBS", "ADV"); +CONVERSION_TABLE_PENN_TO_UD.put("RP", "ADP"); +CONVERSION_TABLE_PENN_TO_UD.put("SYM", "SYM"); +CONVERSION_TABLE_PENN_TO_UD.put("TO", "PART"); +CONVERSION_TABLE_PENN_TO_UD.put("UH", "INTJ"); +CONVERSION_TABLE_PENN_TO_UD.put("VB", "VERB"); +CONVERSION_TABLE_PENN_TO_UD.put("VBD", "VERB"); +CONVERSION_TABLE_PENN_TO_UD.put("VBG", "VERB"); +CONVERSION_TABLE_PENN_TO_UD.put("VBN", "VERB"); +CONVERSION_TABLE_PENN_TO_UD.put("VBP", "VERB"); +CONVERSION_TABLE_PENN_TO_UD.put("VBZ", "VERB"); +CONVERSION_TABLE_PENN_TO_UD.put("WDT", "DET"); +CONVERSION_TABLE_PENN_TO_UD.put("WP", "PRON"); +CONVERSION_TABLE_PENN_TO_UD.put("WP$", "DET"); +CONVERSION_TABLE_PENN_TO_UD.put("WRB", "ADV"); + +/* + * Note: The back conversion might lose information. + */ +CONVERSION_TABLE_UD_TO_PENN.put("ADJ", "JJ"); +
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848997#comment-17848997 ] ASF GitHub Bot commented on OPENNLP-1539: - rzo1 commented on PR #601: URL: https://github.com/apache/opennlp/pull/601#issuecomment-2127381263 Note: The currently failing tests are unrelated to the actual change: ```bash Error:ChunkerModelLoaderTest.initResources:43->lambda$initResources$0:47 Runtime java.io.IOException: Server returned HTTP response code: 503 for URL: https://opennlp.sourceforge.net/models-1.5/en-chunker.bin Error: TokenNameFinderModelLoaderTest.initResources:43->lambda$initResources$0:47 Runtime java.io.IOException: Server returned HTTP response code: 503 for URL: https://opennlp.sourceforge.net/models-1.5/en-ner-location.bin Error: TokenNameFinderModelTest.testNERWithPOSModelV15:122->AbstractModelLoaderTest.downloadVersion15Model:41->AbstractModelLoaderTest.downloadModel:57 » IO Server returned HTTP response code: 503 for URL: https://opennlp.sourceforge.net/models-1.5/pt-pos-perceptron.bin ```
[jira] [Commented] (OPENNLP-1539) Introduce parameter for POSTaggerME to configure output POS tag format
[ https://issues.apache.org/jira/browse/OPENNLP-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848994#comment-17848994 ] ASF GitHub Bot commented on OPENNLP-1539: - rzo1 opened a new pull request, #601: URL: https://github.com/apache/opennlp/pull/601 ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [x] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: This is a **Draft** open for feedback on implementing compatibility with older PENN-based POS models from Sourceforge.
[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer
[ https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844159#comment-17844159 ] ASF GitHub Bot commented on OPENNLP-1556: - mawiesne merged PR #600: URL: https://github.com/apache/opennlp/pull/600 > Improve speed of checksum computation in TwoPassDataIndexer > --- > > Key: OPENNLP-1556 > URL: https://issues.apache.org/jira/browse/OPENNLP-1556 > Project: OpenNLP > Issue Type: Improvement > Components: Machine Learning >Affects Versions: 1.9.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Major > Fix For: 2.3.4 > > > For training ML models, all observations (Events) are indexed via > {{TwoPassDataIndexer#index(ObjectStream eventStream)}}. > When #index(..) is run, a tmp file is written and read in again. For the > purpose of checksum validation, instances of HashSumEventStream are used to > validate the content processed. > Based on a rather slow toString() implementation in Event, a cryptographic > (MD5) message digest is computed. This, however, is much slower than simply > computing a checksum (such as a CRC32c value) for both directions > (write/read). The (slowing) effect is more problematic when larger training > corpora are (pre-)processed, that is, indexed in advance. > Aims: > - Speedup the (IO-bound) indexing part prior to the actual CPU-bound training > phase. > - Switch from MD5 to CRC32c, as there is *no* need for a cryptographic hash > function here; it's simply a checksum that is required to decide whether all > bytes written are the same bytes that are read. > - Remove the untested class HashSumEventStream which is just a wrapper for > calling a slow toString() in Event to get some bytes to use for the > computation of a checksum / md. > - Provide a replacement for HashSumEventStream, e.g. 
ChecksumEventStream that > makes use of the faster CRC32c checksum computation, avoiding cryptographic > hash functions such as MD5. > - Make sure all existing tests hold. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer
[ https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843560#comment-17843560 ] ASF GitHub Bot commented on OPENNLP-1556: - jzonthemtn commented on PR #600: URL: https://github.com/apache/opennlp/pull/600#issuecomment-2094769379 Looks awesome! Thanks @mawiesne
[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer
[ https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843539#comment-17843539 ] ASF GitHub Bot commented on OPENNLP-1556: - mawiesne commented on PR #600: URL: https://github.com/apache/opennlp/pull/600#issuecomment-2094703523 Note: Added better JavaDoc, code: unchanged.
[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer
[ https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843537#comment-17843537 ] ASF GitHub Bot commented on OPENNLP-1556: - rzo1 commented on PR #600: URL: https://github.com/apache/opennlp/pull/600#issuecomment-2094698256 https://ci-builds.apache.org/job/OPENnLP/job/eval-tests-configurable/18/
[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer
[ https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843533#comment-17843533 ] ASF GitHub Bot commented on OPENNLP-1556: - mawiesne opened a new pull request, #600: URL: https://github.com/apache/opennlp/pull/600 Change - - adjusts TwoPassDataIndexer to make use of JDK's built-in `CheckedOutputStream` / `CheckedInputStream` for checksum ([CRC32c](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/zip/CRC32C.html)) computations - removes untested class `HashSumEventStream` which is just a wrapper for calling a slow toString() in Event to get some bytes to use for the computation of a checksum - provides a HashSumEventStream replacement: `ChecksumEventStream` which makes use of the faster CRC32c checksum computation, avoiding cryptographic hash functions such as MD5 - adds JUnit tests for ChecksumEventStream Note(s) - 1. A _full_ OpenNLP build is consistently about 3-4s faster on my local machine (61s vs. 57s => ~ -7%). 2. The effect should be greater when processing large(r) training corpora. 3. We should see a difference in a bare metal EVAL build run. Tasks - Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [x] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.
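The write/read checksum comparison described for PR #600 can be sketched with the JDK classes it names. This is not the TwoPassDataIndexer code: the class Crc32cRoundTripSketch, its two methods, and the in-memory byte array standing in for the temporary file are all assumptions made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

/** Hypothetical sketch: compare CRC32C checksums of bytes written and bytes read back. */
public class Crc32cRoundTripSketch {

  /** Writes data to the sink through a CheckedOutputStream and returns the CRC32C value. */
  public static long writeChecksum(byte[] data, ByteArrayOutputStream sink) {
    try (CheckedOutputStream out = new CheckedOutputStream(sink, new CRC32C())) {
      out.write(data);
      out.flush();
      return out.getChecksum().getValue();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads all stored bytes through a CheckedInputStream and returns the CRC32C value. */
  public static long readChecksum(byte[] stored) {
    try (CheckedInputStream in =
             new CheckedInputStream(new ByteArrayInputStream(stored), new CRC32C())) {
      byte[] buffer = new byte[8192];
      while (in.read(buffer) != -1) {
        // each read updates the running checksum; the bytes themselves are discarded here
      }
      return in.getChecksum().getValue();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] events = "outcome=NN w=house\noutcome=VB w=run\n".getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream tmp = new ByteArrayOutputStream(); // stands in for the tmp file
    long written = writeChecksum(events, tmp);
    long read = readChecksum(tmp.toByteArray());
    System.out.println(written == read ? "checksums match" : "checksum mismatch");
  }
}
```

The checksum is updated as a side effect of the normal write and read calls, so no separate pass over the data and no intermediate String representation is needed.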
[jira] [Commented] (OPENNLP-1555) TokenizerME should detect multi-dot abbreviations
[ https://issues.apache.org/jira/browse/OPENNLP-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842864#comment-17842864 ] ASF GitHub Bot commented on OPENNLP-1555: - mawiesne merged PR #599: URL: https://github.com/apache/opennlp/pull/599 > TokenizerME should detect multi-dot abbreviations > - > > Key: OPENNLP-1555 > URL: https://issues.apache.org/jira/browse/OPENNLP-1555 > Project: OpenNLP > Issue Type: Improvement > Components: Tokenizer >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.3.4 > > > TokenizerME should detect and handle multi-dot abbreviations correctly. > Currently, this is not handled correctly. For instance, > German: "z.B." = "zum Beispiel" (for example) or, > Dutch: "e.v." = "en volgende" (and following) > are not tokenized correctly and extra tokens are returned. NOTE: no > whitespaces in between the dots in the above examples. > Aims: > * Fix the detection / handling of abbreviations for multi-dot abbreviations > * Provide test cases that cover these cases -- This message was sent by Atlassian Jira (v8.20.10#820010)
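To make the term "multi-dot abbreviation" concrete, here is a minimal sketch that recognises tokens such as "z.B." or "e.v." by shape alone. It is not how TokenizerME or PR #599 works: the class MultiDotAbbreviationSketch, the method name, and the regular expression are assumptions for illustration, and a lookup in an abbreviation dictionary would be more reliable than a pattern.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: shape-based check for abbreviations with two or more dots. */
public class MultiDotAbbreviationSketch {

  // Two or more "letters followed by a dot" groups, with no whitespace in between.
  private static final Pattern MULTI_DOT = Pattern.compile("(?:\\p{L}+\\.){2,}");

  /** Returns true if the whole token looks like a multi-dot abbreviation. */
  public static boolean isMultiDotAbbreviation(String token) {
    return MULTI_DOT.matcher(token).matches();
  }

  public static void main(String[] args) {
    for (String token : new String[] {"z.B.", "e.v.", "p.", "z. B."}) {
      System.out.println(token + " -> " + isMultiDotAbbreviation(token));
    }
  }
}
```

The pattern is deliberately loose; a sentence-final word glued to the next one, such as "end.Next.", would also match, which is one reason a dictionary-based check is preferable in practice.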
[jira] [Commented] (OPENNLP-1555) TokenizerME should detect multi-dot abbreviations
[ https://issues.apache.org/jira/browse/OPENNLP-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841646#comment-17841646 ] ASF GitHub Bot commented on OPENNLP-1555: - mawiesne opened a new pull request, #599: URL: https://github.com/apache/opennlp/pull/599 Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [x] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? ### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.
[jira] [Commented] (OPENNLP-1554) Add Dutch abbreviation dictionary
[ https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839321#comment-17839321 ] ASF GitHub Bot commented on OPENNLP-1554: - mawiesne merged PR #597: URL: https://github.com/apache/opennlp/pull/597 > Add Dutch abbreviation dictionary > - > > Key: OPENNLP-1554 > URL: https://issues.apache.org/jira/browse/OPENNLP-1554 > Project: OpenNLP > Issue Type: Improvement > Components: Sentence Detector, Tokenizer >Affects Versions: 2.3.2 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.3.3 > > > Similar to the addition in OPENNLP-1526, an abbreviation dictionary for Dutch > sentence detection and tokenisation might be beneficial. > Aims: > Create and add a new file {{abb_NL.xml}} in {{opennlp-tools/lang/nl}} > Add basic set of test cases -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1554) Add Dutch abbreviation dictionary
[ https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839131#comment-17839131 ] ASF GitHub Bot commented on OPENNLP-1554: - kinow commented on code in PR #597: URL: https://github.com/apache/opennlp/pull/597#discussion_r1573009751 ## opennlp-tools/src/test/java/opennlp/tools/sentdetect/AbstractSentenceDetectorTest.java: ## @@ -30,13 +30,16 @@ public abstract class AbstractSentenceDetectorTest { - protected static final Locale LOCALE_SPANISH = new Locale("es"); + protected static final Locale LOCALE_DUTCH = new Locale("nl"); protected static final Locale LOCALE_POLISH = new Locale("pl"); protected static final Locale LOCALE_PORTUGUESE = new Locale("pt"); + protected static final Locale LOCALE_SPANISH = new Locale("es"); Review Comment: The small details that normally pass unnoticed! Thanks for sorting alphabetically :clap: ## opennlp-tools/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEDutchTest.java: ## @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package opennlp.tools.sentdetect; + +import java.io.IOException; + +import org.junit.jupiter.api.Assertions; +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.Test; + +import opennlp.tools.dictionary.Dictionary; + +/** + * Tests for the {@link SentenceDetectorME} class. + * + * Demonstrates OPENNLP-1554. + * + * In this context, well-known known Dutch (nl_NL) abbreviations must be respected, + * so that words abbreviated with one or more '.' characters do not + * result in incorrect sentence boundaries. + * + * See: + * https://issues.apache.org/jira/projects/OPENNLP/issues/OPENNLP-1554;>OPENNLP-1554 + */ +public class SentenceDetectorMEDutchTest extends AbstractSentenceDetectorTest { + + private static final char[] EOS_CHARS = {'.', '?', '!'}; + + private static SentenceModel sentdetectModel; + + @BeforeAll + public static void prepareResources() throws IOException { +Dictionary abbreviationDict = loadAbbDictionary(LOCALE_DUTCH); +SentenceDetectorFactory factory = new SentenceDetectorFactory( +"dut", true, abbreviationDict, EOS_CHARS); +sentdetectModel = train(factory, LOCALE_DUTCH); +Assertions.assertNotNull(sentdetectModel); +Assertions.assertEquals("dut", sentdetectModel.getLanguage()); + } + + // Example taken from 'Sentences_NL.txt' + @Test + void testSentDetectWithInlineAbbreviationsEx1() { +final String sent1 = "Een droom, tot de vorming waarvan een bijzonder sterke compressie " + +"heeft bijgedragen, zal het meest gunstige materiaal zijn voor dit onderzoek."; +// Here we have one abbreviations "p." => pagina (page) +final String sent2 = "Ik kies voor de droom van de botanische monografie die " + +"op p. 
183 en volgende wordt beschreven."; + +SentenceDetectorME sentDetect = new SentenceDetectorME(sentdetectModel); +String sampleSentences = sent1 + " " + sent2; +String[] sents = sentDetect.sentDetect(sampleSentences); +Assertions.assertEquals(2, sents.length); +Assertions.assertEquals(sent1, sents[0]); +Assertions.assertEquals(sent2, sents[1]); +double[] probs = sentDetect.getSentenceProbabilities(); +Assertions.assertEquals(2, probs.length); + } + + // Reduced example taken from 'Sentences_NL.txt' + @Test + void testSentDetectWithInlineAbbreviationsEx2() { +// Here we have one abbreviations: "d.w.z." = dat wil zeggen (eng.: that is to say) +final String sent1 = "Met het oog op de overvloed aan ideeën die de analyse op elk " + +"afzonderlijk element van de droominhoud brengt, zullen sommige lezers twijfels " + +"hebben over het principe of alles wat later tijdens de analyse in je opkomt, " + +"tot de droomgedachten gerekend mag worden, d.w.z. of aangenomen mag worden " + +"dat al deze gedachten al tijdens de slaaptoestand actief waren en bijdroegen " + +"aan de vorming van de droom?"; Review Comment: I was reviewing a pull request in Jena today, and noticed the `"""` for
[jira] [Commented] (OPENNLP-1554) Add Dutch abbreviation dictionary
[ https://issues.apache.org/jira/browse/OPENNLP-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839093#comment-17839093 ] ASF GitHub Bot commented on OPENNLP-1554: - mawiesne opened a new pull request, #597: URL: https://github.com/apache/opennlp/pull/597 Change - - adds abb_NL.xml to opennlp-tools/lang for the Dutch language, by using these resources: https://jedutchy.com/resources/abbreviations, https://script.byu.edu/dutch-handwriting/tools/abbreviations, https://en.wikipedia.org/wiki/Date_and_time_notation_in_the_Netherlands - adds new test cases for the DUT/NLD localization Tasks - Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message? - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character. - [x] Has your PR been rebased against the latest commit within the target branch (typically main)? - [x] Is your initial contribution a single, squashed commit? ### For code changes: - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder? - [x] Have you written or updated unit tests to verify your changes? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder? - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder? ### For documentation related changes: - [ ] Have you ensured that format looks appropriate for the output in which it is rendered? 
### Note: Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible. > Add Dutch abbreviation dictionary > - > > Key: OPENNLP-1554 > URL: https://issues.apache.org/jira/browse/OPENNLP-1554 > Project: OpenNLP > Issue Type: Improvement > Components: Sentence Detector, Tokenizer >Affects Versions: 2.3.2 >Reporter: Martin Wiesner >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.3.3 > > > Similar to the addition in OPENNLP-1526, an abbreviation dictionary for Dutch > sentence detection and tokenisation might be beneficial. > Aims: > Create and add a new file {{abb_NL.xml}} in {{opennlp-tools/lang/nl}} > Add basic set of test cases -- This message was sent by Atlassian Jira (v8.20.10#820010)
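To illustrate why an abbreviation dictionary helps sentence detection, here is a minimal rule-based sketch in plain Java. It is not how `SentenceDetectorME` works (that class relies on a trained model); `AbbreviationAwareSplitter` and its `split` method are hypothetical names made up for this illustration, and the entries `p.` and `d.w.z.` are simply the two abbreviations that appear in the Dutch test sentences of PR #597, not the contents of `abb_NL.xml`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of OpenNLP: a naive splitter that refuses to
// end a sentence on a token listed in a (tiny, illustrative) abbreviation set.
final class AbbreviationAwareSplitter {

  private static final String EOS_CHARS = ".?!";

  private final Set<String> abbreviations;

  AbbreviationAwareSplitter(Set<String> abbreviations) {
    this.abbreviations = abbreviations;
  }

  List<String> split(String text) {
    List<String> sentences = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < text.length(); i++) {
      if (EOS_CHARS.indexOf(text.charAt(i)) < 0) {
        continue; // not a candidate end-of-sentence character
      }
      boolean atEnd = i == text.length() - 1;
      boolean beforeSpace = !atEnd && Character.isWhitespace(text.charAt(i + 1));
      if (!atEnd && !beforeSpace) {
        continue; // e.g. the inner dots of "d.w.z."
      }
      // Find the whitespace-delimited token that ends at position i.
      int tokenStart = i;
      while (tokenStart > start && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
        tokenStart--;
      }
      String token = text.substring(tokenStart, i + 1).toLowerCase(Locale.ROOT);
      if (abbreviations.contains(token)) {
        continue; // known abbreviation: do not treat the dot as a boundary
      }
      sentences.add(text.substring(start, i + 1).trim());
      start = i + 1;
    }
    // Keep any trailing text that did not end with an end-of-sentence character.
    if (start < text.length() && !text.substring(start).isBlank()) {
      sentences.add(text.substring(start).trim());
    }
    return sentences;
  }

  public static void main(String[] args) {
    AbbreviationAwareSplitter splitter =
        new AbbreviationAwareSplitter(Set.of("p.", "d.w.z."));
    String text = "Ik kies voor de droom die op p. 183 wordt beschreven. Dat is alles.";
    splitter.split(text).forEach(System.out::println);
  }
}
```

With an empty abbreviation set the same input is cut after `p.`, which is the kind of incorrect boundary the dictionary is meant to prevent.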
[jira] [Commented] (OPENNLP-589) Text format of Events inconsistent across different implementations of EventStreamReaders
[ https://issues.apache.org/jira/browse/OPENNLP-589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838651#comment-17838651 ] ASF GitHub Bot commented on OPENNLP-589: mawiesne merged PR #596: URL: https://github.com/apache/opennlp/pull/596 > Text format of Events inconsistent across different implementations of > EventStreamReaders > - > > Key: OPENNLP-589 > URL: https://issues.apache.org/jira/browse/OPENNLP-589 > Project: OpenNLP > Issue Type: Bug > Components: Machine Learning >Affects Versions: maxent-3.0.3, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Marcin Junczys-Dowmunt >Assignee: Martin Wiesner >Priority: Minor > Fix For: 2.3.3 > > > BasicEventStream expects events to be written to text files as: > context1 context2 context3 ... outcome > FileEventStream expects events to be written to text files as: > outcome context1 context2 context3 ... > toString() of Event creates: > outcome [context1 context2 context3 ...] (note the square brackets, which are > part of context predicates when breaking on spaces). > This is highly confusing and took me some time to understand. I guess this > should be unified? -- This message was sent by Atlassian Jira (v8.20.10#820010)
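The three text layouts described in OPENNLP-589 can be illustrated with a small, self-contained Java sketch. It does not use the OpenNLP classes; `EventTextFormats`, `SimpleEvent` and the three helper methods are hypothetical names invented for this illustration, and the layouts are taken only from the issue description above. The sketch shows why a line printed in the bracketed `toString()` style does not parse cleanly when it is split on spaces.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the three layouts named in OPENNLP-589.
// None of these types or methods belong to OpenNLP.
final class EventTextFormats {

  // Stand-in for an event: one outcome plus its context predicates.
  record SimpleEvent(String outcome, List<String> context) { }

  // Layout "context1 context2 context3 ... outcome"
  // (the one the issue attributes to BasicEventStream).
  static SimpleEvent parseOutcomeLast(String line) {
    String[] tokens = line.trim().split("\\s+");
    String[] context = Arrays.copyOfRange(tokens, 0, tokens.length - 1);
    return new SimpleEvent(tokens[tokens.length - 1], Arrays.asList(context));
  }

  // Layout "outcome context1 context2 context3 ..."
  // (the one the issue attributes to FileEventStream).
  static SimpleEvent parseOutcomeFirst(String line) {
    String[] tokens = line.trim().split("\\s+");
    String[] context = Arrays.copyOfRange(tokens, 1, tokens.length);
    return new SimpleEvent(tokens[0], Arrays.asList(context));
  }

  // Layout "outcome [context1 context2 context3 ...]"
  // (the one the issue attributes to Event.toString()).
  static String toBracketedString(SimpleEvent event) {
    return event.outcome() + " [" + String.join(" ", event.context()) + "]";
  }

  public static void main(String[] args) {
    SimpleEvent event = new SimpleEvent("yes", List.of("a", "b", "c"));
    String printed = toBracketedString(event);
    System.out.println(printed);
    // Splitting the bracketed form on spaces leaves "[" and "]" glued to
    // the first and last context predicates.
    System.out.println(parseOutcomeFirst(printed).context());
  }
}
```

Reading the bracketed line back with the outcome-first parser yields the predicates `[a`, `b` and `c]`, which matches the reporter's remark that the square brackets become part of the context predicates.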
[jira] [Commented] (OPENNLP-1546) NER training code example in documentation requires update
[ https://issues.apache.org/jira/browse/OPENNLP-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837137#comment-17837137 ] ASF GitHub Bot commented on OPENNLP-1546: - mawiesne merged PR #595: URL: https://github.com/apache/opennlp/pull/595 > NER training code example in documentation requires update > -- > > Key: OPENNLP-1546 > URL: https://issues.apache.org/jira/browse/OPENNLP-1546 > Project: OpenNLP > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Jeff Zemerick >Assignee: Martin Wiesner >Priority: Major > Fix For: 2.3.3 > > > The NER training code example needs to be updated. > [https://opennlp.apache.org/docs/2.3.2/manual/opennlp.html#tools.namefind.training.api] > * The `TokenNameFinderFactory nameFinderFactory` part won't compile. > * The `model.serialize(...)` part won't compile. > * This code might be outdated in general. > {code:java} > ObjectStream lineStream = > new PlainTextByLineStream(new > MarkableFileInputStreamFactory(new File("en-ner-person.train")), > StandardCharsets.UTF_8); > TokenNameFinderModel model; > try (ObjectStream sampleStream = new > NameSampleDataStream(lineStream)) { > model = NameFinderME.train("eng", "person", sampleStream, > TrainingParameters.defaultParams(), nameFinderFactory); > } > try (ObjectStream modelOut = new BufferedOutputStream(new > FileOutputStream(modelFile)){ > model.serialize(modelOut); > } >{code} > For reference (but not tested): > {code:java} > final InputStreamFactory in = new > MarkableFileInputStreamFactory(convertedTrainingFile); > final ObjectStream sampleStream = new > NameSampleDataStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8)); > final TokenNameFinderModel nameFinderModel = NameFinderME.train("en", > null, sampleStream, TrainingParameters.defaultParams(), > TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new > BioCodec())); {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)