Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Jeff Zemerick
Great, thanks. I was able to reproduce the problem. I'll take a look and keep this thread updated. Thanks, Jeff On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard < richard.zowa...@hs-heilbronn.de> wrote: > Hi Jeff, > > thanks for the quick reply. Here it is: > https://issues.apache.org/jira/brow

[GitHub] [opennlp] jzonthemtn opened a new pull request, #411: OPENNLP-1354: Fixing javadoc generation on Java 11.

2022-04-11 Thread GitBox
jzonthemtn opened a new pull request, #411: URL: https://github.com/apache/opennlp/pull/411 I ran into this issue while cutting the 2.0 release. Javadocs would not build on JDK 11 without these changes. The source 8 setting looks odd since we're on Java 11, but it is a recommended workaround for the

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Zowalla, Richard
Hi Jeff, thanks for the quick reply. Here it is: https://issues.apache.org/jira/browse/OPENNLP-1366 Using the treebank from Tübingen might not be feasible as it consumes around 2 TB RAM ;) - the mentioned link in the ticket points to a smaller dataset, which should reproduce the issue with a fea

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Jeff Zemerick
Hi Richard, Thanks for reporting this. A Jira issue with steps to reproduce it would be fantastic. https://issues.apache.org/jira/projects/OPENNLP Please create one and reply back here with its ID once you do. I can take a look and see what can be done. Thanks, Jeff On Mon, Apr 11, 2022 at 8:47

Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Zowalla, Richard
Hi all, we are working on training a large OpenNLP MaxEnt model for lemmatizing German texts. We use a Wikipedia treebank from Tübingen. This works fine for mid-size corpora (it just needs a little RAM and time). However, we are running into the exception mentioned in [1]. Debugging into the

[GitHub] [opennlp] jzonthemtn merged pull request #410: OPENNLP-1351: Moving onnx models for testing. Fixing expected value.

2022-04-11 Thread GitBox
jzonthemtn merged PR #410: URL: https://github.com/apache/opennlp/pull/410 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@opennlp.apach

[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #410: OPENNLP-1351: Moving onnx models for testing. Fixing expected value.

2022-04-11 Thread GitBox
jzonthemtn commented on code in PR #410: URL: https://github.com/apache/opennlp/pull/410#discussion_r847234944 ## opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java: ## @@ -54,7 +54,7 @@ public void tokenNameFinder1Test() throws Exception { Assert.assertEq