Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-28 Thread Jeff Zemerick
Thanks for the PR! I just merged it. I'm glad this will be in the 2.1 release, which should go out for vote next week. Thanks, Jeff On Tue, Oct 25, 2022 at 2:31 AM Richard Zowalla wrote: > Hi, > > here is a PR by my colleague Martin W.: > https://github.com/apache/opennlp/pull/427 > > Some more d

[GitHub] [opennlp] jzonthemtn merged pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-28 Thread GitBox
jzonthemtn merged PR #427: URL: https://github.com/apache/opennlp/pull/427 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@opennlp.apach

[GitHub] [opennlp] mawiesne commented on pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-28 Thread GitBox
mawiesne commented on PR #427: URL: https://github.com/apache/opennlp/pull/427#issuecomment-1294606427 FYI and for completeness: I tested the patch against the German treebank resources available from: https://universaldependencies.org/#language. - The resulting model files could be

[GitHub] [opennlp] kinow commented on pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-27 Thread GitBox
kinow commented on PR #427: URL: https://github.com/apache/opennlp/pull/427#issuecomment-1294547409 > @jzonthemtn Could you trigger & approve the (build) workflow, 'cause I can't. Thx in advance. Done

[GitHub] [opennlp] mawiesne commented on pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-27 Thread GitBox
mawiesne commented on PR #427: URL: https://github.com/apache/opennlp/pull/427#issuecomment-1294518254 @jzonthemtn Could you trigger & approve the (build) workflow, 'cause I can't. Thx in advance.

[GitHub] [opennlp] mawiesne commented on pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-26 Thread GitBox
mawiesne commented on PR #427: URL: https://github.com/apache/opennlp/pull/427#issuecomment-1292436503 @atarora The JUnit test has 🛩️ and gives a 💯% line coverage for `ModelParameterChunker`.
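The previews in this thread do not show what `ModelParameterChunker` contains. As a rough illustration of the general idea of chunking, the sketch below splits a long string into pieces that each fit within the `writeUTF()` limit and reassembles them on read. The class `ChunkedUtfSketch`, its method names, and the chunk-count prefix are all hypothetical; this is not the code from PR #427, and it makes no attempt to stay compatible with existing model files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch; not the ModelParameterChunker from the PR.
public class ChunkedUtfSketch {

    // writeUTF() encodes each char in at most 3 bytes and allows at most
    // 65535 encoded bytes, so 21845 chars per chunk is always within the limit.
    static final int MAX_CHARS_PER_CHUNK = 65_535 / 3;

    // Writes the number of chunks, then each chunk via writeUTF().
    static void writeChunked(DataOutputStream out, String value) throws IOException {
        int chunkCount = (value.length() + MAX_CHARS_PER_CHUNK - 1) / MAX_CHARS_PER_CHUNK;
        out.writeInt(chunkCount);
        for (int i = 0; i < chunkCount; i++) {
            int start = i * MAX_CHARS_PER_CHUNK;
            int end = Math.min(value.length(), start + MAX_CHARS_PER_CHUNK);
            out.writeUTF(value.substring(start, end));
        }
    }

    // Reads the chunk count, then concatenates the chunks in order.
    static String readChunked(DataInputStream in) throws IOException {
        int chunkCount = in.readInt();
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < chunkCount; i++) {
            result.append(in.readUTF());
        }
        return result.toString();
    }

    // Convenience helper: write to memory and read back.
    static String roundTrip(String value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeChunked(out, value);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return readChunked(in);
        }
    }

    public static void main(String[] args) throws IOException {
        String large = "x".repeat(200_000); // String.repeat needs Java 11+
        System.out.println(roundTrip(large).equals(large));
    }
}
```

Splitting at an arbitrary char index is safe here because `writeUTF()` encodes every `char` on its own, including each half of a surrogate pair.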

[GitHub] [opennlp] mawiesne commented on pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-26 Thread GitBox
mawiesne commented on PR #427: URL: https://github.com/apache/opennlp/pull/427#issuecomment-1292384823 @atarora I'll provide a basic JUnit test for the `ModelParameterChunker` class. The more complex scenario resembles a full system test; at least it is an integration test setup that is qui

[GitHub] [opennlp] atarora commented on pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-26 Thread GitBox
atarora commented on PR #427: URL: https://github.com/apache/opennlp/pull/427#issuecomment-1292233015 The PR looks good, except for missing test cases! Can those be added?

[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-26 Thread GitBox
jzonthemtn commented on code in PR #427: URL: https://github.com/apache/opennlp/pull/427#discussion_r1005576541 ## opennlp-tools/src/main/java/opennlp/tools/ml/model/ModelParameterChunker.java: ## @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under o

[GitHub] [opennlp] mawiesne commented on a diff in pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-25 Thread GitBox
mawiesne commented on code in PR #427: URL: https://github.com/apache/opennlp/pull/427#discussion_r1005268338 ## opennlp-tools/src/main/java/opennlp/tools/ml/model/ModelParameterChunker.java: ## @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

[GitHub] [opennlp] jzonthemtn commented on pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-25 Thread GitBox
jzonthemtn commented on PR #427: URL: https://github.com/apache/opennlp/pull/427#issuecomment-1290958387 @mawiesne This looks great! Thanks for the PR!

[GitHub] [opennlp] jzonthemtn commented on a diff in pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-25 Thread GitBox
jzonthemtn commented on code in PR #427: URL: https://github.com/apache/opennlp/pull/427#discussion_r1004820481 ## opennlp-tools/src/main/java/opennlp/tools/ml/model/ModelParameterChunker.java: ## @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under o

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-24 Thread Richard Zowalla
Hi, here is a PR by my colleague Martin W.: https://github.com/apache/opennlp/pull/427 Some more details are contained in https://issues.apache.org/jira/browse/OPENNLP-1366 The change is tested with the huge corpus on the HPC system. Regards, Richard Z Am Freitag, dem 14.10.2022 um 08:18 +0200 sc

[GitHub] [opennlp] mawiesne opened a new pull request, #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-24 Thread GitBox
mawiesne opened a new pull request, #427: URL: https://github.com/apache/opennlp/pull/427 Thank you for contributing to Apache OpenNLP. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken: ### For all changes: - [x

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-10-13 Thread Richard Zowalla
Hi Jeff, just to drop a short notice on that one: My colleague, who is affected by this, is preparing a PR (might take some time though because of testing on the HPC system...), which will hopefully solve reading / writing "large" models without breaking existing ones in the process. Regards, Richar

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-07-28 Thread Zowalla, Richard
Hi Jeff, no real updates from our side. We were quite busy in the last weeks finishing and correcting student course work ;) My last status in this matter is: The change from writeUTF to writeShort worked. Training and writing the MaxEnt model just worked for this huge corpus. No (runtime) errors

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-07-27 Thread Jeff Zemerick
Hi Richard, I know it's been a while but I wanted to circle back to this to see if there are any updates. Thanks, Jeff On Mon, Apr 25, 2022 at 1:48 PM Richard Eckart de Castilho wrote: > Hi, > > > On 11. Apr 2022, at 14:50, Zowalla, Richard < > richard.zowa...@hs-heilbronn.de> wrote: > > > > T

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-25 Thread Richard Eckart de Castilho
Hi, > On 11. Apr 2022, at 14:50, Zowalla, Richard > wrote: > > This works fine for mid-size corpora (just need a little bit of RAM and > time). However, we are running into the exception mentioned in [1]. > Debugging into the DataOutputStream reveals that this is a limitation > of the java.io.

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-18 Thread Jeff Zemerick
Thanks for trying it and for all the info! I will check it out and let you know. Thanks, Jeff On Sun, Apr 17, 2022 at 12:51 PM Zowalla, Richard < richard.zowa...@hs-heilbronn.de> wrote: > Hi Jeff, > > he did the validation again and it showed that the IDE used an older > version of OpenNLP. > >

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-17 Thread Zowalla, Richard
Hi Jeff, he did the validation again and it showed that the IDE used an older version of OpenNLP. After a clean build with the freshly created SNAPSHOT, the model load resulted in another exception (which now looks reasonable to me). He updated his comment in [1]. Maybe you have an idea :) Th

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-17 Thread Zowalla, Richard
Hi Jeff, reading the stack trace myself (now), I think that an outdated snapshot was included for this test (as it doesn't fit the code). I will report back if this is the case and Maven / Gradle / IDE did something weird. Sorry & regards, Richard Am Sonntag, dem 17.04.2022 um 16:26 + schrieb

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-17 Thread Zowalla, Richard
Hi Jeff, the task completed and we have some feedback. My colleague directly commented in the related commit [1]. Writing the model seems to work but reading the resulting model fails. Regards, Richard [1] https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702#commitcom

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-12 Thread Zowalla, Richard
Hi Jeff, thanks for the update. We will give the change a try with a SNAPSHOT build including the potential patch and start a run on the cluster with the Tübingen Wikipedia Treebank. Guess we will have feedback in ~ 48 hours regarding writeShort(...). Regards, Richard Am Dienstag, dem 12.04.2022 u

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-12 Thread Jeff Zemerick
Luckily, this looks like it has been a common problem [1] with writeUTF() for years. According to other guidance and the function's Javadoc [2], writeUTF() first writes the number of encoded bytes and then the string itself. Changing it to manually write the length of the string followed by write() allows the tra
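The approach described in this message, writing the length manually and then the raw bytes with write(), can be sketched as follows. This is a minimal illustration, not the change that was committed to OpenNLP: the class `LengthPrefixedStrings` and its method names are hypothetical. The thread mentions writeShort(); the sketch assumes a 4-byte `writeInt()` prefix instead, so that the prefix itself does not cap the length at 16 bits. It also uses standard UTF-8 rather than the modified UTF-8 of `writeUTF()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a length-prefixed string format.
public class LengthPrefixedStrings {

    // Writes a 4-byte length followed by the UTF-8 bytes of the string.
    static void writeString(DataOutputStream out, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads back exactly what writeString() produced.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Convenience helper: write to memory and read back.
    static String roundTrip(String value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeString(out, value);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return readString(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // 100,000 two-byte characters: far beyond what writeUTF() accepts.
        String large = "\u00e4".repeat(100_000); // String.repeat needs Java 11+
        System.out.println(roundTrip(large).equals(large)); // prints "true"
    }
}
```

Any change of this kind to the writer needs a matching change in the reader, since data written this way can no longer be read with `readUTF()`.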

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Jeff Zemerick
Great, thanks. I was able to reproduce the problem. I'll take a look and keep this thread updated. Thanks, Jeff On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard < richard.zowa...@hs-heilbronn.de> wrote: > Hi Jeff, > > thanks for the quick reply. Here it is: > https://issues.apache.org/jira/brow

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Zowalla, Richard
Hi Jeff, thanks for the quick reply. Here it is: https://issues.apache.org/jira/browse/OPENNLP-1366 Using the treebank from Tübingen might not be feasible as it consumes around 2 TB RAM ;) - the mentioned link in the ticket points to a smaller dataset, which should reproduce the issue with a fea

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Jeff Zemerick
Hi Richard, Thanks for reporting this. A Jira issue with steps to reproduce it would be fantastic. https://issues.apache.org/jira/projects/OPENNLP Please create one and reply back here with its ID once you do. I can take a look and see what can be done. Thanks, Jeff On Mon, Apr 11, 2022 at 8:47

Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

2022-04-11 Thread Zowalla, Richard
Hi all, we are working on training a large OpenNLP MaxEnt model for lemmatizing German texts. We use a Wikipedia treebank from Tübingen. This works fine for mid-size corpora (just need a little bit of RAM and time). However, we are running into the exception mentioned in [1]. Debugging into the
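The exception in the subject line can be reproduced without OpenNLP or a large corpus. `DataOutputStream.writeUTF()` stores the encoded length in two bytes, so it rejects any string whose encoded form exceeds 65535 bytes. The following is a minimal sketch of that limit only; the class `WriteUtfLimitDemo` and its helper are hypothetical and do not reflect how OpenNLP writes its models.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Hypothetical demo of the writeUTF() size limit.
public class WriteUtfLimitDemo {

    // Returns true if writeUTF() accepts an ASCII string of the given length
    // (ASCII chars are encoded as one byte each).
    static boolean canWriteUtf(int length) throws IOException {
        String value = "a".repeat(length); // String.repeat needs Java 11+
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(value);
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded string is longer than 65535 bytes.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(canWriteUtf(65_535)); // prints "true"
        System.out.println(canWriteUtf(65_536)); // prints "false"
    }
}
```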
