Thanks for trying it and for all the info! I will check it out and let you know.
Thanks,
Jeff

On Sun, Apr 17, 2022 at 12:51 PM Zowalla, Richard <richard.zowa...@hs-heilbronn.de> wrote:

> Hi Jeff,
>
> My colleague ran the validation again, and it showed that the IDE had
> used an older version of OpenNLP.
>
> After a clean build with the freshly created SNAPSHOT, loading the
> model resulted in a different exception (which now looks reasonable
> to me).
>
> He updated his comment in [1]. Maybe you have an idea :)
>
> Thanks
> Richard
>
> [1] https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702
>
> On Sunday, 17.04.2022 at 16:33 +0000, Zowalla, Richard wrote:
> > Hi Jeff,
> >
> > Having now read the stack trace myself, I think an outdated
> > snapshot was included for this test (the trace doesn't fit the
> > code).
> >
> > I will report back if this is the case and Maven / Gradle / IDE
> > did something weird.
> >
> > Sorry & regards
> > Richard
> >
> > On Sunday, 17.04.2022 at 16:26 +0000, Zowalla, Richard wrote:
> > > Hi Jeff,
> > >
> > > The task completed and we have some feedback.
> > >
> > > My colleague commented directly on the related commit [1].
> > >
> > > Writing the model seems to work, but reading the resulting model
> > > fails.
> > >
> > > Regards
> > > Richard
> > >
> > > [1] https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702#commitcomment-71463963
> > >
> > > On Tuesday, 12.04.2022 at 14:40 +0000, Zowalla, Richard wrote:
> > > > Hi Jeff,
> > > >
> > > > Thanks for the update.
> > > >
> > > > We will try the change with a SNAPSHOT build that includes the
> > > > potential patch and start a run on the cluster with the
> > > > Tübingen Wikipedia Treebank. I guess we will have feedback
> > > > regarding writeShort(...) in ~48 hours.
> > > >
> > > > Regards
> > > > Richard
> > > >
> > > > On Tuesday, 12.04.2022 at 08:09 -0500, Jeff Zemerick wrote:
> > > > > Luckily, this looks like it has been a common problem with
> > > > > writeUTF() for years [1]. According to other guidance and
> > > > > the method's Javadoc [2], writeUTF() writes the number of
> > > > > encoded bytes followed by the string. Changing the code to
> > > > > manually write the length of the string followed by write()
> > > > > allows the training to succeed. All unit tests pass, which
> > > > > seems to indicate the change is backward compatible, since
> > > > > some unit tests load models from src/test/resources/, but I
> > > > > want to verify that further to be sure.
> > > > >
> > > > > Here are the changes:
> > > > > https://github.com/apache/opennlp/compare/master...jzonthemtn:OPENNLP-1366?expand=1
> > > > >
> > > > > I am unsure about using writeShort() to write the length of
> > > > > the string. Even though it works for the UD data now, does
> > > > > it actually resolve the problem?
> > > > >
> > > > > Does anyone have any insights into this?
> > > > >
> > > > > Thanks,
> > > > > Jeff
> > > > >
> > > > > [1] https://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> > > > > [2] https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)
> > > > >
> > > > > On Mon, Apr 11, 2022 at 1:41 PM Jeff Zemerick <jzemer...@apache.org> wrote:
> > > > >
> > > > > > Great, thanks. I was able to reproduce the problem. I'll
> > > > > > take a look and keep this thread updated.
> > > > > >
> > > > > > Thanks,
> > > > > > Jeff
> > > > > >
> > > > > > On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard <richard.zowa...@hs-heilbronn.de> wrote:
> > > > > >
> > > > > > > Hi Jeff,
> > > > > > >
> > > > > > > Thanks for the quick reply. Here it is:
> > > > > > > https://issues.apache.org/jira/browse/OPENNLP-1366
> > > > > > >
> > > > > > > Using the treebank from Tübingen might not be feasible,
> > > > > > > as it consumes around 2 TB of RAM ;) - the link in the
> > > > > > > ticket points to a smaller dataset, which should
> > > > > > > reproduce the issue with a feasible amount of RAM.
> > > > > > >
> > > > > > > It basically boils down to a size limitation in the
> > > > > > > JDK's DataOutputStream.
> > > > > > >
> > > > > > > Regards
> > > > > > > Richard
> > > > > > >
> > > > > > > On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> > > > > > > > Hi Richard,
> > > > > > > >
> > > > > > > > Thanks for reporting this. A Jira issue with steps to
> > > > > > > > reproduce it would be fantastic.
> > > > > > > > https://issues.apache.org/jira/projects/OPENNLP
> > > > > > > >
> > > > > > > > Please create one and reply back here with its ID once
> > > > > > > > you do. I can take a look and see what can be done.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jeff
> > > > > > > >
> > > > > > > > On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <richard.zowa...@hs-heilbronn.de> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > We are working on training a large OpenNLP maxent
> > > > > > > > > model for lemmatizing German texts. We use a
> > > > > > > > > Wikipedia treebank from Tübingen.
> > > > > > > > >
> > > > > > > > > This works fine for mid-size corpora (it just needs
> > > > > > > > > a little bit of RAM and time). However, we are
> > > > > > > > > running into the exception mentioned in [1].
> > > > > > > > > Debugging into the DataOutputStream reveals that
> > > > > > > > > this is a limitation of java.io.DataOutputStream.
> > > > > > > > >
> > > > > > > > > Do we have any chance to solve this, or do we need
> > > > > > > > > to implement custom readers / writers to get it to
> > > > > > > > > work?
> > > > > > > > >
> > > > > > > > > If this is a general problem for large corpora, I am
> > > > > > > > > also happy to create a related ticket / issue in
> > > > > > > > > Jira with steps to reproduce ;)
> > > > > > > > >
> > > > > > > > > Thanks in advance.
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > > Richard
> > > > > > > > >
> > > > > > > > > [1] https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> > > > > > > > > --
> > > > > > > > > Richard Zowalla, M.Sc.
> > > > > > > > > Research Associate, PhD Student | Medical Informatics
> > > > > > > > >
> > > > > > > > > Hochschule Heilbronn – University of Applied Sciences
> > > > > > > > > Max-Planck-Str. 39
> > > > > > > > > D-74081 Heilbronn
> > > > > > > > > phone: +49 7131 504 6791 (currently not reachable by phone)
> > > > > > > > > mail: richard.zowa...@hs-heilbronn.de
> > > > > > > > > web: https://www.mi.hs-heilbronn.de/
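
For readers following the thread: the DataOutputStream limit and the writeShort() question discussed above can be illustrated with a small, self-contained sketch. This is not the OpenNLP patch. The class LongStringIo and its helper methods are hypothetical names chosen for illustration, and the 4-byte length prefix is just one possible layout (it is not readable by readUTF(), so it would not be backward compatible with existing models on its own). The sketch assumes Java 11 or later for String.repeat().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of OpenNLP.
public class LongStringIo {

    // Writes a 4-byte length prefix followed by the raw UTF-8 bytes.
    public static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads a string written by writeLongString().
    public static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if writeUTF() rejects the string as too long.
    public static boolean writeUtfFails(String s) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    // Writes the string to an in-memory buffer and reads it back.
    public static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongString(out, s);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return readLongString(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // 70,000 ASCII characters encode to 70,000 bytes, above the 65,535-byte writeUTF() limit.
        String big = "a".repeat(70_000);
        System.out.println("writeUTF fails: " + writeUtfFails(big));
        System.out.println("round trip ok: " + big.equals(roundTrip(big)));
        // writeShort() keeps only the low 16 bits of its argument.
        System.out.println("70,000 stored via writeShort: " + (70_000 & 0xFFFF));
    }
}
```

The last line of main() relates to the writeShort() question: writeShort() stores only the low 16 bits of its argument, so a length above 65,535 cannot be recovered from a two-byte prefix. Whether that is what causes the read failure reported in this thread is not established here.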