Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Zowalla, Richard Sun, 17 Apr 2022 09:33:38 -0700

Hi Jeff,

reading the stack trace myself now, I think that an outdated snapshot
was included for this test (as the trace doesn't fit the code).


I will report back if this is the case and Maven / Gradle / the IDE did
something weird.

Sorry and regards
Richard

On Sunday, 17.04.2022 at 16:26 +0000, Zowalla, Richard wrote:
> Hi Jeff,
> 
> the task completed and we have some feedback.
> 
> My colleague directly commented in the related commit [1].
> 
> Writing the model seems to work but reading the resulting model
> fails.
> 
> Regards
> Richard
> 
> [1] 
> https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702#commitcomment-71463963
> 
> On Tuesday, 12.04.2022 at 14:40 +0000, Zowalla, Richard wrote:
> > Hi Jeff,
> > 
> > thanks for the update.
> > 
> > We will give the change a try with a SNAPSHOT build including the
> > potential patch and start a run on the cluster with the Tübingen
> > Wikipedia Treebank. I guess we will have feedback regarding
> > writeShort(...) in ~48 hours.
> > 
> > Regards
> > Richard 
> > 
> > On Tuesday, 12.04.2022 at 08:09 -0500, Jeff Zemerick wrote:
> > > Luckily, this looks like a problem with writeUTF() that has been
> > > common for years [1]. According to other guidance and the method's
> > > javadocs [2], writeUTF() writes the number of bytes to follow as a
> > > two-byte length, followed by the encoded string.
> > > Changing it to manually write the length of the string followed by
> > > write() allows the training to succeed. All unit tests pass, and
> > > this seems to indicate it would be backward compatible because of
> > > unit tests that load models in src/test/resources/, but I want to
> > > verify that more to be sure.
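
The approach described above (write the length yourself, then the raw bytes with write()) can be sketched roughly as follows. This is only an illustration, not the code from the OPENNLP-1366 branch: the class and method names are hypothetical, and it assumes a four-byte writeInt() length prefix and standard UTF-8 rather than the writeShort() prefix mentioned later in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP.
class LongStringIo {

    /** Writes a four-byte length prefix followed by the UTF-8 bytes of the string. */
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length); // assumption: int prefix, so up to Integer.MAX_VALUE bytes
        out.write(bytes);
    }

    /** Reads a string that was written by writeLongString(). */
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes); // blocks until all bytes are read or throws EOFException
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Writes the string to an in-memory buffer and reads it back. */
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongString(out, s);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return readLongString(in);
        }
    }

    public static void main(String[] args) throws IOException {
        char[] chars = new char[100_000]; // larger than writeUTF() can handle
        Arrays.fill(chars, 'a');
        String big = new String(chars);
        System.out.println("round trip ok: " + big.equals(roundTrip(big)));
    }
}
```

Note that a format like this is not readable with readUTF(): the prefix width differs, and writeUTF() uses modified UTF-8 rather than standard UTF-8. Existing models would therefore need a separate read path or a format marker.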
> > > 
> > > Here are the changes:
> > > https://github.com/apache/opennlp/compare/master...jzonthemtn:OPENNLP-1366?expand=1
> > > 
> > > I am unsure about using writeShort() to write the length of the
> > > string. Even though it works for the UD data now, does it actually
> > > resolve the problem?
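
One way to look at the writeShort() question: DataOutputStream.writeShort(int) stores only the low 16 bits of its argument, so a length above 65535 would not survive a round trip. The sketch below shows this in isolation. The class name is hypothetical, and reading the value back with readUnsignedShort() is an assumption; the thread does not say how the patch reads the length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class, not part of OpenNLP.
class ShortLengthDemo {

    /** Writes a length with writeShort() and reads it back as an unsigned 16-bit value. */
    static int lengthAfterShortRoundTrip(int length) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeShort(length); // only the two low-order bytes are written
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readUnsignedShort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lengthAfterShortRoundTrip(65_535)); // prints 65535
        System.out.println(lengthAfterShortRoundTrip(70_000)); // prints 4464 (70000 - 65536)
    }
}
```

If that holds for the patch as well, a two-byte prefix would keep the same 65535-byte ceiling as writeUTF() and silently truncate longer lengths instead of throwing, whereas a wider prefix such as writeInt() would lift the limit.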
> > > 
> > > Does anyone have any insights into this?
> > > 
> > > Thanks,
> > > Jeff
> > > 
> > > [1]
> > > https://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> > > [2]
> > > https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)
> > > 
> > > On Mon, Apr 11, 2022 at 1:41 PM Jeff Zemerick <[email protected]>
> > > wrote:
> > > 
> > > > Great, thanks. I was able to reproduce the problem. I'll take a
> > > > look and keep this thread updated.
> > > > 
> > > > Thanks,
> > > > Jeff
> > > > 
> > > > On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard <
> > > > [email protected]> wrote:
> > > > 
> > > > > Hi Jeff,
> > > > > 
> > > > > thanks for the quick reply. Here it is:
> > > > > https://issues.apache.org/jira/browse/OPENNLP-1366
> > > > > 
> > > > > Using the treebank from Tübingen might not be feasible as it
> > > > > consumes around 2 TB RAM ;) - the link mentioned in the ticket
> > > > > points to a smaller dataset, which should reproduce the issue
> > > > > with a feasible amount of RAM.
> > > > > 
> > > > > It basically boils down to a size limitation in the JDK's
> > > > > DataOutputStream.
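
The size limitation can be reproduced in isolation. writeUTF() stores the encoded length in two bytes, so it throws UTFDataFormatException once the encoded string exceeds 65535 bytes. The following is a minimal sketch with a hypothetical class name; it uses plain ASCII strings, not the actual model data from the ticket.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Hypothetical demo class, not part of OpenNLP.
class WriteUtfLimitDemo {

    /** Returns true if writeUTF() accepts a string of the given number of ASCII characters. */
    static boolean canWriteUtf(int length) throws IOException {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a'); // each ASCII character encodes to exactly one byte
        String s = new String(chars);
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // thrown when the encoded form is longer than 65535 bytes
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("65535 bytes accepted: " + canWriteUtf(65_535));
        System.out.println("65536 bytes accepted: " + canWriteUtf(65_536));
    }
}
```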
> > > > > 
> > > > > Regards
> > > > > Richard
> > > > > 
> > > > > On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> > > > > > Hi Richard,
> > > > > > 
> > > > > > Thanks for reporting this. A Jira issue with steps to reproduce
> > > > > > it would be fantastic.
> > > > > > https://issues.apache.org/jira/projects/OPENNLP
> > > > > > 
> > > > > > Please create one and reply back here with its ID once you do.
> > > > > > I can take a look and see what can be done.
> > > > > > 
> > > > > > Thanks,
> > > > > > Jeff
> > > > > > 
> > > > > > On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <
> > > > > > [email protected]> wrote:
> > > > > > 
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > we are working on training a large OpenNLP maxent model for
> > > > > > > lemmatizing German texts. We use a Wikipedia treebank from
> > > > > > > Tübingen.
> > > > > > > 
> > > > > > > This works fine for mid-size corpora (it just needs a little
> > > > > > > bit of RAM and time). However, with larger corpora we are
> > > > > > > running into the exception mentioned in [1]. Debugging into
> > > > > > > the DataOutputStream reveals that this is a limitation of
> > > > > > > java.io.DataOutputStream.
> > > > > > > 
> > > > > > > Do we have any chance to solve this, or do we need to
> > > > > > > implement custom readers / writers in order to get it to work?
> > > > > > > 
> > > > > > > If this is a general problem for large corpora, I am also
> > > > > > > happy to create a related ticket / issue in Jira with steps
> > > > > > > to reproduce ;)
> > > > > > > 
> > > > > > > Thanks in advance.
> > > > > > > 
> > > > > > > Regards
> > > > > > > Richard
> > > > > > > 
> > > > > > > [1]
> > > > > > > https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> > > > > > > --
> > > > > > > Richard Zowalla, M.Sc.
> > > > > > > Research Associate, PhD Student | Medical Informatics
> > > > > > > 
> > > > > > > Hochschule Heilbronn – University of Applied Sciences
> > > > > > > Max-Planck-Str. 39
> > > > > > > D-74081 Heilbronn
> > > > > > > phone: +49 7131 504 6791 (currently not reachable by phone)
> > > > > > > mail: [email protected]
> > > > > > > web: https://www.mi.hs-heilbronn.de/
> > > > > > > 
> > > > > --
> > > > > Richard Zowalla, M.Sc.
> > > > > Research Associate, PhD Student | Medical Informatics
> > > > > 
> > > > > Hochschule Heilbronn – University of Applied Sciences
> > > > > Max-Planck-Str. 39
> > > > > D-74081 Heilbronn
> > > > > phone: +49 7131 504 6791 (currently not reachable by phone)
> > > > > mail: [email protected]
> > > > > web: https://www.mi.hs-heilbronn.de/
> > > > >
