Hi Jeff, the task completed and we have some feedback.
My colleague commented directly on the related commit [1]. Writing the model seems to work, but reading the resulting model fails.

Regards
Richard

[1] https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702#commitcomment-71463963

On Tuesday, 12.04.2022 at 14:40 +0000, Zowalla, Richard wrote:
> Hi Jeff,
>
> thanks for the update.
>
> We will give the change a try with a SNAPSHOT build including the
> potential patch and start a run on the cluster with the Tübingen
> Wikipedia Treebank. We expect to have feedback regarding
> writeShort(...) in ~48 hours.
>
> Regards
> Richard
>
> On Tuesday, 12.04.2022 at 08:09 -0500, Jeff Zemerick wrote:
> > Luckily, this looks like a common, long-standing problem [1] with
> > writeUTF(). Following other guidance and the method's Javadoc [2],
> > writeUTF() first writes the number of encoded bytes, followed by
> > the string itself. Changing it to manually write the length of the
> > string, followed by write(), allows the training to succeed. All
> > unit tests pass, and since some of them load models from
> > src/test/resources/, this seems to indicate the change is backward
> > compatible, but I want to verify that more thoroughly to be sure.
> >
> > Here are the changes:
> > https://github.com/apache/opennlp/compare/master...jzonthemtn:OPENNLP-1366?expand=1
> >
> > I am unsure about the writeShort() method for writing the length
> > of the string. Even though it works for the UD data now, does it
> > actually resolve the problem?
> >
> > Does anyone have any insights into this?
> >
> > Thanks,
> > Jeff
> >
> > [1] https://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> > [2] https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)
> >
> > On Mon, Apr 11, 2022 at 1:41 PM Jeff Zemerick <[email protected]> wrote:
> >
> > > Great, thanks. I was able to reproduce the problem.
> > > I'll take a look and keep this thread updated.
> > >
> > > Thanks,
> > > Jeff
> > >
> > > On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard <[email protected]> wrote:
> > >
> > > > Hi Jeff,
> > > >
> > > > thanks for the quick reply. Here it is:
> > > > https://issues.apache.org/jira/browse/OPENNLP-1366
> > > >
> > > > Using the treebank from Tübingen might not be feasible, as it
> > > > consumes around 2 TB of RAM ;) - the link mentioned in the
> > > > ticket points to a smaller dataset, which should reproduce the
> > > > issue with a feasible amount of RAM.
> > > >
> > > > It basically boils down to a size limitation in the JDK's
> > > > DataOutputStream.
> > > >
> > > > Regards
> > > > Richard
> > > >
> > > > On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> > > > > Hi Richard,
> > > > >
> > > > > Thanks for reporting this. A Jira issue with steps to
> > > > > reproduce it would be fantastic.
> > > > > https://issues.apache.org/jira/projects/OPENNLP
> > > > >
> > > > > Please create one and reply back here with its ID once you
> > > > > do. I can take a look and see what can be done.
> > > > >
> > > > > Thanks,
> > > > > Jeff
> > > > >
> > > > > On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > we are working on training a large OpenNLP maxent model for
> > > > > > lemmatizing German texts. We use a Wikipedia treebank from
> > > > > > Tübingen.
> > > > > >
> > > > > > This works fine for mid-size corpora (it just needs a bit
> > > > > > of RAM and time). However, we are running into the
> > > > > > exception mentioned in [1]. Debugging into the
> > > > > > DataOutputStream reveals that this is a limitation of
> > > > > > java.io.DataOutputStream.
> > > > > >
> > > > > > Do we have any chance to solve this, or do we need to
> > > > > > implement custom readers/writers in order to get it to
> > > > > > work?
> > > > > >
> > > > > > If this is a general problem for large corpora, I am also
> > > > > > happy to create a related ticket/issue in Jira with steps
> > > > > > to reproduce ;)
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Regards
> > > > > > Richard
> > > > > >
> > > > > > [1] https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> > > > > >
> > > > > > --
> > > > > > Richard Zowalla, M.Sc.
> > > > > > Research Associate, PhD Student | Medical Informatics
> > > > > >
> > > > > > Hochschule Heilbronn – University of Applied Sciences
> > > > > > Max-Planck-Str. 39
> > > > > > D-74081 Heilbronn
> > > > > > phone: +49 7131 504 6791 (currently not reachable by phone)
> > > > > > mail: [email protected]
> > > > > > web: https://www.mi.hs-heilbronn.de/
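For anyone following along, the writeUTF() limit and the manual length-prefix workaround discussed in this thread can be reproduced in isolation. The sketch below is illustrative only (the class and method names are made up, and it writes the length with writeInt to avoid any 16-bit cap); it is not the actual OPENNLP-1366 patch:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Minimal sketch of the writeUTF() 65535-byte limit and a
// length-prefixed workaround. Illustrative only, not OpenNLP code.
public class UtfLimitSketch {

    // writeUTF() throws UTFDataFormatException once the encoded
    // string exceeds 65535 bytes.
    static boolean writeUtfThrows(String s) {
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Workaround: write the byte length as a 4-byte int, then the raw
    // UTF-8 bytes; read them back symmetrically.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length); // int, not short: no 65535-byte cap
        out.write(bytes);
    }

    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeLongString(new DataOutputStream(buf), s);
            return readLongString(new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String big = "x".repeat(70_000); // > 65535 bytes in UTF-8
        System.out.println("writeUTF threw: " + writeUtfThrows(big));
        System.out.println("round trip ok: " + big.equals(roundTrip(big)));
    }
}
```

Note that reading must mirror writing exactly: a model written with a length prefix can only be loaded by a reader that expects the same prefix, which is why backward compatibility with existing models needs checking.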

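On the open writeShort() question: a short holds only 16 bits, so a length above 65535 is silently truncated on write. Writing would then appear to succeed while the reader sees a much smaller length, which could explain a model that writes fine but fails to load. A standalone sketch (illustrative names, not OpenNLP code):

```java
import java.io.*;

// Sketch: writeShort() keeps only the low 16 bits of its argument, so
// a string length above 65535 is silently truncated and the reader
// would try to read far fewer bytes than were written.
public class WriteShortTruncation {

    // Write a length with writeShort(), read it back with
    // readUnsignedShort(), and return what the reader sees.
    static int roundTripLength(int len) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeShort(len);
            return new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))
                    .readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripLength(1_000));  // fits in 16 bits: 1000
        System.out.println(roundTripLength(70_000)); // 70000 & 0xFFFF = 4464
    }
}
```

So even if writeShort() happens to work for the UD data, it would break for any single string whose encoded length exceeds 65535 bytes; a 4-byte writeInt() prefix avoids that for any realistic string size.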