Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Jeff Zemerick Mon, 18 Apr 2022 04:55:51 -0700

Thanks for trying it and for all the info! I will check it out and let you
know.


Thanks,
Jeff

On Sun, Apr 17, 2022 at 12:51 PM Zowalla, Richard <
[email protected]> wrote:

> Hi Jeff,
>
> My colleague repeated the validation, and it turned out that the IDE had
> used an older version of OpenNLP.
>
> After a clean build with the freshly created SNAPSHOT, loading the model
> resulted in a different exception (which now looks plausible to me).
>
> He updated his comment in [1]. Maybe you have an idea :)
>
> Thanks
> Richard
>
> [1]
>
> https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702
>
> On Sunday, 17.04.2022 at 16:33 +0000, Zowalla, Richard wrote:
> > Hi Jeff,
> >
> > having now read the stack trace myself, I think an outdated snapshot
> > was used for this test (the trace doesn't match the code).
> >
> > I will report back if this is the case and Maven / Gradle / the IDE
> > did something weird.
> >
> > Sorry & regards
> > Richard
> >
> > On Sunday, 17.04.2022 at 16:26 +0000, Zowalla, Richard wrote:
> > > Hi Jeff,
> > >
> > > the task completed and we have some feedback.
> > >
> > > My colleague directly commented in the related commit [1].
> > >
> > > Writing the model seems to work but reading the resulting model
> > > fails.
> > >
> > > Regards
> > > Richard
> > >
> > > [1]
> > >
> > > https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702#commitcomment-71463963
> > >
> > > On Tuesday, 12.04.2022 at 14:40 +0000, Zowalla, Richard wrote:
> > > > Hi Jeff,
> > > >
> > > > thanks for the update.
> > > >
> > > > We will give the change a try with a SNAPSHOT build including the
> > > > potential patch and start a run on the cluster with the Tübingen
> > > > Wikipedia Treebank. I guess we will have feedback regarding
> > > > writeShort(...) in ~ 48 hours.
> > > >
> > > > Regards
> > > > Richard
> > > >
> > > > On Tuesday, 12.04.2022 at 08:09 -0500, Jeff Zemerick wrote:
> > > > > Luckily, this looks like a problem with writeUTF() that has been
> > > > > common for years [1]. According to other guidance and the
> > > > > method's javadocs [2], writeUTF() writes the number of encoded
> > > > > bytes as a two-byte length, followed by the string, which caps the
> > > > > encoded string at 65535 bytes. Changing the code to write the
> > > > > length of the string manually, followed by write(), allows the
> > > > > training to succeed. All unit tests pass, including the tests that
> > > > > load models from src/test/resources/, which suggests the change is
> > > > > backward compatible, but I want to verify that further to be sure.
> > > > >
> > > > > Here are the changes:
> > > > >
> > > > > https://github.com/apache/opennlp/compare/master...jzonthemtn:OPENNLP-1366?expand=1
> > > > >
> > > > > I am unsure about using writeShort() to write the length of the
> > > > > string. Even though it works for the UD data now, does that
> > > > > actually resolve the problem?
> > > > >
> > > > > Anyone have any insights into this?
> > > > >
> > > > > Thanks,
> > > > > Jeff
> > > > >
> > > > > [1]
> > > > > https://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> > > > > [2]
> > > > > https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)
> > > > >
> > > > > On Mon, Apr 11, 2022 at 1:41 PM Jeff Zemerick <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Great, thanks. I was able to reproduce the problem. I'll take a
> > > > > > look and keep this thread updated.
> > > > > >
> > > > > > Thanks,
> > > > > > Jeff
> > > > > >
> > > > > > On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Hi Jeff,
> > > > > > >
> > > > > > > thanks for the quick reply. Here it is:
> > > > > > > https://issues.apache.org/jira/browse/OPENNLP-1366
> > > > > > >
> > > > > > > Using the treebank from Tübingen might not be feasible, as it
> > > > > > > consumes around 2 TB of RAM ;) - the link mentioned in the
> > > > > > > ticket points to a smaller dataset, which should reproduce the
> > > > > > > issue with a feasible amount of RAM.
> > > > > > >
> > > > > > > It basically boils down to a size limitation in the JDK's
> > > > > > > DataOutputStream.
> > > > > > >
> > > > > > > Regards
> > > > > > > Richard
> > > > > > >
> > > > > > > On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> > > > > > > > Hi Richard,
> > > > > > > >
> > > > > > > > Thanks for reporting this. A Jira issue with steps to
> > > > > > > > reproduce it would be fantastic.
> > > > > > > > https://issues.apache.org/jira/projects/OPENNLP
> > > > > > > >
> > > > > > > > Please create one and reply back here with its ID once you
> > > > > > > > do. I can take a look and see what can be done.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jeff
> > > > > > > >
> > > > > > > > On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <
> > > > > > > > [email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > we are working on training a large OpenNLP maxent model
> > > > > > > > > for lemmatizing German texts. We use a Wikipedia treebank
> > > > > > > > > from Tübingen.
> > > > > > > > >
> > > > > > > > > This works fine for mid-size corpora (it just needs a
> > > > > > > > > little bit of RAM and time). However, we are running into
> > > > > > > > > the exception mentioned in [1]. Debugging into the
> > > > > > > > > DataOutputStream reveals that this is a limitation of
> > > > > > > > > java.io.DataOutputStream.
> > > > > > > > >
> > > > > > > > > Do we have any chance of solving this, or do we need to
> > > > > > > > > implement custom readers / writers to get it to work?
> > > > > > > > >
> > > > > > > > > If this is a general problem for large corpora, I am also
> > > > > > > > > happy to create a related ticket / issue in Jira with
> > > > > > > > > steps to reproduce ;)
> > > > > > > > >
> > > > > > > > > Thanks in advance.
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > > Richard
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > > https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> > > > > > > > > --
> > > > > > > > > Richard Zowalla, M.Sc.
> > > > > > > > > Research Associate, PhD Student | Medical Informatics
> > > > > > > > >
> > > > > > > > > Hochschule Heilbronn – University of Applied Sciences
> > > > > > > > > Max-Planck-Str. 39
> > > > > > > > > D-74081 Heilbronn
> > > > > > > > > phone: +49 7131 504 6791 (currently not reachable by
> > > > > > > > > phone)
> > > > > > > > > mail: [email protected]
> > > > > > > > > web: https://www.mi.hs-heilbronn.de/
> > > > > > > > >
>
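
For reference, the writeUTF() limit discussed in this thread and the
length-prefix workaround can be sketched as follows. This is a minimal
sketch under stated assumptions, not the actual OPENNLP-1366 patch: the
class LengthPrefixedStrings and all of its method names are hypothetical,
and the 4-byte length prefix (writeInt) is an assumption chosen for
illustration. It also shows what a 2-byte prefix written with writeShort()
reads back as for a length above 65535, which is the concern raised above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of OpenNLP.
class LengthPrefixedStrings {

    // Assumption: a 4-byte length prefix followed by plain UTF-8 bytes.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Counterpart to writeLongString(): reads the length, then the bytes.
    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if DataOutputStream.writeUTF() rejects the string.
    static boolean writeUtfFails(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true; // encoded form is longer than 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a string with the sketch format and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeLongString(out, s);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return readLongString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Shows what a length reads back as when stored via writeShort().
    static int lengthViaShort(int length) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                out.writeShort(length); // keeps only the low 16 bits
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return in.readUnsignedShort();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a string of n copies of c (avoids String.repeat, Java 11+).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String big = repeat('a', 70_000);
        System.out.println(writeUtfFails(big));         // true
        System.out.println(roundTrip(big).equals(big)); // true
        System.out.println(lengthViaShort(70_000));     // 4464
    }
}
```

A 2-byte prefix keeps only the low 16 bits of the length, so on its own it
would not lift the 65535-byte ceiling. A wider prefix does, but it changes
the on-disk layout, so whether existing models remain readable would need
to be verified separately.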
