Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Jeff Zemerick Mon, 11 Apr 2022 11:41:21 -0700

Great, thanks. I was able to reproduce the problem. I'll take a look and
keep this thread updated.


Thanks,
Jeff

On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard <
richard.zowa...@hs-heilbronn.de> wrote:

> Hi Jeff,
>
> thanks for the quick reply. Here it is:
> https://issues.apache.org/jira/browse/OPENNLP-1366
>
> Using the treebank from Tübingen might not be feasible as it consumes
> around 2 TB RAM ;) - the mentioned link in the ticket points to a
> smaller dataset, which should reproduce the issue with a feasible
> amount of required RAM.
>
> It basically boils down to a size limitation in the JDK's
> DataOutputStream: writeUTF cannot encode a string longer than 65535
> bytes.
>
> Regards
> Richard
>
> On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> > Hi Richard,
> >
> > Thanks for reporting this. A Jira issue with steps to reproduce it
> > would be
> > fantastic. https://issues.apache.org/jira/projects/OPENNLP
> >
> > Please create one and reply back here with its ID once you do. I can
> > take a
> > look and see what can be done.
> >
> > Thanks,
> > Jeff
> >
> > On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <
> > richard.zowa...@hs-heilbronn.de> wrote:
> >
> > > Hi all,
> > >
> > > we are working on training a large OpenNLP maxent model for
> > > lemmatizing
> > > German texts. We use a Wikipedia treebank from Tübingen.
> > >
> > > This works fine for mid-size corpora (just need a little bit of RAM
> > > and
> > > time). However, we are running into the exception mentioned in [1].
> > > Debugging into the DataOutputStream reveals that this is a
> > > limitation
> > > of java.io.DataOutputStream.
> > >
> > > Do we have any chance to solve this or do we need to implement
> > > custom
> > > readers / writers in order to get it to work?
> > >
> > > If this is a general problem for large corpora, I am also happy to
> > > create a related ticket / issue in Jira with steps to reproduce ;)
> > >
> > > Thanks in advance.
> > >
> > > Regards
> > > Richard
> > >
> > > [1] https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> > >
> > >
> > > --
> > > Richard Zowalla, M.Sc.
> > > Research Associate, PhD Student | Medical Informatics
> > >
> > > Hochschule Heilbronn – University of Applied Sciences
> > > Max-Planck-Str. 39
> > > D-74081 Heilbronn
> > > phone: +49 7131 504 6791 (currently not reachable by phone)
> > > mail: richard.zowa...@hs-heilbronn.de
> > > web: https://www.mi.hs-heilbronn.de/
> > >
> --
> Richard Zowalla, M.Sc.
> Research Associate, PhD Student | Medical Informatics
>
> Hochschule Heilbronn – University of Applied Sciences
> Max-Planck-Str. 39
> D-74081 Heilbronn
> phone: +49 7131 504 6791 (currently not reachable by phone)
> mail: richard.zowa...@hs-heilbronn.de
> web: https://www.mi.hs-heilbronn.de/
>

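For readers following the thread, here is a minimal sketch of the limitation discussed above. It assumes only the documented behaviour of java.io.DataOutputStream.writeUTF, which stores the encoded length in two bytes and throws UTFDataFormatException when the modified UTF-8 encoding exceeds 65535 bytes. The class WriteUtfLimitDemo and the helper writeLongString are hypothetical names for illustration. The length-prefixed format is one possible workaround, not necessarily what OpenNLP does or will do in OPENNLP-1366.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteUtfLimitDemo {

    // Builds an ASCII string of the given length; each char encodes to one byte.
    static String asciiString(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    // Returns true if writeUTF accepts an ASCII string of the given length,
    // false if it rejects the string with UTFDataFormatException.
    static boolean writeUtfSucceeds(int length) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(asciiString(length));
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        }
    }

    // Hypothetical workaround (an assumption, not OpenNLP's format):
    // write a 4-byte length prefix followed by the raw UTF-8 bytes.
    static byte[] writeLongString(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("65535 bytes via writeUTF ok: " + writeUtfSucceeds(65535));
        System.out.println("65536 bytes via writeUTF ok: " + writeUtfSucceeds(65536));
        System.out.println("length-prefixed size: "
                + writeLongString(asciiString(65536)).length);
    }
}
```

Note that a length-prefixed format like this is not readable by DataInputStream.readUTF, so any reader would have to change along with the writer, and models already written with writeUTF would need separate handling.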