Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Jeff Zemerick Mon, 11 Apr 2022 07:13:20 -0700

Hi Richard,

Thanks for reporting this. A Jira issue with steps to reproduce it would be
fantastic: https://issues.apache.org/jira/projects/OPENNLP

Please create one and reply here with its ID once you do. I can take a
look and see what can be done.
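
For anyone following along, the limit described in the quoted message can
be reproduced without OpenNLP. The sketch below is only an illustration
under stated assumptions: the class WriteUtfLimitDemo and its helper
methods are hypothetical names, not OpenNLP code, and the length-prefixed
writer is just one possible alternative, not a confirmed fix. It relies on
the documented behavior of java.io.DataOutputStream.writeUTF, which stores
the encoded length in a two-byte prefix and therefore rejects any string
whose modified UTF-8 encoding exceeds 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; none of these names come from OpenNLP.
public class WriteUtfLimitDemo {

    // writeUTF stores the encoded length as an unsigned 16-bit value.
    static final int MAX_WRITE_UTF_BYTES = 65535;

    /** Builds a string consisting of n copies of c. */
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    /** Returns true if writeUTF accepts s, false if it throws UTFDataFormatException. */
    static boolean writeUtfAccepts(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // encoded form is longer than 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Assumed alternative: a 4-byte length prefix followed by standard UTF-8 bytes. */
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Reads a string that was written by writeLongString. */
    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Writes s with the length-prefixed format and reads it back. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeLongString(out, s);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return readLongString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String atLimit = repeat('a', MAX_WRITE_UTF_BYTES);
        String overLimit = repeat('a', MAX_WRITE_UTF_BYTES + 1);
        System.out.println("at limit accepted:   " + writeUtfAccepts(atLimit));
        System.out.println("over limit accepted: " + writeUtfAccepts(overLimit));
        System.out.println("round trip intact:   " + roundTrip(overLimit).equals(overLimit));
    }
}
```

The limit applies to encoded bytes rather than characters, so text with
umlauts reaches it sooner: each 'ä' takes two bytes. A length-prefixed
format like the one sketched here would also need a matching reader, so it
would presumably not be compatible with files written via writeUTF.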

Thanks,
Jeff

On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <
[email protected]> wrote:

> Hi all,
>
> we are working on training a large OpenNLP MaxEnt model for lemmatizing
> German texts. We use a Wikipedia treebank from Tübingen.
>
> This works fine for mid-sized corpora (it just needs a bit of RAM and
> time). With larger corpora, however, we run into the exception mentioned
> in [1]. Debugging into the DataOutputStream reveals that this is a
> limitation of java.io.DataOutputStream.
>
> Do we have any chance of solving this, or do we need to implement custom
> readers / writers in order to get it to work?
>
> If this is a general problem for large corpora, I am also happy to
> create a related ticket / issue in Jira with steps to reproduce ;)
>
> Thanks in advance.
>
> Regards
> Richard
>
> [1]
>
> https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
>
>
> --
> Richard Zowalla, M.Sc.
> Research Associate, PhD Student | Medical Informatics
>
> Hochschule Heilbronn – University of Applied Sciences
> Max-Planck-Str. 39
> D-74081 Heilbronn
> phone: +49 7131 504 6791 (currently not reachable by phone)
> mail: [email protected]
> web: https://www.mi.hs-heilbronn.de/
>
