Hi all,

we are working on training a large OpenNLP maxent model for lemmatizing
German texts. We use a Wikipedia treebank from Tübingen.

This works fine for mid-sized corpora (it just needs a bit of RAM and
time). For larger corpora, however, we run into the exception mentioned
in [1]. Stepping through with a debugger reveals that this is a
limitation of java.io.DataOutputStream: writeUTF() stores the encoded
string length in an unsigned 16-bit prefix, so it cannot write strings
whose encoding exceeds 65535 bytes.
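
For reference, the limit is easy to reproduce in isolation (a minimal
sketch; the class name is ours, everything else is plain JDK):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.UTFDataFormatException;

    public class WriteUtfLimitDemo {
        public static void main(String[] args) throws Exception {
            // writeUTF() stores the modified-UTF-8 byte length in an
            // unsigned 16-bit prefix, so anything over 65535 bytes fails.
            String tooLong = "a".repeat(70_000);
            DataOutputStream out =
                new DataOutputStream(new ByteArrayOutputStream());
            try {
                out.writeUTF(tooLong);
            } catch (UTFDataFormatException e) {
                // e.g. "encoded string too long: 70000 bytes"
                System.out.println("Caught: " + e.getMessage());
            }
        }
    }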

Is there any way to work around this, or do we need to implement custom
readers / writers to get it to work? If the latter, we imagine
something like the sketch below.
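
(Just a sketch of the general idea, not actual OpenNLP API —
LongStringIo and its method names are made up. The point is to replace
writeUTF()'s 2-byte length prefix with a 4-byte one:)

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Hypothetical helpers: length-prefixed strings with a 32-bit
    // length field, so there is no 65535-byte limit.
    final class LongStringIo {

        static void writeLongString(DataOutputStream out, String s)
                throws IOException {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length); // 4-byte prefix instead of 2
            out.write(bytes);
        }

        static String readLongString(DataInputStream in)
                throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }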

If this is a general problem for large corpora, I am also happy to
create an issue in Jira with steps to reproduce ;)

Thanks in advance.

Regards
Richard

[1] 
https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long


-- 
Richard Zowalla, M.Sc.
Research Associate, PhD Student | Medical Informatics

Hochschule Heilbronn – University of Applied Sciences
Max-Planck-Str. 39 
D-74081 Heilbronn 
phone: +49 7131 504 6791 (currently not reachable by phone)
mail: [email protected]
web: https://www.mi.hs-heilbronn.de/ 
