Hi all,

we are working on training a large OpenNLP maxent model for lemmatizing German texts, using a Wikipedia treebank from Tübingen. This works fine for mid-sized corpora (it just needs a bit of RAM and time). For large corpora, however, we run into the exception mentioned in [1]. Debugging into the DataOutputStream reveals that this is a limitation of java.io.DataOutputStream: writeUTF cannot serialize a string whose encoding is longer than 65535 bytes.
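To illustrate the limitation independently of OpenNLP, here is a minimal sketch (assuming, as the message in [1] suggests, that the exception originates in DataOutputStream.writeUTF, which prefixes the payload with an unsigned 16-bit byte count):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.UTFDataFormatException;

    public class WriteUtfLimit {
        public static void main(String[] args) throws Exception {
            // Any string whose modified UTF-8 encoding exceeds 65535 bytes
            // cannot be written, because the length prefix is only 16 bits.
            String s = "a".repeat(65536);
            try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
                out.writeUTF(s); // UTFDataFormatException: encoded string too long
            } catch (UTFDataFormatException e) {
                System.out.println(e.getMessage());
            }
        }
    }

If custom readers / writers are the way to go, a 32-bit length prefix would sidestep the limit. A sketch of what we have in mind (writeLongUTF / readLongUTF are hypothetical helper names, not existing OpenNLP or JDK API):

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    final class LongUtf {
        // Write the UTF-8 byte count as a 32-bit int instead of writeUTF's
        // unsigned 16-bit prefix, then write the raw bytes.
        static void writeLongUTF(DataOutputStream out, String s) throws IOException {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        static String readLongUTF(DataInputStream in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }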
Do we have any chance of solving this within OpenNLP, or do we need to implement custom readers / writers (e.g. along the lines of the sketch above) to get it working? If this is a general problem for large corpora, I am also happy to create a related ticket / issue in Jira with steps to reproduce ;)

Thanks in advance.

Regards
Richard

[1] https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long

--
Richard Zowalla, M.Sc.
Research Associate, PhD Student | Medical Informatics

Hochschule Heilbronn – University of Applied Sciences
Max-Planck-Str. 39
D-74081 Heilbronn

phone: +49 7131 504 6791 (currently not reachable by phone)
mail: [email protected]
web: https://www.mi.hs-heilbronn.de/
