Hi Richard,

Thanks for reporting this. A Jira issue with steps to reproduce it would be fantastic:
https://issues.apache.org/jira/projects/OPENNLP

Please create one and reply back here with its ID once you do. I can take a look and see what can be done.

Thanks,
Jeff

On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <richard.zowa...@hs-heilbronn.de> wrote:

> Hi all,
>
> we are working on training a large OpenNLP maxent model for lemmatizing
> German texts. We use a Wikipedia treebank from Tübingen.
>
> This works fine for mid-size corpora (it just needs a little bit of RAM and
> time). However, we are running into the exception mentioned in [1].
> Debugging into the DataOutputStream reveals that this is a limitation
> of java.io.DataOutputStream.
>
> Do we have any chance to solve this, or do we need to implement custom
> readers / writers in order to get it to work?
>
> If this is a general problem for large corpora, I am also happy to
> create a related ticket / issue in Jira with steps to reproduce ;)
>
> Thanks in advance.
>
> Regards
> Richard
>
> [1]
> https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
>
> --
> Richard Zowalla, M.Sc.
> Research Associate, PhD Student | Medical Informatics
>
> Hochschule Heilbronn – University of Applied Sciences
> Max-Planck-Str. 39
> D-74081 Heilbronn
> phone: +49 7131 504 6791 (currently not reachable by phone)
> mail: richard.zowa...@hs-heilbronn.de
> web: https://www.mi.hs-heilbronn.de/
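
For anyone following the thread, the limitation mentioned above can be reproduced outside OpenNLP. java.io.DataOutputStream.writeUTF stores the encoded length in a two-byte prefix, so any string whose encoded form exceeds 65535 bytes is rejected with a UTFDataFormatException. The sketch below shows that failure and one possible shape for a custom writer / reader pair that uses a four-byte length prefix instead. The class LongStringIo and its methods are hypothetical names chosen for illustration; they are not part of OpenNLP, and the resulting byte layout is not compatible with data written by writeUTF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only (not OpenNLP code).
public class LongStringIo {

    // Writes a 4-byte length followed by the standard UTF-8 bytes of the string.
    // Assumption: reader and writer both agree on this custom layout.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads a string previously written by writeLongString.
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if DataOutputStream.writeUTF rejects the given string.
    static boolean writeUtfFails(String s) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a string whose encoded form is longer than 65535 bytes.
        char[] chars = new char[70_000];
        Arrays.fill(chars, 'a');
        String big = new String(chars);

        System.out.println("writeUTF fails: " + writeUtfFails(big));

        // Round-trip the same string through the length-prefixed variant.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongString(out, big);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println("round trip ok: " + big.equals(readLongString(in)));
        }
    }
}
```

Whether a change along these lines is appropriate for the OpenNLP model format is a question for the Jira issue; existing model files rely on the current layout, so any new encoding would need a compatibility story.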