Hi Jeff, thanks for the quick reply. Here it is: https://issues.apache.org/jira/browse/OPENNLP-1366
Using the treebank from Tübingen might not be feasible, as it consumes around 2 TB of RAM ;) - the link mentioned in the ticket points to a smaller dataset, which should reproduce the issue with a feasible amount of RAM. It basically boils down to a size limitation in the JDK's DataOutputStream.

Regards,
Richard

On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> Hi Richard,
>
> Thanks for reporting this. A Jira issue with steps to reproduce it
> would be fantastic. https://issues.apache.org/jira/projects/OPENNLP
>
> Please create one and reply back here with its ID once you do. I can
> take a look and see what can be done.
>
> Thanks,
> Jeff
>
> On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <[email protected]> wrote:
>
> > Hi all,
> >
> > we are working on training a large OpenNLP maxent model for
> > lemmatizing German texts. We use a Wikipedia treebank from Tübingen.
> >
> > This works fine for mid-size corpora (it just needs a little bit of
> > RAM and time). However, we are running into the exception mentioned
> > in [1]. Debugging into the DataOutputStream reveals that this is a
> > limitation of java.io.DataOutputStream.
> >
> > Do we have any chance to solve this, or do we need to implement
> > custom readers / writers in order to get it to work?
> >
> > If this is a general problem for large corpora, I am also happy to
> > create a related ticket / issue in Jira with steps to reproduce ;)
> >
> > Thanks in advance.
> >
> > Regards,
> > Richard
> >
> > [1] https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> >
> > --
> > Richard Zowalla, M.Sc.
> > Research Associate, PhD Student | Medical Informatics
> >
> > Hochschule Heilbronn – University of Applied Sciences
> > Max-Planck-Str. 39
> > D-74081 Heilbronn
> > phone: +49 7131 504 6791 (currently not reachable by phone)
> > mail: [email protected]
> > web: https://www.mi.hs-heilbronn.de/

--
Richard Zowalla, M.Sc.
Research Associate, PhD Student | Medical Informatics

Hochschule Heilbronn – University of Applied Sciences
Max-Planck-Str. 39
D-74081 Heilbronn
phone: +49 7131 504 6791 (currently not reachable by phone)
mail: [email protected]
web: https://www.mi.hs-heilbronn.de/
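
P.S. For illustration, the DataOutputStream limit can be reproduced in isolation. The sketch below is not taken from OpenNLP; the class name WriteUtfLimit and the helpers repeat and fitsWriteUTF are made up for this example. It assumes that the "encoded string too long" exception comes from DataOutputStream.writeUTF, which stores the encoded length in a two-byte prefix and therefore rejects any string that needs more than 65535 bytes in modified UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of OpenNLP.
class WriteUtfLimit {

    // Builds a string consisting of n copies of the character c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Returns true if writeUTF accepts the string, false if it is rejected
    // because its modified UTF-8 encoding exceeds 65535 bytes.
    static boolean fitsWriteUTF(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 65535 ASCII characters encode to 65535 bytes: accepted.
        System.out.println(fitsWriteUTF(repeat('a', 65535)));      // true
        // One byte more no longer fits into the 2-byte length prefix.
        System.out.println(fitsWriteUTF(repeat('a', 65536)));      // false
        // 'ü' takes 2 bytes, so 32768 of them already exceed the limit.
        System.out.println(fitsWriteUTF(repeat('\u00fc', 32768))); // false
    }
}
```

Note that the limit applies to the encoded byte length and not to the character count, so German text with umlauts reaches it earlier than plain ASCII.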
