Hi Jeff,

Thanks for the quick reply. Here is the Jira issue:
https://issues.apache.org/jira/browse/OPENNLP-1366

Using the treebank from Tübingen might not be feasible, as it consumes
around 2 TB of RAM ;) - the link mentioned in the ticket points to a
smaller dataset, which should reproduce the issue with a feasible
amount of RAM.

It basically boils down to a size limitation in the JDK's
DataOutputStream. 
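
For illustration, here is a minimal sketch of that limitation. It is not
OpenNLP code: the class LongUtfDemo and the helpers writeLongUTF /
readLongUTF are hypothetical names made up for this mail.
DataOutputStream.writeUTF stores the encoded length in two bytes, so it
throws a UTFDataFormatException once a string needs more than 65535
bytes. One possible workaround, assuming reader and writer can both be
changed, is an int length prefix followed by plain UTF-8 bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LongUtfDemo {

    // Hypothetical helper: 4-byte length prefix instead of writeUTF's 2-byte one.
    // Note: this writes standard UTF-8, not the modified UTF-8 used by writeUTF,
    // so the result is NOT readable with DataInputStream.readUTF.
    static void writeLongUTF(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Hypothetical counterpart that reads what writeLongUTF wrote.
    static String readLongUTF(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if the JDK's writeUTF rejects the string as too long.
    static boolean writeUtfFails(String s) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    // Builds an ASCII string of the given length (one byte per char in UTF-8).
    static String asciiOfLength(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    public static void main(String[] args) throws IOException {
        String big = asciiOfLength(70_000);
        System.out.println("writeUTF fails: " + writeUtfFails(big));

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongUTF(out, big);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println("round trip ok: " + big.equals(readLongUTF(in)));
        }
    }
}
```

Such a change would alter the serialized format, so existing model files
would need a fallback or a version flag; whether that is acceptable is
something to discuss in the ticket.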

Regards
Richard

On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> Hi Richard,
> 
> Thanks for reporting this. A Jira issue with steps to reproduce it
> would be
> fantastic. https://issues.apache.org/jira/projects/OPENNLP
> 
> Please create one and reply back here with its ID once you do. I can
> take a
> look and see what can be done.
> 
> Thanks,
> Jeff
> 
> On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <
> [email protected]> wrote:
> 
> > Hi all,
> > 
> > we are working on training a large OpenNLP maxent model for
> > lemmatizing
> > German texts. We use a Wikipedia treebank from Tübingen.
> > 
> > This works fine for mid-sized corpora (it just needs a bit of RAM
> > and
> > time). However, we are running into the exception mentioned in [1].
> > Debugging into the DataOutputStream reveals that this is a
> > limitation
> > of java.io.DataOutputStream.
> > 
> > Do we have any chance to solve this, or do we need to implement
> > custom
> > readers / writers in order to get it to work?
> > 
> > If this is a general problem for large corpora, I am also happy to
> > create a related ticket / issue in Jira with steps to reproduce ;)
> > 
> > Thanks in advance.
> > 
> > Regards
> > Richard
> > 
> > [1]
> > 
> > https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> > 
> > 
> > --
> > Richard Zowalla, M.Sc.
> > Research Associate, PhD Student | Medical Informatics
> > 
> > Hochschule Heilbronn – University of Applied Sciences
> > Max-Planck-Str. 39
> > D-74081 Heilbronn
> > phone: +49 7131 504 6791 (currently not reachable by phone)
> > mail: [email protected]
> > web: https://www.mi.hs-heilbronn.de/
> > 
-- 
Richard Zowalla, M.Sc.
Research Associate, PhD Student | Medical Informatics

Hochschule Heilbronn – University of Applied Sciences
Max-Planck-Str. 39 
D-74081 Heilbronn 
phone: +49 7131 504 6791 (currently not reachable by phone)
mail: [email protected]
web: https://www.mi.hs-heilbronn.de/ 
