Hi Richard,

I know it's been a while but I wanted to circle back to this to see if there are any updates.

Thanks,
Jeff

On Mon, Apr 25, 2022 at 1:48 PM Richard Eckart de Castilho <r...@apache.org> wrote:

> Hi,
>
> > On 11. Apr 2022, at 14:50, Zowalla, Richard <richard.zowa...@hs-heilbronn.de> wrote:
> >
> > This works fine for mid size corpora (just need a little bit of RAM and
> > time). However, we are running into the exception mentioned in [1].
> > Debugging into the DataOutputStream reveals, that this is a limitation
> > of the java.io.DataOutputstream.
> >
> > Do we have any chance to solve this or do we need to implement custom
> > readers / writers in order to get it work?
>
> Looking at the OpenNLP 1.9.3 code, the relevant piece seems to be this:
>
> opennlp.tools.ml.maxent.io.GISModelWriter.class
> ----
> // the mapping from predicates to the outcomes they contributed to.
> // The sorting is done so that we actually can write this out more
> // compactly than as the entire list.
> ComparablePredicate[] sorted = sortValues();
> List<List<ComparablePredicate>> compressed = compressOutcomes(sorted);
>
> writeInt(compressed.size());
>
> for (List<ComparablePredicate> aCompressed : compressed) {
>   writeUTF(aCompressed.size() + ((List<?>) aCompressed).get(0).toString());
> }
> ----
>
> opennlp.tools.ml.model.ComparablePredicate.ComparablePredicate(String, int[], double[])
> ----
> public String toString() {
>   StringBuilder s = new StringBuilder();
>   for (int outcome : outcomes) {
>     s.append(" ").append(outcome);
>   }
>   return s.toString();
> }
> ----
>
> If I read it correctly, then the UTF-8-encoded list of outcomes that a
> single ComparablePredicate contributed to is larger than 383769 bytes.
>
> I'm not familiar with the code, but it seems strange to me that such a
> long list should be valid to start with.
> Maybe set a breakpoint and check if you have any *way too long* labels or
> maybe too many labels in total?
>
> Cheers,
>
> -- Richard
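
For reference, the limit discussed in the quoted message can be reproduced outside OpenNLP. java.io.DataOutputStream.writeUTF stores the encoded length in two bytes, so it rejects any string whose encoded form exceeds 65535 bytes and throws a UTFDataFormatException. The sketch below is only an illustration under assumptions: the class WriteUtfLimitDemo and the helpers outcomeList, fitsInWriteUtf, writeLongString, readLongString and roundTrip are hypothetical names, not part of OpenNLP. outcomeList imitates the shape of ComparablePredicate.toString(), and writeLongString shows one possible length-prefixed alternative that a custom writer/reader pair could use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names exist in OpenNLP.
final class WriteUtfLimitDemo {

    // Builds " 0 1 2 ...", the same shape ComparablePredicate.toString() produces.
    static String outcomeList(int count) {
        StringBuilder s = new StringBuilder();
        for (int outcome = 0; outcome < count; outcome++) {
            s.append(" ").append(outcome);
        }
        return s.toString();
    }

    // Returns true if writeUTF accepts the string, false if it exceeds the 65535-byte limit.
    static boolean fitsInWriteUtf(String value) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(value);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        }
    }

    // Assumed alternative encoding: 4-byte length prefix followed by the UTF-8 bytes.
    static void writeLongString(DataOutputStream out, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Counterpart to writeLongString; reads the length prefix, then the payload.
    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Writes and reads a string back through the length-prefixed encoding.
    static String roundTrip(String value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongString(out, value);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return readLongString(in);
        }
    }

    public static void main(String[] args) throws IOException {
        String small = outcomeList(100);     // a few hundred bytes
        String large = outcomeList(70_000);  // several hundred thousand bytes

        System.out.println("small fits in writeUTF: " + fitsInWriteUtf(small));
        System.out.println("large fits in writeUTF: " + fitsInWriteUtf(large));
        System.out.println("large survives round trip: " + roundTrip(large).equals(large));
    }
}
```

Note that a length-prefixed encoding like this would not be readable by the existing model readers, so it would only help if both the writer and the reader were replaced together.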