Hi,
> On 11. Apr 2022, at 14:50, Zowalla, Richard <[email protected]>
> wrote:
>
> This works fine for mid-size corpora (it just needs a bit of RAM and
> time). However, we are running into the exception mentioned in [1].
> Debugging into the DataOutputStream reveals that this is a limitation
> of java.io.DataOutputStream.
>
> Do we have any chance to solve this, or do we need to implement custom
> readers / writers in order to get it to work?
Looking at the OpenNLP 1.9.3 code, the relevant piece seems to be this:
opennlp.tools.ml.maxent.io.GISModelWriter.class
----
// the mapping from predicates to the outcomes they contributed to.
// The sorting is done so that we actually can write this out more
// compactly than as the entire list.
ComparablePredicate[] sorted = sortValues();
List<List<ComparablePredicate>> compressed = compressOutcomes(sorted);

writeInt(compressed.size());
for (List<ComparablePredicate> aCompressed : compressed) {
  writeUTF(aCompressed.size() + aCompressed.get(0).toString());
}
----
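For context, writeUTF() stores the length of the encoded string in a two-byte
unsigned prefix, so any string whose UTF-8 encoding exceeds 65535 bytes makes
it throw a UTFDataFormatException. A minimal stand-alone reproduction (not
OpenNLP code, just to illustrate the limit):
----
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.UTFDataFormatException;

public class WriteUtfLimit {
  public static void main(String[] args) throws Exception {
    // UTF-8 encoding of this string is well above the 65535-byte limit.
    String tooLong = "x".repeat(70_000);
    try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
      out.writeUTF(tooLong);
    } catch (UTFDataFormatException e) {
      // Typically something like "encoded string too long: 70000 bytes".
      System.out.println(e.getMessage());
    }
  }
}
----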
opennlp.tools.ml.model.ComparablePredicate.toString()
----
public String toString() {
  StringBuilder s = new StringBuilder();
  for (int outcome : outcomes) {
    s.append(" ").append(outcome);
  }
  return s.toString();
}
----
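Given that toString(), the per-predicate entry handed to writeUTF() is the
group size followed by one " <index>" token per outcome, so its length grows
linearly with the number of outcomes. A rough illustration with a hypothetical
outcome count (not your data):
----
// Hypothetical: one predicate that contributed to ~60,000 outcomes.
int numOutcomes = 60_000;
StringBuilder s = new StringBuilder();
for (int outcome = 0; outcome < numOutcomes; outcome++) {
  s.append(" ").append(outcome);
}
String entry = "1" + s;  // group size + outcome list, as in GISModelWriter
int utf8Len = entry.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
System.out.println(utf8Len);  // roughly 350,000 bytes, far above 65535
----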
If I read it correctly, then the UTF-8-encoded list of outcomes that a single
ComparablePredicate contributed to is larger than 383769 bytes in your case,
well beyond the 65535 bytes that DataOutputStream.writeUTF() can represent in
its two-byte length prefix.
I'm not familiar with the code, but it seems strange to me that such a long
list would be valid in the first place.
Maybe set a breakpoint and check whether you have any *way too long* labels,
or simply too many labels in total?
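If it helps, a temporary check right before the writeUTF() call (variable
names taken from the GISModelWriter snippet above) would show which entries
are the problem:
----
String entry = aCompressed.size() + aCompressed.get(0).toString();
int utf8Len = entry.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
if (utf8Len > 65535) {
  // Print the offending entries instead of crashing on writeUTF().
  System.err.println("Entry too long for writeUTF: " + utf8Len + " bytes, starts with: "
      + entry.substring(0, Math.min(entry.length(), 80)));
}
----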
Cheers,
-- Richard