Hi,

> On 11. Apr 2022, at 14:50, Zowalla, Richard <[email protected]> 
> wrote:
> 
> This works fine for mid size corpora (just need a little bit of RAM and
> time). However, we are running into the exception mentioned in [1].
> Debugging into the DataOutputStream reveals that this is a limitation
> of java.io.DataOutputStream.
> 
> Do we have any chance to solve this, or do we need to implement custom
> readers / writers in order to get it to work?

Looking at the OpenNLP 1.9.3 code, the relevant piece seems to be this:

opennlp.tools.ml.maxent.io.GISModelWriter.class
----
    // the mapping from predicates to the outcomes they contributed to.
    // The sorting is done so that we actually can write this out more
    // compactly than as the entire list.
    ComparablePredicate[] sorted = sortValues();
    List<List<ComparablePredicate>> compressed = compressOutcomes(sorted);

    writeInt(compressed.size());

    for (List<ComparablePredicate> aCompressed : compressed) {
      writeUTF(aCompressed.size() + ((List<?>) aCompressed).get(0).toString());
    }
----

opennlp.tools.ml.model.ComparablePredicate.toString()
----
  public String toString() {
    StringBuilder s = new StringBuilder();
    for (int outcome : outcomes) {
      s.append(" ").append(outcome);
    }
    return s.toString();
  }
----

If I read it correctly, the string passed to writeUTF() here — the UTF-8-encoded 
list of outcomes that a single ComparablePredicate contributed to — is larger 
than 383769 bytes, while writeUTF() can only handle strings whose encoding fits 
into 65535 bytes (the length is written as an unsigned 16-bit prefix).
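
For reference, a minimal, standalone sketch (not OpenNLP code) that reproduces 
the limit:

----
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.UTFDataFormatException;

public class WriteUtfLimitDemo {
  public static void main(String[] args) throws Exception {
    // 70,000 ASCII characters encode to 70,000 bytes, which is above the
    // 65,535-byte limit of DataOutputStream.writeUTF().
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 70_000; i++) {
      sb.append('x');
    }
    try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
      out.writeUTF(sb.toString());
    } catch (UTFDataFormatException e) {
      // e.g. "encoded string too long: 70000 bytes"
      System.out.println(e.getMessage());
    }
  }
}
----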

I'm not familiar with the code, but it seems strange to me that such a long 
list should be valid to start with.
Maybe set a breakpoint and check if you have any *way too long* labels or maybe 
too many labels in total?
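
If it helps, instead of a plain breakpoint you could also temporarily drop a 
small check in front of the writeUTF() loop quoted above, to see which group 
blows the limit — just a sketch, reusing the names from that snippet, with 
65535 being the writeUTF() limit:

----
// hypothetical diagnostic, placed before the existing writeUTF() loop;
// each group encodes as: number of predicates + the outcome indices of
// the first predicate in the group
for (List<ComparablePredicate> aCompressed : compressed) {
  String encoded = aCompressed.size() + aCompressed.get(0).toString();
  int utf8Length = encoded.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
  if (utf8Length > 65535) {
    System.err.println("Outcome list too long for writeUTF(): " + utf8Length
        + " bytes; group of " + aCompressed.size() + " predicates");
  }
}
----

That should tell you quickly whether it is one huge list of outcomes or a few 
unusually long ones.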

Cheers,

-- Richard
