Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Jeff Zemerick Wed, 27 Jul 2022 14:10:02 -0700

Hi Richard,

I know it's been a while, but I wanted to circle back to this to see if
there are any updates.


Thanks,
Jeff

On Mon, Apr 25, 2022 at 1:48 PM Richard Eckart de Castilho <r...@apache.org>
wrote:

> Hi,
>
> > On 11. Apr 2022, at 14:50, Zowalla, Richard <richard.zowa...@hs-heilbronn.de> wrote:
> >
> > This works fine for mid-size corpora (it just needs a little bit of RAM
> > and time). However, we are running into the exception mentioned in [1].
> > Debugging into the DataOutputStream reveals that this is a limitation
> > of java.io.DataOutputStream.
> >
> > Do we have any chance to solve this, or do we need to implement custom
> > readers / writers in order to get it to work?
>
> Looking at the OpenNLP 1.9.3 code, the relevant piece seems to be this:
>
> opennlp.tools.ml.maxent.io.GISModelWriter.class
> ----
>     // the mapping from predicates to the outcomes they contributed to.
>     // The sorting is done so that we actually can write this out more
>     // compactly than as the entire list.
>     ComparablePredicate[] sorted = sortValues();
>     List<List<ComparablePredicate>> compressed = compressOutcomes(sorted);
>
>     writeInt(compressed.size());
>
>     for (List<ComparablePredicate> aCompressed : compressed) {
>       writeUTF(aCompressed.size()
>           + ((List<?>) aCompressed).get(0).toString());
>     }
> ----
>
> opennlp.tools.ml.model.ComparablePredicate.ComparablePredicate(String,
> int[], double[])
> ----
>   public String toString() {
>     StringBuilder s = new StringBuilder();
>     for (int outcome : outcomes) {
>       s.append(" ").append(outcome);
>     }
>     return s.toString();
>   }
> ----
>
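> If I read it correctly, then the UTF-8-encoded list of outcomes that a
> single ComparablePredicate contributed to is larger than 383769 bytes.
> That is well beyond the 65535 bytes that DataOutputStream.writeUTF can
> encode for a single string.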
>
> I'm not familiar with the code, but it seems strange to me that such a
> long list should be valid to start with.
> Maybe set a breakpoint and check if you have any *way too long* labels or
> maybe too many labels in total?
>
> Cheers,
>
> -- Richard
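
To make the limit discussed above concrete, here is a minimal sketch. It
first shows that java.io.DataOutputStream.writeUTF rejects a string built
the same way as ComparablePredicate.toString() once the encoded form
exceeds 65535 bytes, and then round-trips the same string with a 4-byte
length prefix instead. The class LongStringIo and the helpers
writeLongString, readLongString and outcomeList are hypothetical names
made up for this example; they are not part of OpenNLP, and this is not
necessarily how the project would or should fix the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

public class LongStringIo {

  // writeUTF stores the encoded length in an unsigned 16-bit field.
  public static final int WRITE_UTF_MAX_BYTES = 65535;

  /** Returns true if DataOutputStream.writeUTF rejects the given string. */
  public static boolean writeUtfFails(String s) throws IOException {
    try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
      out.writeUTF(s);
      return false;
    } catch (UTFDataFormatException e) {
      return true;
    }
  }

  /** Hypothetical alternative: 4-byte length followed by the raw UTF-8 bytes. */
  public static void writeLongString(DataOutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  /** Reads a string written by writeLongString. */
  public static String readLongString(DataInputStream in) throws IOException {
    int length = in.readInt();
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  /** Builds a string shaped like ComparablePredicate.toString(): " 0 1 2 ...". */
  public static String outcomeList(int count) {
    StringBuilder s = new StringBuilder();
    for (int outcome = 0; outcome < count; outcome++) {
      s.append(" ").append(outcome);
    }
    return s.toString();
  }

  public static void main(String[] args) throws IOException {
    // 60000 outcomes is an arbitrary size chosen to exceed the limit.
    String longList = outcomeList(60000);
    int encoded = longList.getBytes(StandardCharsets.UTF_8).length;
    System.out.println("encoded bytes: " + encoded);
    System.out.println("writeUTF fails: " + writeUtfFails(longList));

    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      writeLongString(out, longList);
    }
    try (DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
      System.out.println("round trip ok: " + longList.equals(readLongString(in)));
    }
  }
}
```

Note that a length-prefixed layout like this is not readable with
DataInputStream.readUTF, so a writer changed this way would need a
matching reader, and models written by it could not be loaded by the
existing code.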
