Thanks for the PR! I just merged it. I'm glad this will be in the 2.1 release which should go out for vote next week.
Thanks,
Jeff

On Tue, Oct 25, 2022 at 2:31 AM Richard Zowalla <r...@apache.org> wrote:
> Hi,
>
> here is a PR by my colleague Martin W.:
> https://github.com/apache/opennlp/pull/427
>
> Some more details are contained in
> https://issues.apache.org/jira/browse/OPENNLP-1366
>
> The change is tested with the huge corpus on the HPC system.
>
> Regards,
> Richard Z
>
> On Friday, 14.10.2022 at 08:18 +0200, Richard Zowalla wrote:
> > Hi Jeff,
> >
> > just a short notice on this one:
> >
> > My colleague, who is affected by this, is preparing a PR (it might
> > take some time because of testing on the HPC system...), which will
> > hopefully solve reading/writing "large" models without breaking
> > existing ones in the process.
> >
> > Regards,
> > Richard Z
> >
> > On Thursday, 28.07.2022 at 12:13 +0000, Zowalla, Richard wrote:
> > > Hi Jeff,
> > >
> > > no real updates from our side. We were quite busy in the last
> > > weeks finishing and correcting student course work ;)
> > >
> > > My last status in this matter is:
> > >
> > > The change from writeUTF to writeShort worked. Training and
> > > writing the MaxEnt model just worked for this huge corpus. No
> > > (runtime) errors were logged.
> > >
> > > However, loading the resulting binary file failed with
> > >
> > > java.lang.NumberFormatException: For input string: "24178�A1"
> > >
> > > at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
> > > at java.base/java.lang.Integer.parseInt(Integer.java:668)
> > > at java.base/java.lang.Integer.parseInt(Integer.java:786)
> > > at opennlp.tools.ml.model.AbstractModelReader.getOutcomePatterns(AbstractModelReader.java:106)
> > > at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:76)
> > > at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:62)
> > > at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:85)
> > > at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:32)
> > > at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:29)
> > > at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:312)
> > > at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:242)
> > > at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:176)
> > > at opennlp.tools.lemmatizer.LemmatizerModel.<init>(LemmatizerModel.java:74)
> > >
> > > I don't know if this was caused by the change from writeUTF/readUTF
> > > to writeShort/readShort, but it looks a bit odd. It might be
> > > data-set related, but I have no idea.
> > >
> > > I don't know the size of the binary (yet), but if it helps, I can
> > > ask my colleague if we can share the related model and upload it
> > > somewhere, if it is not too large.
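The NumberFormatException above is consistent with the outcome patterns being stored as text and parsed token by token with Integer.parseInt. The sketch below is not the actual OpenNLP code; the class and method names are made up for illustration. It reproduces the pattern format implied by GISModelWriter and AbstractModelReader.getOutcomePatterns (a count concatenated with a space-separated outcome list) and shows how a single corrupted byte in a stored pattern surfaces as exactly this exception:

```java
import java.util.Arrays;

public class OutcomePatternDemo {

    // Writer side (sketch): the pattern line is the outcome count followed
    // by the space-separated outcome ids, e.g. {0, 5, 9} -> "3 0 5 9".
    static String encodePattern(int[] outcomes) {
        StringBuilder s = new StringBuilder().append(outcomes.length);
        for (int outcome : outcomes) {
            s.append(" ").append(outcome);
        }
        return s.toString();
    }

    // Reader side (sketch): split on spaces and parseInt every token.
    // Any corrupted byte in the stored string fails here.
    static int[] decodePattern(String line) {
        String[] tokens = line.split(" ");
        int[] pattern = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            pattern[i] = Integer.parseInt(tokens[i]);
        }
        return pattern;
    }

    public static void main(String[] args) {
        String good = encodePattern(new int[] {0, 5, 9});
        System.out.println(Arrays.toString(decodePattern(good))); // [3, 0, 5, 9]

        try {
            // A garbled token like the one in the report breaks parsing.
            decodePattern("24178\uFFFDA1 0 5");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

If the writeShort change shifted or truncated bytes anywhere in the stream, the reader could end up with garbage tokens like "24178�A1" and fail at this parseInt step, which would match the symptom reported here.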
> > >
> > > Regards,
> > > Richard
> > >
> > > On Wednesday, 27.07.2022 at 17:09 -0400, Jeff Zemerick wrote:
> > > > Hi Richard,
> > > >
> > > > I know it's been a while, but I wanted to circle back to this to
> > > > see if there are any updates.
> > > >
> > > > Thanks,
> > > > Jeff
> > > >
> > > > On Mon, Apr 25, 2022 at 1:48 PM Richard Eckart de Castilho
> > > > <r...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > On 11. Apr 2022, at 14:50, Zowalla, Richard
> > > > > > <richard.zowa...@hs-heilbronn.de> wrote:
> > > > > > This works fine for mid-size corpora (it just needs a little
> > > > > > bit of RAM and time). However, we are running into the
> > > > > > exception mentioned in [1].
> > > > > > Debugging into the DataOutputStream reveals that this is a
> > > > > > limitation of java.io.DataOutputStream.
> > > > > >
> > > > > > Do we have any chance to solve this, or do we need to
> > > > > > implement custom readers/writers in order to get it to work?
> > > > >
> > > > > Looking at the OpenNLP 1.9.3 code, the relevant piece seems to
> > > > > be this:
> > > > >
> > > > > opennlp.tools.ml.maxent.io.GISModelWriter
> > > > > ----
> > > > > // the mapping from predicates to the outcomes they contributed to.
> > > > > // The sorting is done so that we actually can write this out more
> > > > > // compactly than as the entire list.
> > > > > ComparablePredicate[] sorted = sortValues();
> > > > > List<List<ComparablePredicate>> compressed = compressOutcomes(sorted);
> > > > >
> > > > > writeInt(compressed.size());
> > > > >
> > > > > for (List<ComparablePredicate> aCompressed : compressed) {
> > > > >   writeUTF(aCompressed.size() + ((List<?>) aCompressed).get(0).toString());
> > > > > }
> > > > > ----
> > > > >
> > > > > opennlp.tools.ml.model.ComparablePredicate
> > > > > ----
> > > > > public String toString() {
> > > > >   StringBuilder s = new StringBuilder();
> > > > >   for (int outcome : outcomes) {
> > > > >     s.append(" ").append(outcome);
> > > > >   }
> > > > >   return s.toString();
> > > > > }
> > > > > ----
> > > > >
> > > > > If I read it correctly, then the UTF-8-encoded list of outcomes
> > > > > that a single ComparablePredicate contributed to is larger than
> > > > > 383769 bytes.
> > > > >
> > > > > I'm not familiar with the code, but it seems strange to me that
> > > > > such a long list should be valid to start with.
> > > > > Maybe set a breakpoint and check if you have any *way too long*
> > > > > labels, or maybe too many labels in total?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > -- Richard
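The writeUTF limitation discussed in this thread can be demonstrated in isolation. The sketch below (Java 11+) is an illustration only, not the PR's actual fix, and the helper names are made up: DataOutputStream.writeUTF throws UTFDataFormatException for any string whose UTF-8 encoding exceeds 65535 bytes (its length prefix is an unsigned short), while a 4-byte int length prefix has no such cap, provided the reader mirrors the writer exactly:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class LargeStringIo {

    // Hypothetical helper: length-prefix the UTF-8 bytes with a 4-byte int
    // instead of writeUTF's 2-byte length, so strings over 65535 bytes survive.
    static void writeLargeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length); // 4-byte length prefix
        out.write(bytes);
    }

    // Hypothetical helper: must mirror the writer exactly, otherwise the
    // stream desynchronizes and later fields read back as garbage.
    static String readLargeString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String big = "x".repeat(100_000); // 100000 UTF-8 bytes, over the cap

        // writeUTF refuses strings whose UTF-8 form exceeds 65535 bytes.
        try (DataOutputStream out = new DataOutputStream(OutputStream.nullOutputStream())) {
            out.writeUTF(big);
            System.out.println("writeUTF succeeded");
        } catch (UTFDataFormatException e) {
            System.out.println("writeUTF failed: encoded string too long");
        }

        // The int-prefixed scheme round-trips the same string.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLargeString(new DataOutputStream(buf), big);
        String back = readLargeString(
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println("round-trip ok: " + big.equals(back));
    }
}
```

Note that any such change touches both sides of the format: the writer (GISModelWriter) and the reader (GISModelReader / AbstractModelReader) have to be updated in lockstep, which would be consistent with the symptom reported earlier in this thread, where writing succeeded but reading the resulting model failed.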