Hi Jeff,

No real updates from our side. We have been quite busy over the last few
weeks finishing and grading student coursework ;)

The latest status on this matter:

The change from writeUTF to writeShort worked. Training the MaxEnt model
and writing it to disk succeeded for this huge corpus. No (runtime) errors
were logged.
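
For context, a minimal sketch of the limitation that motivated the change:
java.io.DataOutputStream.writeUTF stores the encoded length in two bytes and
throws a UTFDataFormatException when the encoded string exceeds 65535 bytes.
The class and method names below (WriteUtfLimit, fitsInWriteUtf) are made up
for illustration and are not part of OpenNLP.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

// Hypothetical helper class, not OpenNLP code.
class WriteUtfLimit {

  // Returns true if writeUTF accepts the string, false if it rejects it
  // because the encoded form is longer than 65535 bytes.
  static boolean fitsInWriteUtf(String s) {
    try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
      out.writeUTF(s);
      return true;
    } catch (UTFDataFormatException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Mimics the " 1 2 3" style of an outcome pattern string.
    String small = " 1 2 3";

    // 20,000 repetitions of a 6-character ASCII token: 120,000 bytes.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 20_000; i++) {
      sb.append(" 12345");
    }
    String large = sb.toString();

    System.out.println(fitsInWriteUtf(small)); // prints "true"
    System.out.println(fitsInWriteUtf(large)); // prints "false"
  }
}
```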

However, loading the resulting binary file failed with:

java.lang.NumberFormatException: For input string: "24178�A1"

        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
        at java.base/java.lang.Integer.parseInt(Integer.java:668)
        at java.base/java.lang.Integer.parseInt(Integer.java:786)
        at opennlp.tools.ml.model.AbstractModelReader.getOutcomePatterns(AbstractModelReader.java:106)
        at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:76)
        at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:62)
        at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:85)
        at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:32)
        at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:29)
        at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:312)
        at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:242)
        at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:176)
        at opennlp.tools.lemmatizer.LemmatizerModel.<init>(LemmatizerModel.java:74)

I don't know whether this was caused by the change from writeUTF / readUTF
to writeShort / readShort, but it looks a bit odd. It might be related to
the data set, but I have no idea at this point.
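
One thing worth ruling out is an asymmetry between the writing and the
reading side: if the two do not use the same framing, later reads are
misaligned, which is one way (an assumption, not a diagnosis) stray bytes
could end up in a string handed to Integer.parseInt. Below is a minimal
sketch of a symmetric pair that uses a 4-byte length prefix followed by
UTF-8 bytes. LongStringCodec, writeLongString and readLongString are
hypothetical names, not OpenNLP API, and the format is not compatible with
files written via writeUTF (which uses a 2-byte length and modified UTF-8).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not OpenNLP code.
class LongStringCodec {

  // Writes a 4-byte length followed by the UTF-8 bytes of the string.
  static void writeLongString(DataOutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  // Mirror image of writeLongString: reads the length, then exactly that
  // many bytes, and decodes them as UTF-8.
  static String readLongString(DataInputStream in) throws IOException {
    int length = in.readInt();
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Writes the string to an in-memory buffer and reads it back.
  static String roundTrip(String s) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        writeLongString(out, s);
      }
      try (DataInputStream in =
               new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        return readLongString(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // 100,000 repetitions of " 7": 200,000 bytes, too long for writeUTF.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100_000; i++) {
      sb.append(" 7");
    }
    String pattern = sb.toString();
    System.out.println(roundTrip(pattern).equals(pattern)); // prints "true"
  }
}
```

The important property is only that the reader consumes exactly what the
writer produced; any other length-prefixed framing would do as well.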

I don't know the size of the binary (yet), but if it helps, I can ask
my colleague whether we can share the model and upload it somewhere,
provided it is not too large.

Regards
Richard


On Wednesday, 27.07.2022 at 17:09 -0400, Jeff Zemerick wrote:
> Hi Richard,
> 
> I know it's been a while but I wanted to circle back to this to see if
> there are any updates.
> 
> Thanks,
> Jeff
> 
> On Mon, Apr 25, 2022 at 1:48 PM Richard Eckart de Castilho <[email protected]> wrote:
> 
> > Hi,
> > 
> > > On 11. Apr 2022, at 14:50, Zowalla, Richard <[email protected]> wrote:
> > > This works fine for mid-size corpora (it just needs a little bit of RAM
> > > and time). However, we are running into the exception mentioned in [1].
> > > Debugging into the DataOutputStream reveals that this is a limitation
> > > of java.io.DataOutputStream.
> > > 
> > > Do we have any chance to solve this, or do we need to implement custom
> > > readers / writers in order to get it to work?
> > 
> > Looking at the OpenNLP 1.9.3 code, the relevant piece seems to be this:
> > 
> > opennlp.tools.ml.maxent.io.GISModelWriter.class
> > ----
> >     // the mapping from predicates to the outcomes they contributed to.
> >     // The sorting is done so that we actually can write this out more
> >     // compactly than as the entire list.
> >     ComparablePredicate[] sorted = sortValues();
> >     List<List<ComparablePredicate>> compressed = compressOutcomes(sorted);
> > 
> >     writeInt(compressed.size());
> > 
> >     for (List<ComparablePredicate> aCompressed : compressed) {
> >       writeUTF(aCompressed.size() + ((List<?>) aCompressed).get(0).toString());
> >     }
> > ----
> > 
> > opennlp.tools.ml.model.ComparablePredicate.ComparablePredicate(String, int[], double[])
> > ----
> >   public String toString() {
> >     StringBuilder s = new StringBuilder();
> >     for (int outcome : outcomes) {
> >       s.append(" ").append(outcome);
> >     }
> >     return s.toString();
> >   }
> > ----
> > 
> > If I read it correctly, then the UTF-8-encoded list of outcomes that a
> > single ComparablePredicate contributed to is larger than 383769 bytes.
> > 
> > I'm not familiar with the code, but it seems strange to me that such a
> > long list should be valid to start with.
> > Maybe set a breakpoint and check if you have any *way too long* labels
> > or maybe too many labels in total?
> > 
> > Cheers,
> > 
> > -- Richard
