Thanks for the PR! I just merged it. I'm glad this will be in the 2.1 release which should go out for vote next week.
Thanks,
Jeff

On Tue, Oct 25, 2022 at 2:31 AM Richard Zowalla <r...@apache.org> wrote:
> Hi,
>
> here is a PR by my colleague Martin W.:
> https://github.com/apache/opennlp/pull/427
>
> Some more details are contained in
> https://issues.apache.org/jira/browse/OPENNLP-1366
>
> The change is tested with the huge corpus on the HPC system.
>
> Regards,
> Richard Z
>
> On Friday, 14.10.2022 at 08:18 +0200, Richard Zowalla wrote:
> > Hi Jeff,
> >
> > just a short notice on this one:
> >
> > My colleague, who is affected by this, is preparing a PR (it might
> > take some time because of testing on the HPC system...), which will
> > hopefully solve reading/writing "large" models without breaking
> > existing ones in the process.
> >
> > Regards,
> > Richard Z
> >
> > On Thursday, 28.07.2022 at 12:13 +0000, Zowalla, Richard wrote:
> > > Hi Jeff,
> > >
> > > no real updates from our side. We were quite busy in the last
> > > weeks finishing and correcting student course work ;)
> > >
> > > My last status in this matter is:
> > >
> > > The change from writeUTF to writeShort worked. Training and
> > > writing the MaxEnt model just worked for this huge corpus. No
> > > (runtime) errors were logged.
> > >
> > > However, loading the resulting binary file failed with
> > >
> > > java.lang.NumberFormatException: For input string: "24178�A1"
> > >
> > > at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
> > > at java.base/java.lang.Integer.parseInt(Integer.java:668)
> > > at java.base/java.lang.Integer.parseInt(Integer.java:786)
> > > at opennlp.tools.ml.model.AbstractModelReader.getOutcomePatterns(AbstractModelReader.java:106)
> > > at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:76)
> > > at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:62)
> > > at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:85)
> > > at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:32)
> > > at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:29)
> > > at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:312)
> > > at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:242)
> > > at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:176)
> > > at opennlp.tools.lemmatizer.LemmatizerModel.<init>(LemmatizerModel.java:74)
> > >
> > > I don't know if this was caused by the change from writeUTF/readUTF
> > > to writeShort/readShort, but it looks a bit odd. It might be
> > > data-set related, but I have no idea.
> > >
> > > I don't know the size of the binary (yet), but if it helps, I can
> > > ask my colleague if we can share the related model and upload it
> > > somewhere, if it is not too large.
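The NumberFormatException above is consistent with the outcome patterns being stored as text and parsed token by token with Integer.parseInt. The sketch below is not the actual OpenNLP code; the class and method names are made up for illustration. It reproduces the pattern format implied by GISModelWriter and AbstractModelReader.getOutcomePatterns (a count concatenated with a space-separated outcome list) and shows how a single corrupted byte in a stored pattern surfaces as exactly this exception:

```java
import java.util.Arrays;

public class OutcomePatternDemo {

    // Writer side (sketch): the pattern line is the outcome count followed
    // by the space-separated outcome ids, e.g. {0, 5, 9} -> "3 0 5 9".
    static String encodePattern(int[] outcomes) {
        StringBuilder s = new StringBuilder().append(outcomes.length);
        for (int outcome : outcomes) {
            s.append(" ").append(outcome);
        }
        return s.toString();
    }

    // Reader side (sketch): split on spaces and parseInt every token.
    // Any corrupted byte in the stored string fails here.
    static int[] decodePattern(String line) {
        String[] tokens = line.split(" ");
        int[] pattern = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            pattern[i] = Integer.parseInt(tokens[i]);
        }
        return pattern;
    }

    public static void main(String[] args) {
        String good = encodePattern(new int[] {0, 5, 9});
        System.out.println(Arrays.toString(decodePattern(good))); // [3, 0, 5, 9]

        try {
            // A garbled token like the one in the report breaks parsing.
            decodePattern("24178\uFFFDA1 0 5");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

If the writeShort change shifted or truncated bytes anywhere in the stream, the reader could end up with garbage tokens like "24178�A1" and fail at this parseInt step, which would match the symptom reported here.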
> > >
> > > Regards,
> > > Richard
> > >
> > > On Wednesday, 27.07.2022 at 17:09 -0400, Jeff Zemerick wrote:
> > > > Hi Richard,
> > > >
> > > > I know it's been a while, but I wanted to circle back to this to
> > > > see if there are any updates.
> > > >
> > > > Thanks,
> > > > Jeff
> > > >
> > > > On Mon, Apr 25, 2022 at 1:48 PM Richard Eckart de Castilho
> > > > <r...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > On 11. Apr 2022, at 14:50, Zowalla, Richard
> > > > > > <richard.zowa...@hs-heilbronn.de> wrote:
> > > > > > This works fine for mid-size corpora (it just needs a little
> > > > > > bit of RAM and time). However, we are running into the
> > > > > > exception mentioned in [1].
> > > > > > Debugging into the DataOutputStream reveals that this is a
> > > > > > limitation of java.io.DataOutputStream.
> > > > > >
> > > > > > Do we have any chance to solve this, or do we need to
> > > > > > implement custom readers/writers in order to get it to work?
> > > > >
> > > > > Looking at the OpenNLP 1.9.3 code, the relevant piece seems to
> > > > > be this:
> > > > >
> > > > > opennlp.tools.ml.maxent.io.GISModelWriter
> > > > > ----
> > > > > // the mapping from predicates to the outcomes they contributed to.
> > > > > // The sorting is done so that we actually can write this out more
> > > > > // compactly than as the entire list.
> > > > > ComparablePredicate[] sorted = sortValues();
> > > > > List<List<ComparablePredicate>> compressed = compressOutcomes(sorted);
> > > > >
> > > > > writeInt(compressed.size());
> > > > >
> > > > > for (List<ComparablePredicate> aCompressed : compressed) {
> > > > >   writeUTF(aCompressed.size() + ((List<?>) aCompressed).get(0).toString());
> > > > > }
> > > > > ----
> > > > >
> > > > > opennlp.tools.ml.model.ComparablePredicate
> > > > > ----
> > > > > public String toString() {
> > > > >   StringBuilder s = new StringBuilder();
> > > > >   for (int outcome : outcomes) {
> > > > >     s.append(" ").append(outcome);
> > > > >   }
> > > > >   return s.toString();
> > > > > }
> > > > > ----
> > > > >
> > > > > If I read it correctly, then the UTF-8-encoded list of outcomes
> > > > > that a single ComparablePredicate contributed to is larger than
> > > > > 383769 bytes.
> > > > >
> > > > > I'm not familiar with the code, but it seems strange to me that
> > > > > such a long list should be valid to start with.
> > > > > Maybe set a breakpoint and check if you have any *way too long*
> > > > > labels, or maybe too many labels in total?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > -- Richard
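The writeUTF limitation discussed in this thread can be demonstrated in isolation. The sketch below (Java 11+) is an illustration only, not the PR's actual fix, and the helper names are made up: DataOutputStream.writeUTF throws UTFDataFormatException for any string whose UTF-8 encoding exceeds 65535 bytes (its length prefix is an unsigned short), while a 4-byte int length prefix has no such cap, provided the reader mirrors the writer exactly:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class LargeStringIo {

    // Hypothetical helper: length-prefix the UTF-8 bytes with a 4-byte int
    // instead of writeUTF's 2-byte length, so strings over 65535 bytes survive.
    static void writeLargeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length); // 4-byte length prefix
        out.write(bytes);
    }

    // Hypothetical helper: must mirror the writer exactly, otherwise the
    // stream desynchronizes and later fields read back as garbage.
    static String readLargeString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String big = "x".repeat(100_000); // 100000 UTF-8 bytes, over the cap

        // writeUTF refuses strings whose UTF-8 form exceeds 65535 bytes.
        try (DataOutputStream out = new DataOutputStream(OutputStream.nullOutputStream())) {
            out.writeUTF(big);
            System.out.println("writeUTF succeeded");
        } catch (UTFDataFormatException e) {
            System.out.println("writeUTF failed: encoded string too long");
        }

        // The int-prefixed scheme round-trips the same string.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLargeString(new DataOutputStream(buf), big);
        String back = readLargeString(
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println("round-trip ok: " + big.equals(back));
    }
}
```

Note that any such change touches both sides of the format: the writer (GISModelWriter) and the reader (GISModelReader / AbstractModelReader) have to be updated in lockstep, which would be consistent with the symptom reported earlier in this thread, where writing succeeded but reading the resulting model failed.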