[ https://issues.apache.org/jira/browse/OPENNLP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624214#comment-17624214 ]
ASF GitHub Bot commented on OPENNLP-1366: ----------------------------------------- mawiesne commented on code in PR #427: URL: https://github.com/apache/opennlp/pull/427#discussion_r1005268338 ########## opennlp-tools/src/main/java/opennlp/tools/ml/model/ModelParameterChunker.java: ########## @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.ml.model; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.io.UTFDataFormatException; +import java.nio.ByteBuffer; +import java.nio.CharBuffer; +import java.nio.charset.CharsetEncoder; +import java.nio.charset.CoderResult; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.List; + +/** + * A helper class that handles String instances whose encoded length exceeds 64k (65535 bytes). + * This is achieved via the signature {@link #SIGNATURE_CHUNKED_PARAMS} at the beginning of + * the String instance to be written to a {@link DataOutputStream}. + * <p> + * Background: In OpenNLP, for large(r) corpora, we train models whose (UTF String) parameters will exceed + * the {@link #MAX_CHUNK_SIZE_BYTES} byte limit set in {@link DataOutputStream}. 
+ * For writing and reading those models, we have to chunk up those String instances into 64kB blocks and + * recombine them correctly upon reading a (binary) model file. + * <p> + * The problem was raised in <a href="https://issues.apache.org/jira/browse/OPENNLP-1366">ticket OPENNLP-1366</a>. + * <p> + * Solution strategy: + * <ul> + * <li>If writing parameters to a {@link DataOutputStream} blows up with a {@link UTFDataFormatException}, a + * large String instance is chunked up and written in appropriately sized blocks.</li> + * <li>To indicate that chunking was conducted, we start with the {@link #SIGNATURE_CHUNKED_PARAMS} indicator, + * directly followed by the number of chunks used. This way, when reading in chunked model parameters, + * recombination is achieved transparently.</li> + * </ul> + * <p> + * Note: Both existing (binary) model files and newly trained models that don't require the chunking + * technique are supported as in previous OpenNLP versions. + * + * @author <a href="mailto:martin.wies...@hs-heilbronn.de">Martin Wiesner</a> + * @author <a href="mailto:strub...@apache.org">Mark Struberg</a> + */ +public final class ModelParameterChunker { + + /* + * A signature that denotes the start of a String that required chunking. + * + * Semantics: + * If a model parameter (String) carries the below signature at the very beginning, this indicates + * that 'n > 1' chunks must be processed to obtain the whole model parameter. Without chunking, such + * parameters would not be written to the binary model files (as reported in OPENNLP-1366) when training + * occurs on large corpora, as used, for instance, in the context of (very large) German NLP models. + */ + public static final String SIGNATURE_CHUNKED_PARAMS = "CHUNKED-MODEL-PARAMS:"; // followed by the number of chunks! 
+ + private static final int MAX_CHUNK_SIZE_BYTES = 65535; // the maximum 'utflen' DataOutputStream can handle + + private ModelParameterChunker() { + // private constructor of utility class; no instantiation + } + + /** + * Reads model parameters from {@code dis}. In case the stream starts with {@link #SIGNATURE_CHUNKED_PARAMS}, + * the number of chunks is detected and the original large parameter string is reconstructed from several + * chunks. + * + * @param dis The stream from which the model parameters are read. + */ + public static String readUTF(DataInputStream dis) throws IOException { + String data = dis.readUTF(); + if (data.startsWith(SIGNATURE_CHUNKED_PARAMS)) { + String chunkElements = data.replace(SIGNATURE_CHUNKED_PARAMS, ""); + int chunkCount = Integer.parseInt(chunkElements); + StringBuilder sb = new StringBuilder(); + for (int i = 0; i < chunkCount; i++) { + sb.append(dis.readUTF()); + } + return sb.toString(); // the reconstructed model parameter string + } else { // default case: no chunked data -> just return the read data / parameter information + return data; + } + } + + /** + * Writes the model parameter {@code s} to {@code dos}. In case the encoded length of {@code s} exceeds + * {@link #MAX_CHUNK_SIZE_BYTES}, the chunking mechanism is used; otherwise the parameter is + * written 'as is'. + * + * @param dos The {@link DataOutputStream} which will be used to persist the model. + * @param s The input string that is checked for length and chunked if {@link #MAX_CHUNK_SIZE_BYTES} is + * exceeded. + */ + public static void writeUTF(DataOutputStream dos, String s) throws IOException { + try { + dos.writeUTF(s); + } catch (UTFDataFormatException dfe) { Review Comment: @jzonthemtn In general, this looks entirely plausible. Yet, there are encoding-specific details in the (modified) UTF-8 format, so the byte length of `s` cannot simply be derived from its character count in all cases. 
Thus, @struberg and I decided to leave it to the actual implementation of `DataOutputStream.writeUTF(...)` to tell us if the string contains too many characters. If interested, have a look at the code there for the "difficult" part. We decided not to copy'n'paste (cnp) the code over to OpenNLP, but to rely on the checks made in `DataOutputStream`. I hope you can follow our decision to "avoid cnp" in favor of catching `UTFDataFormatException` here once, in case the encoded length exceeds 64k. This seems acceptable for those "rare" occasions when training with large corpora. As a background: the 'Tueba Wikipedia' treebank only runs into this exception twice; other (well-known) German corpora, like the one from Hamburg I also used for validation, hit it 3 or 4 times during training. > Training of MaxEnt Model with large corpora fails with > java.io.UTFDataFormatException > -------------------------------------------------------------------------------------- > > Key: OPENNLP-1366 > URL: https://issues.apache.org/jira/browse/OPENNLP-1366 > Project: OpenNLP > Issue Type: Improvement > Components: Lemmatizer > Affects Versions: 1.9.4, 2.0.0 > Reporter: Richard Zowalla > Assignee: Jeff Zemerick > Priority: Major > Labels: pull-request-available > > As written on > [dev@opennlp.a.o|https://lists.apache.org/thread/vc5lfzj81tco703noqxpvy8sfj8fw8b1], > we are working on training a large OpenNLP MaxEnt model for lemmatizing > German texts. We use a Wikipedia treebank from Tübingen. > This consumes > 2 TB of RAM during training but will finish after some time. > However, writing this model will result in a java.io.UTFDataFormatException. > Unfortunately, training such a big model isn't feasible for debugging. 
Fortunately, a > similar issue with a smaller dataset can be found on Stack Overflow: > [https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long] > > It contains the OpenNLP CLI command to train a lemmatizer on a much smaller > dataset. > The stacktrace is raised while writing a String as UTF in DataOutputStream, > which has a hard-coded size limitation in the JDK (for reasons beyond my > knowledge ;)) > Stacktrace: > {code:java} > java.io.UTFDataFormatException: encoded string too long: 383769 bytes > at > java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:364) > at > java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:323) > at > opennlp.tools.ml.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:71) > at > opennlp.tools.ml.maxent.io.GISModelWriter.persist(GISModelWriter.java:97) > at > opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75) > at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29) > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597) > at opennlp.tools.cmdline.CmdLineUtil.writeModel(CmdLineUtil.java:182) > at > opennlp.tools.cmdline.lemmatizer.LemmatizerTrainerTool.run(LemmatizerTrainerTool.java:77) > at opennlp.tools.cmdline.CLI.main(CLI.java:256) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
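Editor's note: the 64k limit discussed in this ticket is easy to reproduce outside of OpenNLP. The following minimal, self-contained sketch (the class name `Utf64kLimitDemo` is made up for illustration) triggers the same `UTFDataFormatException` as the reported stacktrace, because `DataOutputStream.writeUTF` stores the encoded length in an unsigned 16-bit `utflen` field:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class Utf64kLimitDemo {
    public static void main(String[] args) throws IOException {
        // ASCII characters encode to 1 byte each in (modified) UTF-8,
        // so 70,000 chars exceed the 65535-byte 'utflen' field.
        String big = "x".repeat(70_000);
        DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream());
        try {
            dos.writeUTF(big);
        } catch (UTFDataFormatException e) {
            // same failure mode as in the stacktrace above
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Note that the limit applies to the *encoded* byte length, not the character count, which is exactly why the PR catches the exception instead of pre-checking `s.length()`.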
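Editor's note: the write-then-catch-then-chunk strategy described in the review can be sketched as follows. This is a simplified illustration, not the actual OpenNLP implementation: it splits on character boundaries with a fixed chunk size, which is only guaranteed to stay under 65535 encoded bytes for 1-byte (ASCII-range) characters; the real helper has to account for multi-byte encodings, as mawiesne points out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class ChunkedUtfSketch {
    static final String SIGNATURE = "CHUNKED-MODEL-PARAMS:";
    static final int MAX_CHARS = 65535; // safe only for 1-byte characters!

    static void writeChunked(DataOutputStream dos, String s) throws IOException {
        try {
            dos.writeUTF(s); // fast path: parameter fits in one UTF record
        } catch (UTFDataFormatException e) {
            // too long: announce the chunk count, then write the pieces
            int chunks = (s.length() + MAX_CHARS - 1) / MAX_CHARS;
            dos.writeUTF(SIGNATURE + chunks);
            for (int i = 0; i < chunks; i++) {
                dos.writeUTF(s.substring(i * MAX_CHARS,
                        Math.min((i + 1) * MAX_CHARS, s.length())));
            }
        }
    }

    static String readChunked(DataInputStream dis) throws IOException {
        String data = dis.readUTF();
        if (!data.startsWith(SIGNATURE)) {
            return data; // plain, un-chunked record
        }
        // recombine the announced number of chunks transparently
        int chunks = Integer.parseInt(data.substring(SIGNATURE.length()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(dis.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String big = "param".repeat(40_000); // 200,000 chars, well over 64k bytes
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeChunked(new DataOutputStream(buf), big);
        String back = readChunked(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(big.equals(back)); // prints "true"
    }
}
```

One caveat this sketch shares with the scheme in the PR: a legitimate parameter string that happens to start with the signature would be misinterpreted on read, so the signature must be chosen to be impossible (or vanishingly unlikely) in real model parameters.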