[ https://issues.apache.org/jira/browse/OPENNLP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624214#comment-17624214 ]
ASF GitHub Bot commented on OPENNLP-1366: ----------------------------------------- mawiesne commented on code in PR #427: URL: https://github.com/apache/opennlp/pull/427#discussion_r1005268338 ########## opennlp-tools/src/main/java/opennlp/tools/ml/model/ModelParameterChunker.java: ########## @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package opennlp.tools.ml.model; + +import java.io.DataInputStream; +import java.io.DataOutputStream; +import java.io.IOException; +import java.io.UTFDataFormatException; +import java.nio.ByteBuffer; +import java.nio.CharBuffer; +import java.nio.charset.CharsetEncoder; +import java.nio.charset.CoderResult; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.List; + +/** + * A helper class that handles String instances whose encoded length exceeds 64k (65535 bytes). + * This is achieved via the signature {@link #SIGNATURE_CHUNKED_PARAMS} at the beginning of + * the String instance to be written to a {@link DataOutputStream}. + * <p> + * Background: In OpenNLP, for large(r) corpora, we train models whose (UTF String) parameters will exceed + * the {@link #MAX_CHUNK_SIZE_BYTES} byte limit set in {@link DataOutputStream}. 
+ * For writing and reading those models, we have to chunk up those String instances into 64kB blocks and + * recombine them correctly upon reading a (binary) model file. + * <p> + * The problem was raised in <a href="https://issues.apache.org/jira/browse/OPENNLP-1366">ticket OPENNLP-1366</a>. + * <p> + * Solution strategy: + * <ul> + * <li>If writing parameters to a {@link DataOutputStream} blows up with a {@link UTFDataFormatException}, a + * large String instance is chunked up and written in appropriately sized blocks.</li> + * <li>To indicate that chunking was conducted, we start with the {@link #SIGNATURE_CHUNKED_PARAMS} indicator, + * directly followed by the number of chunks used. This way, when reading in chunked model parameters, + * recombination is achieved transparently.</li> + * </ul> + * <p> + * Note: Both existing (binary) model files and newly trained models that don't require the chunking + * technique are supported as in previous OpenNLP versions. + * + * @author <a href="mailto:martin.wies...@hs-heilbronn.de">Martin Wiesner</a> + * @author <a href="mailto:strub...@apache.org">Mark Struberg</a> + */ +public final class ModelParameterChunker { + + /* + * A signature that denotes the start of a String that required chunking. + * + * Semantics: + * If a model parameter (String) carries the below signature at the very beginning, this indicates + * that 'n > 1' chunks must be processed to obtain the whole model parameter. Without chunking, such + * parameters would not be written to the binary model files (as reported in OPENNLP-1366) when training + * occurs on large corpora, as used, for instance, in the context of (very large) German NLP models. + */ + public static final String SIGNATURE_CHUNKED_PARAMS = "CHUNKED-MODEL-PARAMS:"; // followed by the number of chunks! 
+ + private static final int MAX_CHUNK_SIZE_BYTES = 65535; // the maximum 'utflen' DataOutputStream can handle + + private ModelParameterChunker() { + // private constructor of utility class; no instantiation + } + + /** + * Reads model parameters from {@code dis}. In case the stream starts with {@link #SIGNATURE_CHUNKED_PARAMS}, + * the number of chunks is detected and the original large parameter string is reconstructed from several + * chunks. + * + * @param dis The stream from which the model parameters are read. + */ + public static String readUTF(DataInputStream dis) throws IOException { + String data = dis.readUTF(); + if (data.startsWith(SIGNATURE_CHUNKED_PARAMS)) { + String chunkElements = data.replace(SIGNATURE_CHUNKED_PARAMS, ""); + int chunkCount = Integer.parseInt(chunkElements); + StringBuilder sb = new StringBuilder(); + for (int i = 0; i < chunkCount; i++) { + sb.append(dis.readUTF()); + } + return sb.toString(); // the reconstructed model parameter string + } else { // default case: no chunked data -> just return the read data / parameter information + return data; + } + } + + /** + * Writes the model parameter {@code s} to {@code dos}. In case the encoded length of {@code s} exceeds + * {@link #MAX_CHUNK_SIZE_BYTES}, the chunking mechanism is used; otherwise the parameter is + * written 'as is'. + * + * @param dos The {@link DataOutputStream} which will be used to persist the model. + * @param s The input string that is checked for length and chunked if {@link #MAX_CHUNK_SIZE_BYTES} is + * exceeded. + */ + public static void writeUTF(DataOutputStream dos, String s) throws IOException { + try { + dos.writeUTF(s); + } catch (UTFDataFormatException dfe) { Review Comment: @jzonthemtn In general, this looks entirely plausible. Yet, there are encoding-specific details in the (modified) UTF-8 format, so the byte length of `s` cannot simply be derived from its character count in all cases. 
Thus, @struberg and I decided to leave it to the actual implementation of `DataOutputStream.writeUTF(...)` to tell us if the string contains too many characters. If interested, have a look at the code there for the "difficult" part. We decided not to copy'n'paste (cnp) the code over to OpenNLP, but to rely on the checks made in `DataOutputStream`. I hope you can follow our decision to "avoid cnp" in favor of catching `UTFDataFormatException` here once, in case the encoded length exceeds 64k. This seems acceptable for those "rare" occasions when training with large corpora. As a background: the 'Tueba Wikipedia' treebank only runs into this exception twice; other (well-known) German corpora, like the one from Hamburg I also used for validation, hit it 3 or 4 times during training. > Training of MaxEnt Model with large corpora fails with > java.io.UTFDataFormatException > -------------------------------------------------------------------------------------- > > Key: OPENNLP-1366 > URL: https://issues.apache.org/jira/browse/OPENNLP-1366 > Project: OpenNLP > Issue Type: Improvement > Components: Lemmatizer > Affects Versions: 1.9.4, 2.0.0 > Reporter: Richard Zowalla > Assignee: Jeff Zemerick > Priority: Major > Labels: pull-request-available > > As written on > [dev@opennlp.a.o|https://lists.apache.org/thread/vc5lfzj81tco703noqxpvy8sfj8fw8b1], > we are working on training a large OpenNLP MaxEnt model for lemmatizing > German texts. We use a Wikipedia treebank from Tübingen. > This consumes > 2 TB of RAM during training but will finish after some time. > However, writing this model will result in a java.io.UTFDataFormatException. > Unfortunately, training such a big model isn't feasible for debugging. 
Fortunately, a > similar issue with a smaller dataset can be found on Stack Overflow: > [https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long] > > It contains the OpenNLP CLI command to train a lemmatizer on a much smaller > dataset. > The stacktrace is raised while writing a String as UTF in DataOutputStream, > which has a hard-coded size limitation in the JDK (for reasons beyond my > knowledge ;)) > Stacktrace: > {code:java} > java.io.UTFDataFormatException: encoded string too long: 383769 bytes > at > java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:364) > at > java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:323) > at > opennlp.tools.ml.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:71) > at > opennlp.tools.ml.maxent.io.GISModelWriter.persist(GISModelWriter.java:97) > at > opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75) > at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36) > at > opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29) > at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597) > at opennlp.tools.cmdline.CmdLineUtil.writeModel(CmdLineUtil.java:182) > at > opennlp.tools.cmdline.lemmatizer.LemmatizerTrainerTool.run(LemmatizerTrainerTool.java:77) > at opennlp.tools.cmdline.CLI.main(CLI.java:256) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
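Editor's note: the 64k limit discussed in this ticket is easy to reproduce outside of OpenNLP. The following minimal, self-contained sketch (the class name `Utf64kLimitDemo` is made up for illustration) triggers the same `UTFDataFormatException` as the reported stacktrace, because `DataOutputStream.writeUTF` stores the encoded length in an unsigned 16-bit `utflen` field:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class Utf64kLimitDemo {
    public static void main(String[] args) throws IOException {
        // ASCII characters encode to 1 byte each in (modified) UTF-8,
        // so 70,000 chars exceed the 65535-byte 'utflen' field.
        String big = "x".repeat(70_000);
        DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream());
        try {
            dos.writeUTF(big);
        } catch (UTFDataFormatException e) {
            // same failure mode as in the stacktrace above
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Note that the limit applies to the *encoded* byte length, not the character count, which is exactly why the PR catches the exception instead of pre-checking `s.length()`.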
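Editor's note: the write-then-catch-then-chunk strategy described in the review can be sketched as follows. This is a simplified illustration, not the actual OpenNLP implementation: it splits on character boundaries with a fixed chunk size, which is only guaranteed to stay under 65535 encoded bytes for 1-byte (ASCII-range) characters; the real helper has to account for multi-byte encodings, as mawiesne points out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class ChunkedUtfSketch {
    static final String SIGNATURE = "CHUNKED-MODEL-PARAMS:";
    static final int MAX_CHARS = 65535; // safe only for 1-byte characters!

    static void writeChunked(DataOutputStream dos, String s) throws IOException {
        try {
            dos.writeUTF(s); // fast path: parameter fits in one UTF record
        } catch (UTFDataFormatException e) {
            // too long: announce the chunk count, then write the pieces
            int chunks = (s.length() + MAX_CHARS - 1) / MAX_CHARS;
            dos.writeUTF(SIGNATURE + chunks);
            for (int i = 0; i < chunks; i++) {
                dos.writeUTF(s.substring(i * MAX_CHARS,
                        Math.min((i + 1) * MAX_CHARS, s.length())));
            }
        }
    }

    static String readChunked(DataInputStream dis) throws IOException {
        String data = dis.readUTF();
        if (!data.startsWith(SIGNATURE)) {
            return data; // plain, un-chunked record
        }
        // recombine the announced number of chunks transparently
        int chunks = Integer.parseInt(data.substring(SIGNATURE.length()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(dis.readUTF());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String big = "param".repeat(40_000); // 200,000 chars, well over 64k bytes
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeChunked(new DataOutputStream(buf), big);
        String back = readChunked(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(big.equals(back)); // prints "true"
    }
}
```

One caveat this sketch shares with the scheme in the PR: a legitimate parameter string that happens to start with the signature would be misinterpreted on read, so the signature must be chosen to be impossible (or vanishingly unlikely) in real model parameters.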