rzo1 commented on code in PR #1057:
URL: https://github.com/apache/opennlp/pull/1057#discussion_r3363199944

##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/normalizer/SpellCheckingCharSequenceNormalizer.java:
##########
@@ -0,0 +1,398 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.normalizer;
+
+import java.util.List;
+import java.util.Locale;
+import java.util.Objects;
+import java.util.regex.Pattern;
+
+import opennlp.spellcheck.SpellChecker;
+import opennlp.spellcheck.SuggestItem;
+import opennlp.spellcheck.Verbosity;
+import opennlp.spellcheck.dictionary.SymSpellModel;
+import opennlp.tools.util.normalizer.AggregateCharSequenceNormalizer;
+import opennlp.tools.util.normalizer.CharSequenceNormalizer;
+
+/**
+ * A {@link CharSequenceNormalizer} that corrects spelling in text using a
+ * {@link SpellChecker} (typically a SymSpell engine).
+ *
+ * <p>The normalizer works in one of two {@linkplain Mode modes}:</p>
+ * <ul>
+ * <li>{@link Mode#PER_TOKEN PER_TOKEN} (default) – the input is split into
+ * whitespace-delimited tokens and each token is corrected independently with
+ * {@link SpellChecker#lookup}. The original whitespace runs between tokens are
+ * preserved verbatim, so the shape of the line is kept. Tokens the dictionary
+ * already contains (best suggestion at edit distance {@code 0}) are left
+ * untouched.</li>
+ * <li>{@link Mode#COMPOUND COMPOUND} – the whole input is passed to
+ * {@link SpellChecker#lookupCompound}, which additionally repairs wrongly
+ * inserted or omitted spaces (word splits and merges). This collapses runs of
+ * whitespace to single spaces, as the compound corrector re-tokenizes the
+ * input.</li>
+ * </ul>
+ *
+ * <p>Several guards keep the corrector from "fixing" tokens that should be left as
+ * they are (configurable through the {@link Builder}):</p>
+ * <ul>
+ * <li>tokens shorter than {@code minTokenLength} are skipped;</li>
+ * <li>numeric tokens are skipped ({@code skipNumbers}, on by default);</li>
+ * <li>URL- and email-like tokens are skipped ({@code skipUrls}, on by default);</li>
+ * <li>a token whose lower-cased form is already in the dictionary is never
+ * changed (the engine returns it at edit distance {@code 0}).</li>
+ * </ul>
+ *
+ * <p><b>Casing.</b> Dictionaries are normally lower-cased, so lookups are performed on
+ * the lower-cased token, and the original casing pattern is re-applied to the
+ * correction: an all-upper token yields an all-upper correction, a leading-capital
+ * token yields a leading-capital correction, otherwise the suggestion's own casing is
+ * used. When no correction applies, the original token (including its casing and any
+ * surrounding punctuation) is emitted unchanged.</p>
+ *
+ * <p>This normalizer composes cleanly inside an
+ * {@link AggregateCharSequenceNormalizer}; place it after noise-removing normalizers
+ * (URL, emoji, shrink) so it sees clean tokens.</p>
+ *
+ * <p><b>Serialization.</b> {@link CharSequenceNormalizer} is {@link java.io.Serializable},
+ * but the backing {@link SpellChecker} usually is not; it is therefore held in a
+ * {@code transient} field and is {@code null} after Java deserialization. A deserialized
+ * instance is inert until a checker is re-attached: obtain a working copy with the same
+ * settings via {@link #withSpellChecker(SpellChecker)} (this matches how the engine is
+ * rebuilt from a model rather than Java-serialized). Calling {@link #normalize} on an
+ * instance with no checker throws {@link IllegalStateException}.</p>
+ */
+public class SpellCheckingCharSequenceNormalizer implements CharSequenceNormalizer {
+
+  private static final long serialVersionUID = 1L;

Review Comment:
   This is inherited from `CharSequenceNormalizer`.
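
For readers following the Javadoc in this hunk, the `PER_TOKEN` and casing rules it describes can be sketched roughly as below. This is only an illustration under stated assumptions, not the code under review: the class `PerTokenSketch`, its methods, and the `lookup` function (a stand-in for a dictionary lookup that returns `null` when no correction applies) are hypothetical, and the sketch ignores the guards and the punctuation handling the Javadoc mentions.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of OpenNLP or of this PR. */
public class PerTokenSketch {

  // A token is any maximal run of non-whitespace characters.
  private static final Pattern TOKEN = Pattern.compile("\\S+");

  /** Re-applies the casing pattern of {@code original} to {@code suggestion}. */
  static String applyCasing(String original, String suggestion) {
    if (original.isEmpty() || suggestion.isEmpty()) {
      return suggestion;
    }
    boolean hasLetter = original.chars().anyMatch(Character::isLetter);
    // All-upper token: upper-case the whole suggestion.
    if (hasLetter && original.equals(original.toUpperCase(Locale.ROOT))) {
      return suggestion.toUpperCase(Locale.ROOT);
    }
    // Leading-capital token: capitalize the first character only.
    if (Character.isUpperCase(original.charAt(0))) {
      return Character.toUpperCase(suggestion.charAt(0)) + suggestion.substring(1);
    }
    // Otherwise keep the suggestion's own casing.
    return suggestion;
  }

  /**
   * Corrects each whitespace-delimited token independently and copies the
   * whitespace runs between tokens verbatim.
   *
   * @param lookup assumed stand-in for a dictionary lookup on the lower-cased
   *               token; returns {@code null} when no correction applies
   */
  static String normalize(CharSequence text, UnaryOperator<String> lookup) {
    Matcher m = TOKEN.matcher(text);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(text, last, m.start()); // whitespace run, unchanged
      String token = m.group();
      String suggestion = lookup.apply(token.toLowerCase(Locale.ROOT));
      out.append(suggestion == null ? token : applyCasing(token, suggestion));
      last = m.end();
    }
    out.append(text, last, text.length()); // trailing whitespace, if any
    return out.toString();
  }
}
```

With a lookup backed by a small map, for example `Map.of("helo", "hello")` passed as `fixes::get`, tokens missing from the map are emitted unchanged and the spacing between tokens is preserved.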
