Re: [PR] OPENNLP-1832: Add SymSpell-based SpellChecker component (opennlp)

via GitHub Fri, 05 Jun 2026 07:10:50 -0700


rzo1 commented on code in PR #1057:
URL: https://github.com/apache/opennlp/pull/1057#discussion_r3363199944



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/normalizer/SpellCheckingCharSequenceNormalizer.java:
##########
@@ -0,0 +1,398 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.normalizer;
+
+import java.util.List;
+import java.util.Locale;
+import java.util.Objects;
+import java.util.regex.Pattern;
+
+import opennlp.spellcheck.SpellChecker;
+import opennlp.spellcheck.SuggestItem;
+import opennlp.spellcheck.Verbosity;
+import opennlp.spellcheck.dictionary.SymSpellModel;
+import opennlp.tools.util.normalizer.AggregateCharSequenceNormalizer;
+import opennlp.tools.util.normalizer.CharSequenceNormalizer;
+
+/**
+ * A {@link CharSequenceNormalizer} that corrects spelling in text using a
+ * {@link SpellChecker} (typically a SymSpell engine).
+ *
+ * <p>The normalizer works in one of two {@linkplain Mode modes}:</p>
+ * <ul>
+ *   <li>{@link Mode#PER_TOKEN PER_TOKEN} (default) &ndash; the input is split into
+ *       whitespace-delimited tokens and each token is corrected independently with
+ *       {@link SpellChecker#lookup}. The original whitespace runs between tokens are
+ *       preserved verbatim, so the shape of the line is kept. Tokens the dictionary
+ *       already contains (best suggestion at edit distance {@code 0}) are left
+ *       untouched.</li>
+ *   <li>{@link Mode#COMPOUND COMPOUND} &ndash; the whole input is passed to
+ *       {@link SpellChecker#lookupCompound}, which additionally repairs wrongly
+ *       inserted or omitted spaces (word splits and merges). This collapses runs of
+ *       whitespace to single spaces, as the compound corrector re-tokenizes the
+ *       input.</li>
+ * </ul>
+ *
+ * <p>Several guards keep the corrector from "fixing" tokens that should be left as
+ * they are (configurable through the {@link Builder}):</p>
+ * <ul>
+ *   <li>tokens shorter than {@code minTokenLength} are skipped;</li>
+ *   <li>numeric tokens are skipped ({@code skipNumbers}, on by default);</li>
+ *   <li>URL- and email-like tokens are skipped ({@code skipUrls}, on by default);</li>
+ *   <li>a token whose lower-cased form is already in the dictionary is never
+ *       changed (the engine returns it at edit distance {@code 0}).</li>
+ * </ul>
+ *
+ * <p><b>Casing.</b> Dictionaries are normally lower-cased, so lookups are performed on
+ * the lower-cased token, and the original casing pattern is re-applied to the
+ * correction: an all-upper token yields an all-upper correction, a leading-capital
+ * token yields a leading-capital correction, otherwise the suggestion's own casing is
+ * used. When no correction applies, the original token (including its casing and any
+ * surrounding punctuation) is emitted unchanged.</p>
+ *
+ * <p>This normalizer composes cleanly inside an
+ * {@link AggregateCharSequenceNormalizer}; place it after noise-removing normalizers
+ * (URL, emoji, shrink) so it sees clean tokens.</p>
+ *
+ * <p><b>Serialization.</b> {@link CharSequenceNormalizer} is {@link java.io.Serializable},
+ * but the backing {@link SpellChecker} usually is not; it is therefore held in a
+ * {@code transient} field and is {@code null} after Java deserialization. A deserialized
+ * instance is inert until a checker is re-attached: obtain a working copy with the same
+ * settings via {@link #withSpellChecker(SpellChecker)} (this matches how the engine is
+ * rebuilt from a model rather than Java-serialized). Calling {@link #normalize} on an
+ * instance with no checker throws {@link IllegalStateException}.</p>
+ */
+public class SpellCheckingCharSequenceNormalizer implements CharSequenceNormalizer {
+
+  private static final long serialVersionUID = 1L;

Review Comment:
   This is inherited from `CharSequenceNormalizer`.

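For context on the serialization paragraph in the quoted Javadoc, here is a minimal, self-contained sketch of the transient-field pattern it describes. `Checker`, `SketchNormalizer`, `withChecker`, and `roundTrip` are hypothetical stand-ins, not the PR's classes; only the behavior (checker is `null` after deserialization, `IllegalStateException` until one is re-attached, settings carried over to the copy) follows the Javadoc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a spell checker that is not Serializable.
interface Checker {
  String correct(String token);
}

// Hypothetical stand-in for the normalizer under review.
class SketchNormalizer implements Serializable {

  private static final long serialVersionUID = 1L;

  private final int minTokenLength;         // plain setting, survives serialization
  private final transient Checker checker;  // skipped by serialization, null afterwards

  SketchNormalizer(Checker checker, int minTokenLength) {
    this.checker = checker;
    this.minTokenLength = minTokenLength;
  }

  // Returns a working copy with the same settings and the given checker attached.
  SketchNormalizer withChecker(Checker newChecker) {
    return new SketchNormalizer(newChecker, minTokenLength);
  }

  String normalize(String token) {
    if (checker == null) {
      throw new IllegalStateException("No checker attached; call withChecker first");
    }
    return token.length() < minTokenLength ? token : checker.correct(token);
  }

  // Serializes and deserializes an instance in memory.
  static SketchNormalizer roundTrip(SketchNormalizer normalizer) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(normalizer);
      }
      try (ObjectInputStream in =
               new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (SketchNormalizer) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new RuntimeException(e);
    }
  }
}
```

In this sketch a round-tripped instance rejects `normalize` until `withChecker` supplies a new checker, while `minTokenLength` is preserved across the round trip.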

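The PER_TOKEN behavior and the casing rule from the quoted Javadoc can be illustrated in the same spirit. This is a sketch under assumptions, not the PR's implementation: `CasingSketch`, `reapplyCasing`, and `correctTokens` are hypothetical names, and a plain `Function<String, String>` that returns `null` when no correction applies stands in for the `SpellChecker` lookup.

```java
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of per-token correction with casing re-applied.
class CasingSketch {

  private static final Pattern TOKEN = Pattern.compile("\\S+");

  // Re-applies the casing pattern of the original token to the suggestion.
  static String reapplyCasing(String original, String suggestion) {
    if (original.isEmpty() || suggestion.isEmpty()) {
      return suggestion;
    }
    boolean hasLetter = original.chars().anyMatch(Character::isLetter);
    if (hasLetter && original.equals(original.toUpperCase(Locale.ROOT))) {
      return suggestion.toUpperCase(Locale.ROOT);          // all-upper token
    }
    if (Character.isUpperCase(original.charAt(0))) {
      return Character.toUpperCase(suggestion.charAt(0))   // leading-capital token
          + suggestion.substring(1);
    }
    return suggestion;                                     // keep the suggestion's casing
  }

  // Corrects each whitespace-delimited token; whitespace runs are copied verbatim.
  static String correctTokens(String text, Function<String, String> lookup) {
    StringBuilder out = new StringBuilder(text.length());
    Matcher m = TOKEN.matcher(text);
    int last = 0;
    while (m.find()) {
      out.append(text, last, m.start());
      String token = m.group();
      String lower = token.toLowerCase(Locale.ROOT);
      String suggestion = lookup.apply(lower);
      if (suggestion == null || suggestion.equals(lower)) {
        out.append(token);                                 // no correction: emit unchanged
      } else {
        out.append(reapplyCasing(token, suggestion));
      }
      last = m.end();
    }
    out.append(text, last, text.length());                 // trailing whitespace, if any
    return out.toString();
  }
}
```

With a lookup that maps only `teh` to `the`, `correctTokens("Teh  quick\tTEH teh", lookup)` yields `"The  quick\tTHE the"`: the double space and the tab are kept, and each correction takes the casing of the token it replaces.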

