Re: [PR] OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) (opennlp)

via GitHub Sat, 20 Jun 2026 08:02:23 -0700


Copilot commented on code in PR #1104:
URL: https://github.com/apache/opennlp/pull/1104#discussion_r3446747424



##########
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java:
##########
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Runs the official Unicode {@code WordBreakTest.txt} conformance suite 
against
+ * {@link WordSegmenter}. Each line marks boundaries with U+00F7 (division 
sign) and non-boundaries
+ * with U+00D7 (multiplication sign) between code points.
+ */
+public class WordBoundaryConformanceTest {
+
+  private static final int BOUNDARY = 0x00F7; // division sign
+  private static final int NO_BOUNDARY = 0x00D7; // multiplication sign

Review Comment:
   `NO_BOUNDARY` is declared but never used, which adds dead code noise to this 
test. Removing it keeps the intent clear and avoids unused-constant drift.



##########
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java:
##########
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.tokenize.uax29;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Runs the official Unicode {@code WordBreakTest.txt} conformance suite 
against
+ * {@link WordSegmenter}. Each line marks boundaries with U+00F7 (division 
sign) and non-boundaries
+ * with U+00D7 (multiplication sign) between code points.
+ */
+public class WordBoundaryConformanceTest {
+
+  private static final int BOUNDARY = 0x00F7; // division sign
+  private static final int NO_BOUNDARY = 0x00D7; // multiplication sign
+
+  @Test
+  void testOfficialUnicodeWordBreakConformance() throws IOException {
+    int total = 0;
+    int passed = 0;
+    final List<String> failures = new ArrayList<>();
+
+    try (InputStream in = 
WordBoundaryConformanceTest.class.getResourceAsStream("WordBreakTest.txt");
+         BufferedReader reader =
+             new BufferedReader(new InputStreamReader(in, 
StandardCharsets.UTF_8))) {

Review Comment:
   `getResourceAsStream(...)` can return `null`. As written, that would surface 
as a `NullPointerException` inside `InputStreamReader`, which makes 
missing/relocated resources hard to diagnose. Guard the stream with an explicit 
null check (e.g., `Objects.requireNonNull`).



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfiles.java:
##########
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.HashMap;
+import java.util.Locale;
+import java.util.Map;
+import java.util.MissingResourceException;
+import java.util.Optional;
+import java.util.Set;
+
+import opennlp.tools.langdetect.LanguageDetector;
+import opennlp.tools.stemmer.snowball.SnowballStemmer;
+
+/**
+ * A registry of {@link NormalizationProfile}s by language, with 
detection-based fallback. This is
+ * the language dispatch the design note calls for: pick the profile for a 
requested language, or
+ * detect the language with a {@link LanguageDetector} when it is unspecified. 
The covered languages
+ * are exactly those with a Snowball stemmer.
+ *
+ * <p>Profiles are keyed by ISO 639-3 code (what {@link LanguageDetector} 
produces);
+ * {@link #forLanguage(String)} also accepts ISO 639-1 two-letter codes.</p>
+ */
+public final class NormalizationProfiles {
+
+  private static final Map<String, NormalizationProfile> BY_LANGUAGE = build();
+
+  private NormalizationProfiles() {
+  }
+
+  private static Map<String, NormalizationProfile> build() {
+    final Map<String, NormalizationProfile> map = new HashMap<>();
+    // The generic accent fold is used for English and the major Romance 
languages, German uses its
+    // own ae/oe/ue/ss fold, and folding is disabled elsewhere (Nordic, 
non-Latin) where diacritics
+    // mark distinct letters.
+    final CharSequenceNormalizer latin = 
AccentFoldCharSequenceNormalizer.getInstance();
+    final CharSequenceNormalizer german = 
GermanUmlautCharSequenceNormalizer.getInstance();
+    add(map, "ara", SnowballStemmer.ALGORITHM.ARABIC, null);
+    add(map, "cat", SnowballStemmer.ALGORITHM.CATALAN, latin);
+    add(map, "dan", SnowballStemmer.ALGORITHM.DANISH, null);
+    add(map, "deu", SnowballStemmer.ALGORITHM.GERMAN, german);
+    add(map, "ell", SnowballStemmer.ALGORITHM.GREEK, null);
+    add(map, "eng", SnowballStemmer.ALGORITHM.ENGLISH, latin);
+    add(map, "fin", SnowballStemmer.ALGORITHM.FINNISH, null);
+    add(map, "fra", SnowballStemmer.ALGORITHM.FRENCH, latin);
+    add(map, "gle", SnowballStemmer.ALGORITHM.IRISH, null);
+    add(map, "hun", SnowballStemmer.ALGORITHM.HUNGARIAN, null);
+    add(map, "ind", SnowballStemmer.ALGORITHM.INDONESIAN, null);
+    add(map, "ita", SnowballStemmer.ALGORITHM.ITALIAN, latin);
+    add(map, "nld", SnowballStemmer.ALGORITHM.DUTCH, null);
+    add(map, "nor", SnowballStemmer.ALGORITHM.NORWEGIAN, null);
+    add(map, "por", SnowballStemmer.ALGORITHM.PORTUGUESE, latin);
+    add(map, "ron", SnowballStemmer.ALGORITHM.ROMANIAN, null);
+    add(map, "rus", SnowballStemmer.ALGORITHM.RUSSIAN, null);
+    add(map, "spa", SnowballStemmer.ALGORITHM.SPANISH, latin);
+    add(map, "swe", SnowballStemmer.ALGORITHM.SWEDISH, null);
+    return Map.copyOf(map);
+  }
+
+  private static void add(Map<String, NormalizationProfile> map, String 
language,
+      SnowballStemmer.ALGORITHM algorithm, CharSequenceNormalizer accentFold) {
+    map.put(language, new NormalizationProfile(language, algorithm, 
accentFold));
+  }
+
+  /**
+   * Returns the profile for a language.
+   *
+   * @param language An ISO 639-3 or ISO 639-1 language code; case-insensitive.
+   * @return The profile, or empty if the language has no Snowball stemmer.
+   */
+  public static Optional<NormalizationProfile> forLanguage(String language) {
+    String code = language.strip().toLowerCase(Locale.ROOT);
+    if (code.length() == 2) {

Review Comment:
   `forLanguage` calls `language.strip()` directly, so a null argument will 
throw an unhelpful `NullPointerException`. Since this is a public entry point, 
consider an explicit null check to fail fast with a clear message.



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TermAnalyzer.java:
##########
@@ -0,0 +1,363 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.EnumMap;
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Locale;
+import java.util.Objects;
+import java.util.Set;
+
+import opennlp.tools.lemmatizer.Lemmatizer;
+import opennlp.tools.stemmer.Stemmer;
+import opennlp.tools.tokenize.uax29.WordTokenizer;
+import opennlp.tools.util.Span;
+
+/**
+ * Builds {@link Term}s by segmenting text and applying a configured stack of 
normalization
+ * {@link Dimension}s to each token. The analyzer is the configuration; each 
{@link Term} is the
+ * layered result for one token, with the configured dimensions computed 
eagerly and any other
+ * dimension computed lazily on first request.
+ *
+ * <p>Segmentation uses the Unicode {@linkplain WordTokenizer UAX&#160;#29 
word tokenizer}, so the
+ * input does not need to be pre-tokenized. The character-level dimensions 
({@link Dimension#NFC}
+ * through {@link Dimension#ACCENT_FOLD}) have built-in defaults; {@link 
Dimension#STEM} and
+ * {@link Dimension#LEMMA} are enabled by supplying a {@link Stemmer} or 
{@link Lemmatizer}.</p>
+ *
+ * <p>An instance is immutable and is thread-safe when its configured 
transforms are. The built-in
+ * character normalizers are stateless, but the Snowball stemmers are not, so 
an analyzer configured
+ * with a {@link Stemmer} (for example through {@link 
NormalizationProfile#searchAnalyzer()}) should
+ * not be shared across threads when {@link Dimension#STEM} is used. Build one 
with
+ * {@link #builder()}.</p>
+ */
+public final class TermAnalyzer {
+
+  private final List<Dimension> chain;
+  private final Dimension finalDimension;
+  private final EnumMap<Dimension, CharSequenceNormalizer> transforms;
+  private final Stemmer stemmer;
+  private final Lemmatizer lemmatizer;
+  private final WordTokenizer tokenizer;
+
+  private TermAnalyzer(Builder builder) {
+    final List<Dimension> ordered = new ArrayList<>(builder.chain);
+    Collections.sort(ordered); // canonical pipeline order (enum declaration 
order)
+    this.chain = List.copyOf(ordered);
+    this.finalDimension = ordered.isEmpty() ? Dimension.ORIGINAL : 
ordered.get(ordered.size() - 1);
+    // Only the per-analyzer overrides from the builder; the defaults live on 
Dimension itself.
+    this.transforms = new EnumMap<>(builder.transforms);
+    this.stemmer = builder.stemmer;
+    this.lemmatizer = builder.lemmatizer;
+    this.tokenizer = builder.tokenizer;
+  }
+
+  /**
+   * {@return a new builder}
+   */
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /**
+   * Segments {@code text} with the UAX&#160;#29 word tokenizer and returns 
one {@link Term} per
+   * word token, in order. The terms carry no part-of-speech tag, so {@link 
Dimension#LEMMA} is not
+   * available from them.
+   *
+   * @param text The text to analyze.
+   * @return The terms.
+   */
+  public List<Term> analyze(CharSequence text) {
+    final List<Span> spans = tokenizer.tokenizeSpans(text);
+    final List<Term> terms = new ArrayList<>(spans.size());
+    for (final Span span : spans) {
+      terms.add(new Term(this, span.getCoveredText(text).toString(), span, 
null));
+    }
+    return terms;
+  }
+
+  /**
+   * Returns one {@link Term} per supplied token, attaching the matching 
part-of-speech tag so that
+   * {@link Dimension#LEMMA} can be computed. The terms have no source span.
+   *
+   * @param tokens The tokens.
+   * @param tags   The part-of-speech tag for each token; must be the same 
length as {@code tokens}.
+   * @return The terms.
+   * @throws IllegalArgumentException if {@code tokens} and {@code tags} 
differ in length.
+   */
+  public List<Term> analyze(String[] tokens, String[] tags) {
+    if (tokens.length != tags.length) {
+      throw new IllegalArgumentException(
+          "tokens and tags must be the same length, got " + tokens.length + " 
and " + tags.length);
+    }
+    final List<Term> terms = new ArrayList<>(tokens.length);
+    for (int i = 0; i < tokens.length; i++) {
+      terms.add(new Term(this, tokens[i], null, tags[i]));
+    }
+    return terms;
+  }
+
+  /**
+   * {@return the configured dimensions that are computed eagerly, in 
canonical order} The list
+   * never includes {@link Dimension#ORIGINAL}, which is always present.
+   */
+  public List<Dimension> dimensions() {
+    return chain;
+  }
+
+  Dimension finalDimension() {
+    return finalDimension;
+  }
+
+  // Applies one dimension's transform to a single token value. Fails loudly 
when a token-level
+  // dimension was requested without the engine (or tag) it needs.
+  String apply(Dimension dimension, String input, String posTag) {
+    switch (dimension) {
+      case ORIGINAL:
+        return input;
+      case STEM:
+        if (stemmer == null) {
+          throw new IllegalStateException(
+              "Dimension STEM requires a Stemmer; configure it with 
builder().stem(...)");
+        }
+        return stemmer.stem(input).toString();
+      case LEMMA:
+        if (lemmatizer == null) {
+          throw new IllegalStateException(
+              "Dimension LEMMA requires a Lemmatizer; configure it with 
builder().lemmatize(...)");
+        }
+        if (posTag == null) {
+          throw new IllegalStateException(
+              "Dimension LEMMA requires a part-of-speech tag; use 
analyze(tokens, tags)");
+        }
+        return lemmatizer.lemmatize(new String[] {input}, new String[] 
{posTag})[0];
+      default:
+        // A builder override wins; otherwise the dimension's own default 
normalizer.
+        final CharSequenceNormalizer normalizer = 
transforms.containsKey(dimension)
+            ? transforms.get(dimension) : dimension.defaultNormalizer();
+        return normalizer.normalize(input).toString();
+    }
+  }
+
+  /** A builder for {@link TermAnalyzer}. */
+  public static final class Builder {
+
+    private final EnumSet<Dimension> chain = EnumSet.noneOf(Dimension.class);
+    private final EnumMap<Dimension, CharSequenceNormalizer> transforms =
+        new EnumMap<>(Dimension.class);
+    private Stemmer stemmer;
+    private Lemmatizer lemmatizer;
+    private WordTokenizer tokenizer = new WordTokenizer();
+
+    private Builder() {
+    }
+
+    /**
+     * Enables {@link Dimension#NFC}.
+     *
+     * @return this builder
+     */
+    public Builder nfc() {
+      chain.add(Dimension.NFC);
+      return this;
+    }
+
+    /**
+     * Enables {@link Dimension#NFKC}.
+     *
+     * @return this builder
+     */
+    public Builder nfkc() {
+      chain.add(Dimension.NFKC);
+      return this;
+    }
+
+    /**
+     * Enables {@link Dimension#WHITESPACE}.
+     *
+     * @return this builder
+     */
+    public Builder whitespace() {
+      chain.add(Dimension.WHITESPACE);
+      return this;
+    }
+
+    /**
+     * Enables {@link Dimension#WHITESPACE} with a specific normalizer, 
choosing the fold target and
+     * behavior. For a custom class and target use a {@link CharClass} method 
reference, for example
+     * {@code whitespace(CharClass.of(members, replacement)::collapse)}.
+     *
+     * @param normalizer The whitespace normalizer to use.
+     * @return this builder
+     */
+    public Builder whitespace(CharSequenceNormalizer normalizer) {
+      return transform(Dimension.WHITESPACE, normalizer);
+    }
+
+    /**
+     * Enables {@link Dimension#DASH}.
+     *
+     * @return this builder
+     */
+    public Builder dashes() {
+      chain.add(Dimension.DASH);
+      return this;
+    }
+
+    /**
+     * Enables {@link Dimension#DASH} with a specific normalizer (a custom 
dash set or target).
+     *
+     * @param normalizer The dash normalizer to use.
+     * @return this builder
+     */
+    public Builder dashes(CharSequenceNormalizer normalizer) {
+      return transform(Dimension.DASH, normalizer);
+    }
+
+    /**
+     * Enables {@link Dimension#CASE_FOLD}.
+     *
+     * @return this builder
+     */
+    public Builder caseFold() {
+      chain.add(Dimension.CASE_FOLD);
+      return this;
+    }
+
+    /**
+     * Enables {@link Dimension#CASE_FOLD} using the given locale's case rules 
(for example Turkish
+     * dotted/dotless i), instead of the default {@link Locale#ROOT}.
+     *
+     * @param locale The locale whose case rules to apply.
+     * @return this builder
+     */
+    public Builder caseFold(Locale locale) {
+      return transform(Dimension.CASE_FOLD, 
CaseFoldCharSequenceNormalizer.getInstance(locale));
+    }

Review Comment:
   `CaseFoldCharSequenceNormalizer.getInstance(locale)` will throw a 
`NullPointerException` if `locale` is null (via `Locale.ROOT.equals(locale)`), 
but the resulting stack trace won’t indicate which parameter was invalid. 
Adding an explicit null check provides a clearer failure mode for API callers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) (opennlp)

Reply via email to